Key LLM-Related Papers - 2026-04-08

1. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents


2. How LLMs Follow Instructions: Skillful Coordination, Not a Universal Mechanism


3. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis


4. Flowr – Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains


5. Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment


6. Context-Value-Action Architecture for Value-Driven Large Language Model Agents


7. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference


8. Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models


9. JTON: A Token-Efficient JSON Superset with Zen Grid Tabular Encoding for Large Language Models


10. When Do We Need LLMs? A Diagnostic for Language-Driven Bandits


11. Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring


12. Vision-Guided Iterative Refinement for Frontend Code Generation


13. Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents


14. Can Large Language Models Reinvent Foundational Algorithms?


15. LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo


16. CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control


17. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge


18. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation


19. COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration


20. From Large Language Model Predicates to Logic Tensor Networks: Neurosymbolic Offer Validation in Regulated Procurement


21. SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills


22. Experience Transfer for Multimodal LLM Agents in Minecraft Game


23. ActivityEditor: Learning to Synthesize Physically Valid Human Mobility


24. Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition


25. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models


26. Auditable Agents


27. Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning


28. Automated Auditing of Hospital Discharge Summaries for Care Transitions


29. CODESTRUCT: Code Agents over Structured Action Spaces


30. HYVE: Hybrid Views for LLM Context Engineering over Machine Data


31. Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval


32. LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection


33. TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems


34. ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning


35. From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs


36. Dynamic Agentic AI Expert Profiler System Architecture for Multidomain Intelligence Modeling


37. TRACE: Capability-Targeted Agentic Training


38. Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition


39. Attribution Bias in Large Language Models


40. ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces


41. Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems


42. IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents


43. Uncertainty-Guided Latent Diagnostic Trajectory Learning for Sequential Clinical Diagnosis


44. MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems


45. ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback


46. Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya


47. In-Place Test-Time Training


48. Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement


49. Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization


50. Gym-Anything: Turn any Software into an Agent Environment


51. Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery


52. LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering


53. Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives


54. LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces


55. Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning


56. Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles


57. A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models


58. CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments


59. The Model Agreed, But Didn’t Learn: Diagnosing Surface Compliance in Large Language Models


60. A Formal Security Framework for MCP-Based AI Agents: Threat Taxonomy, Verification Models, and Defense Mechanisms


61. Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution


62. “I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns?


63. Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts


64. What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say “I Don’t Know”


65. CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models


66. Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing



68. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion


69. From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems


70. LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals


71. Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening


72. Foundations for Agentic AI Investigations from the Forensic Analysis of OpenClaw


73. Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue


74. FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation – Full Version


75. Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system


76. Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis


77. On the Role of Fault Localization Context for LLM-Based Program Repair


78. LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency


79. MA-IDS: Multi-Agent RAG Framework for IoT Network Intrusion Detection with an Experience Library


80. LanG – A Governance-Aware Agentic AI Platform for Unified Security Operations


81. Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use


82. Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset


83. ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads


84. VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG


85. OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation


86. LLMs Should Express Uncertainty Explicitly


87. Spec Kit Agents: Context-Grounded Agentic Workflows


88. Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning


89. XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts


90. Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models


91. Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning


92. Planning to Explore: Curiosity-Driven Planning for LLM Test Generation


93. Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation


94. EffiPair: Improving the Efficiency of LLM-generated Code with Relative Contrastive Feedback


95. Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning


96. Watch Before You Answer: Learning from Visually Grounded Post-Training


97. $π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models


98. Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks


99. Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation


100. Nidus: Externalized Reasoning for AI-Assisted Engineering


101. This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA


102. Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space


103. StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing


104. Scaling Coding Agents via Atomic Skills


105. Comparative Characterization of KV Cache Management Strategies for LLM Inference


106. EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content


107. Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges


108. FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment


109. Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling


110. CURE: Circuit-Aware Unlearning for LLM-based Recommendation


111. MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation


112. The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown


113. Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity


114. Synthetic Trust Attacks: Modeling How Generative AI Manipulates Human Decisions in Social Engineering Fraud


115. Learning to Retrieve from Agent Trajectories


116. From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering


117. SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs


118. Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space


119. The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse


120. TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models


121. Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems