Key LLM-Related Papers - 2026-05-20

1. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents


2. Neurosymbolic Learning for Inference-Time Argumentation


3. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving


4. When Skills Don’t Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity


5. Probabilistic Tiny Recursive Model


6. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents


7. Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions


8. From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning


9. Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization


10. OpenComputer: Verifiable Software Worlds for Computer-Use Agents


11. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code


12. Memory-Augmented Reinforcement Learning Agent for CAD Generation


13. EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design


14. Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models


15. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents


16. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption


17. Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries


18. Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment


19. BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation


20. Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management


21. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents


22. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling


23. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning


24. Agentic Trading: When LLM Agents Meet Financial Markets


25. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization


26. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination


27. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses


28. SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents


29. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints


30. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents


31. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts


32. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On


33. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency


34. Evaluating the Utility of Personal Health Records in Personalized Health AI


35. Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production


36. Position: Let’s Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance


37. Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models


38. Less Back-and-Forth: A Comparative Study of Structured Prompting


39. Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding


40. ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions


41. What Do Evolutionary Coding Agents Evolve?


42. BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation


43. VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving


44. CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning


45. Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction


46. PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling


47. LLM Benchmark Datasets Should Be Contamination-Resistant


48. A Case for Agentic Tuning: From Documentation to Action in PostgreSQL


49. Block-Sphere Vector Quantization


50. Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes


51. A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits


52. Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models


53. FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding


54. Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation




57. Synthesis and Evaluation of Long-term History-aware Medical Dialogue


58. TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection


59. ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation


60. Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges


61. Measuring Safety Alignment Effects in Autonomous Security Agents


62. CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging


63. The Accessibility Capability Boundary: Operational Limits and Expansion Potential of AI-Generated Browser-Native Accessibility Systems


64. optimize_anything: A Universal API for Optimizing any Text Parameter


65. MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models


66. A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images


67. TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization


68. EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs


69. Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters


70. Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation


71. When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR


72. EmbGen: Teaching with Reassembled Corpora


73. The Evaluation Game: Beyond Static LLM Benchmarking


74. Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings


75. Toward User Comprehension Supports for LLM Agent Skill Specifications


76. Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay


77. PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies


78. HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models


79. RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding


80. Exploring and Developing a Pre-Model Safeguard with Draft Models


81. Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection


82. FormalASR: End-to-End Spoken Chinese to Formal Text


83. Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution


84. Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering


85. Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference


86. Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models


87. Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks


88. GRASP: Deterministic argument ranking in interaction graphs


89. EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data


90. FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models


91. ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models


92. Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German


93. Toward an AI-Powered Computational Testbed for Workforce Policy


94. Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs


95. Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks


96. HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation


97. OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences


98. ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense


99. DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs


100. Don’t Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target


101. Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits


102. To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents


103. ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models


104. Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning


105. MO-CAPO: Multi-Objective Cost-Aware Prompt Optimization


106. DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models


107. SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs


108. TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing


109. The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection


110. Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking


111. KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat


112. Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling


113. Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise


114. Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training


115. D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting


116. Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect


117. RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents


118. Theory-optimal Quantization Based on Flatness


119. ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning


120. HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models


121. Features have life history. And we should care


122. Can LLMs Emulate Human Belief Dynamics?


123. A Reproducibility Analysis of PO4ISR: Diagnosing and Mitigating Semantic Drift in LLM-Based Session Recommendation


124. M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models


125. Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization


126. Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI


127. ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation


128. STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation


129. From Intent to AI Pipelines: A Controlled Agentic Framework for Non-AI Expert Scientists


130. Query-Conditioned Graph Retrieval for Contextualized LLM Reasoning in Personalized Wearable Data


131. ALDEN: Boosting Private Data Extraction from Retrieval-Augmented Generation Systems via Active Learning and Distribution Estimation


132. Interoceptive Divergence in Aesthetic Evaluation and Implications for Human-AI Alignment