Key LLM-Related Papers - 2026-05-11

1. VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection


2. Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning


3. Abductive Reasoning with Probabilistic Commonsense


4. TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples


5. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents


6. Hierarchical Task Network Planning with LLM-Generated Heuristics


7. GASim: A Graph-Accelerated Hybrid Framework for Social Simulation


8. FactoryBench: Evaluating Industrial Machine Understanding


9. Open-Ended Task Discovery via Bayesian Optimization


10. From Pixels to Prompts: Vision-Language Models


11. Efficient Data Selection for Multimodal Models via Incremental Optimization Utility


12. GraphReAct: Reasoning and Acting for Multi-step Graph Inference


13. Tools as Continuous Flow for Evolving Agentic Reasoning


14. Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation


15. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training


16. SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model


17. Structured Role-Aware Policy Optimization for Multimodal Reasoning


18. Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning


19. EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation


20. HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization


21. Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents


22. ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning


23. 2.5-D Decomposition for LLM-Based Spatial Construction



25. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight


26. Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents


27. Self-Programmed Execution for Language-Model Agents


28. Mitigating Cognitive Bias in RLHF by Altering Rationality


29. How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem


30. Agentick: A Unified Benchmark for General Sequential Decision-Making Agents


31. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning


32. Towards Security-Auditable LLM Agents: A Unified Graph Representation


33. When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic–Actor Loop for Agentic Reasoning


34. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents


35. When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment


36. From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms


37. CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment


38. Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations


39. GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning


40. Flow-OPD: On-Policy Distillation for Flow Matching Models


41. The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents


42. CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation


43. Fast Byte Latent Transformer


44. Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph


45. Tool Calling is Linearly Readable and Steerable in Language Models


46. Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios


47. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation


48. Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions


49. KL for a KL: On-Policy Distillation with Control Variate Baseline


50. MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning


51. CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios


52. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning


53. Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs


54. Tracing Uncertainty in Language Model “Reasoning”


55. POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles


56. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs


57. SOD: Step-wise On-policy Distillation for Small Language Model Agents


58. LLM hallucinations in the wild: Large-scale evidence from non-existent citations


59. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models


60. The AI-Native Large-Scale Agile Software Development Manifesto


61. DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain


62. TRACE: Tourism Recommendation with Accountable Citation Evidence


63. Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models


64. Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation


65. LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation


66. Post-training makes large language models less human-like


67. Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators


68. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor’s Internal States


69. ProteinJEPA: Latent prediction complements protein language models



71. SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion


72. Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs


73. HBEE: Human Behavioral Entropy Engine – Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion


74. The Moltbook Files: A Harmless Slopocalypse or Humanity’s Last Experiment


75. Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs


76. Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study


77. Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts


78. BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning


79. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference


80. TTF: Temporal Token Fusion for Efficient Video-Language Model


81. Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate


82. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective


83. CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations


84. Activation Differences Reveal Backdoors: A Comparison of SAE Architectures


85. DCGL: Dual-Channel Graph Learning with Large Language Models for Knowledge-Aware Recommendation


86. BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation


87. MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs


88. EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams


89. Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions


90. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models


91. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference


92. Hallucination Detection via Activations of Open-Weight Proxy Analyzers


93. PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat


94. The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval


95. MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries


96. Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding


97. Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition


98. Structural Rationale Distillation via Reasoning Space Compression


99. Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR


100. Region4Web: Rethinking Observation Space Granularity for Web Agents


101. RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation


102. Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation


103. The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks


104. WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems


105. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training


106. Do Joint Audio-Video Generation Models Understand Physics?


107. MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments


108. An Interpretable and Scalable Framework for Evaluating Large Language Models


109. Cognitive Agent Compilation for Explicit Problem Solver Modeling


110. LensVLM: Selective Context Expansion for Compressed Visual Representation of Text


111. $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses


112. From Surface Learning to Deep Understanding: A Grounded AI Tutoring System for Moodle


113. Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing


114. In-Context Credit Assignment via the Core


115. Regulating Branch Parallelism in LLM Serving


116. Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents


117. MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text


118. MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes


119. Don’t Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment


120. How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment


121. LLM-Guided Open Hypothesis Learning from Autonomous Scanning Probe Microscopy Experiments


122. IntentGrasp: A Comprehensive Benchmark for Intent Understanding


123. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing


124. R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations


125. Gradient Extrapolation-Based Policy Optimization


126. Geometric Kolmogorov–Arnold Network (GeoKAN)


127. A Self-Healing Framework for Reliable LLM-Based Autonomous Agents


128. Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA


129. OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning


130. Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion


131. Agentic AI and the Industrialization of Cyber Offense: Forecast, Consequences, and Defensive Priorities for Enterprises and the Mittelstand


132. Visual Text Compression as Measure Transport


133. The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking


134. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models


135. Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs


136. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training


137. Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR