Major LLM-Related Papers - 2026-02-13

1. Agentic Test-Time Scaling for WebAgents


2. CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use


3. Think like a Scientist: Physics-guided LLM Agent for Equation Discovery


4. Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation


5. Statistical Parsing for Logical Information Retrieval


6. Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision


7. GPT-4o Lacks Core Features of Theory of Mind


8. Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning


9. STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction


10. Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment


11. The Pensieve Paradigm: Stateful Language Models Mastering Their Own Context


12. Differentiable Modal Logic for Multi-Agent Diagnosis, Orchestration and Communication


13. InjectRBP: Steering Large Language Model Reasoning Behavior via Pattern Injection


14. CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation


15. Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments


16. MEME: Modeling the Evolutionary Modes of Financial Markets


17. From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders


18. Talk2DM: Enabling Natural Language Querying and Commonsense Reasoning for Vehicle-Road-Cloud Integrated Dynamic Maps with Large Language Models


19. Prototype Transformer: Towards Language Model Architectures Interpretable by Design


20. Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models


21. Predicting LLM Output Length via Entropy-Guided Representations


22. Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation


23. FlowMind: Execute-Summarize for Structured Workflow Generation from LLM Reasoning


24. RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation


25. TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents


26. AIR: Improving Agent Safety through Incident Response


27. Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]


28. Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs


29. Beyond Parameter Arithmetic: Sparse Complementary Fusion for Distribution-Aware Model Merging


30. Beyond Pixels: Vector-to-Graph Transformation for Reliable Schematic Auditing


31. Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs


32. PhyNiKCE: A Neurosymbolic Agentic Framework for Autonomous Computational Fluid Dynamics


33. Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm


34. Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation


35. When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents


36. scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery


37. MAPLE: Modality-Aware Post-training and Learning Ecosystem


38. The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why – A Survey from MARL to Emergent Language and LLMs


39. Learning to Configure Agentic AI Systems


40. SemaPop: Semantic-Persona Conditioned Population Synthesis


41. Budget-Constrained Agentic Large Language Models: Intention-Based Planning for Costly Tool Use


42. AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems


43. Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning


44. ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences


45. Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization


46. AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition


47. Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge


48. The PBSAI Governance Ecosystem: A Multi-Agent AI Reference Architecture for Securing Enterprise AI Estates


49. UniT: Unified Multimodal Chain-of-Thought Test-time Scaling


50. AttentionRetriever: Attention Layers are Secretly Long Document Retrievers


51. Olmix: A Framework for Data Mixing Throughout LM Development


52. Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training


53. VIRENA: Virtual Arena for Research, Education, and Democratic Innovation


54. Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education


55. 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting


56. dVoting: Fast Voting for dLLMs


57. Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning


58. DeepSight: An All-in-One LM Safety Toolkit


59. Choose Your Agent: Tradeoffs in Adopting AI Advisors, Coaches, and Delegates in Multi-Party Negotiation


60. ModelWisdom: An Integrated Toolkit for TLA+ Model Visualization, Digest and Repair


61. Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?


62. Manifold-Aware Temporal Domain Generalization for Large Language Models


63. AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection


64. Who Does What? Archetypes of Roles Assigned to LLMs During Human-AI Decision-Making


65. Leveraging LLMs to support co-evolution between definitions and instances of textual DSLs: A Systematic Evaluation


66. Mitigating Mismatch within Reference-based Preference Optimization


67. Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems


68. Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception


69. Improving Neural Retrieval with Attribution-Guided Query Rewriting


70. Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing


71. MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling


72. Cooperation Breakdown in LLM Agents Under Communication Delays


73. Adapting Vision-Language Models for E-commerce Understanding at Scale


74. LLM-Driven 3D Scene Generation of Agricultural Simulation Environments


75. TabSieve: Explicit In-Table Evidence Selection for Tabular Prediction


76. PatientHub: A Unified Framework for Patient Simulation


77. SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving


78. LoRA-based Parameter-Efficient LLMs for Continuous Learning in Edge-based Malware Detection


79. ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning


80. ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation



82. Native Reasoning Models: Training Language Models to Reason on Unverifiable Data


83. Krause Synchronization Transformers


84. Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs


85. Differentially Private and Communication Efficient Large Language Model Split Inference via Stochastic Quantization and Soft Prompt


86. Multimodal Fact-Level Attribution for Verifiable Reasoning


87. RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis


88. Compiler-Guided Inference-Time Adaptation: Improving GPT-5 Programming Performance in Idris


89. Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety


90. The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods


91. Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification


92. When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing


93. CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis


94. HiFloat4 Format for Language Model Inference


95. How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?


96. DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks’ Developer Experience Through a Novel Relational Schema Mapping Task


97. Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy


98. KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models


99. From Instruction to Output: The Role of Prompting in Modern NLG


100. What Do LLMs Know About Alzheimer’s Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection


101. Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments


102. The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models


103. Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization


104. Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering


105. Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?


106. Assessing LLM Reliability on Temporally Recent Open-Domain Questions


107. Automated Optimization Modeling via a Localizable Error-Driven Perspective


108. HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents


109. Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation