Key LLM-Related Papers - 2026-02-04

1. Conformal Thinking: Risk Control for Reasoning on a Compute Budget


2. Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity


3. TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System


4. Mitigating Conversational Inertia in Multi-Turn Agents


5. Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12


6. EHRWorld: A Patient-Centric Medical World Model for Long-Horizon Clinical Trajectories


7. Persona Generators: Generating Diverse Synthetic Personas at Scale


8. When Routing Collapses: On the Degenerate Convergence of LLM Routers


9. IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning


10. Ontology-to-tools compilation for executable semantic constraint enforcement in LLM agents


11. DiscoverLLM: From Executing Intents to Discovering Them


12. Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility


13. GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer


14. MentalSeek-Dx: Towards Progressive Hypothetico-Deductive Reasoning for Real-world Psychiatric Diagnosis


15. Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis


16. CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs


17. LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios


18. Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning


19. The Necessity of a Unified Framework for LLM-Based Agent Evaluation


20. Beyond Quantity: Trajectory Diversity Scaling for Code Agents


21. VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models


22. Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration


23. Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis


24. Risky-Bench: Probing Agentic Safety Risks under Real-World Deployment


25. De-conflating Preference and Qualification: Constrained Dual-Perspective Reasoning for Job Recommendation with Large Language Models


26. MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems


27. Visual Reasoning over Time Series via Multi-Agent System


28. RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents


29. STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models


30. Distilling LLM Reasoning into Graph of Concept Predictors


31. Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment


32. Large Language Models Can Take False First Steps at Inference-time Planning


33. Are LLMs Biased Like Humans? Causal Reasoning as a Function of Prior Knowledge, Irrelevant Information, and Reasoning Budget


34. Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth


35. DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution


36. Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs


37. FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights


38. Aligning Language Model Benchmarks with Pairwise Preferences


39. “I May Not Have Articulated Myself Clearly”: Diagnosing Dynamic Instability in LLM Reasoning at Inference Time



41. AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents


42. Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing


43. Scaling-Aware Adapter for Structure-Grounded LLM Reasoning


44. Dynamic Mix Precision Routing for Efficient Multi-step LLM Interaction


45. ATLAS: Adaptive Self-Evolutionary Research Agent with Task-Distributed Multi-LLM Supporters


46. MARS: Modular Agent with Reflective Search for Automated AI Research


47. A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior


48. PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review


49. Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems


50. Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers


51. CreditAudit: 2D Auditing for LLM Evaluation and Selection


52. Accelerating Scientific Research with Gemini: Case Studies and Common Techniques


53. Antidistillation Fingerprinting


54. Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation


55. Efficient Estimation of Kernel Surrogate Models for Task Attribution


56. An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents


57. UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining


58. Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives


59. Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models


60. Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging


61. Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems


62. LLM-Inspired Pretrain-Then-Finetune for Small-Data, Large-Scale Optimization


63. Universal One-third Time Scaling in Learning Peaked Distributions


64. RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish


65. BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish



67. $V_0$: A Generalist Value Model for Any Policy at State Zero


68. Don’t believe everything you read: Understanding and Measuring MCP Behavior under Misleading Tool Descriptions


69. Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs


70. When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs


71. Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning


72. Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation


73. Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning


74. Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing


75. Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation


76. Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction


77. Precision in Practice: Knowledge Guided Code Summarizing Grounded in Industrial Expectations


78. On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models


79. MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling


80. MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning


81. Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models


82. R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?


83. POP: Prefill-Only Pruning for Efficient Large Model Inference


84. ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs


85. Reinforcement Learning with Promising Tokens for Large Language Models


86. Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning


87. Privasis: Synthesizing the Largest “Public” Private Dataset from Scratch


88. MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning


89. Internet of Agentic AI: Incentive-Compatible Distributed Teaming and Workflow


90. Self-Hinting Language Models Enhance Reinforcement Learning


91. SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass


92. Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery


93. Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost



95. Task–Specificity Score: Measuring How Much Instructions Really Matter for Supervision


96. TextME: Bridging Unseen Modalities Through Text Descriptions


97. The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers


98. Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals


99. CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs


100. Bongards at the Boundary of Perception and Reasoning: Programs or Language?


101. FedKRSO: Communication and Memory Efficient Federated Fine-Tuning of Large Language Models


102. VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering


103. NLI: Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference


104. Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding


105. Where Norms and References Collide: Evaluating LLMs on Normative Reasoning


106. Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning


107. Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness


108. HALT: Hallucination Assessment via Log-probs as Time series


109. Learning-Infused Formal Reasoning: From Contract Synthesis to Artifact Reuse and Formal Semantics


110. Scaling Small Agents Through Strategy Auctions


111. Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding


112. When Noise Lowers The Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models


113. Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors


114. BinaryPPO: Efficient Policy Optimization for Binary Classification


115. Monotonicity as an Architectural Bias for Robust Language Models


116. Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection


117. Performance of Small Language Model Pretraining on FABRIC: An Empirical Study


118. Trailer Reimagined: An Innovative, Llm-DRiven, Expressive Automated Movie Summary framework (TRAILDREAMS)


119. daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently


120. Exploring Silicon-Based Societies: An Early Study of the Moltbook Agent Community


121. Gender Dynamics and Homophily in a Social Network of LLM Agents


122. Fine-Tuning Language Models to Know What They Know


123. AI Assisted Economics Measurement From Survey: Evidence from Public Employee Pension Choice


124. CaST: Causal Discovery via Spatio-Temporal Graphs in Disaster Tweets


125. Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models


126. RAP: KV-Cache Compression via RoPE-Aligned Pruning


127. Social Catalysts, Not Moral Agents: The Illusion of Alignment in LLM Societies


128. ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization


129. Constitutional Spec-Driven Development: Enforcing Security by Construction in AI-Assisted Code Generation


130. QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals


131. Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective


132. DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM-based Fact-Checking Systems


133. MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics


134. Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs


135. Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards


136. HyPAC: Cost-Efficient LLMs-Human Hybrid Annotation with PAC Error Guarantees


137. ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents


138. D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs


139. Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization


140. SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models


141. Toward Ultra-Long-Horizon Sequential Model Editing


142. IMU-1: Sample-Efficient Pre-training of Small Language Models


143. Scaled Dot-Product Attention implements projection of inputs onto a common surface


144. Evaluation of Large Language Models’ educational feedback in Higher Education: potential, limitations and implications for educational practice


145. GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning


146. Beyond Translation: Cross-Cultural Meme Transcreation with Vision-Language Models


147. CodeGuard: Improving LLM Guardrails in CS Education


148. Test-Time Detoxification without Training or Learning Anything


149. STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models


150. RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System