Key LLM-Related Papers - 2025-10-17

1. Agentic Design of Compositional Machines


2. GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning


3. Stable but Miscalibrated: A Kantian View on Overconfidence from Filters to Large Language Models


4. Budget-aware Test-time Scaling via Discriminative Verification


5. Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates


6. The Gatekeeper Knows Enough


7. Where to Search: Measure the Prior-Structured Search Space of LLM Agents


8. Boosting Instruction Following at Scale


9. RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning


10. Agentic NL2SQL to Reduce Computational Costs


11. SimKO: Simple Pass@K Policy Optimization


12. ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling


13. Cognitive-Aligned Spatio-Temporal Large Language Models For Next Point-of-Interest Prediction


14. Beyond Hallucinations: The Illusion of Understanding in Large Language Models


15. ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks


16. LLM Agents Beyond Utility: An Open-Ended Perspective


17. JSPLIT: A Taxonomy-based Solution for Prompt Bloating in Model Context Protocol


18. IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning


19. Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control


20. Can MLLMs Absorb Math Reasoning Abilities from LLMs as Free Lunch?


21. Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction


22. Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies


23. A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space


24. Towards Agentic Self-Learning LLMs in Search Environment


25. Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks


26. JEDA: Query-Free Clinical Order Search from Ambient Dialogues


27. CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization


28. Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems


29. Generating Fair Consensus Statements with Social Choice on Token-Level MDPs


30. Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment


31. From Pixels to Words – Towards Native Vision-Language Primitives at Scale


32. Attention Is All You Need for KV Cache in Diffusion LLMs


33. TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar


34. RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks


35. Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents


36. MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics


37. LaSeR: Reinforcement Learning with Last-Token Self-Rewarding


38. Predicting Task Performance with Context-aware Scaling Laws


39. Reasoning with Sampling: Your Base Model is Smarter Than You Think


40. Benchmarking Multimodal Large Language Models for Face Recognition


41. Cross-Scenario Unified Modeling of User Interests at Billion Scale


42. Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning


43. COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes


44. Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries


45. DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models


46. Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling


47. xLLM Technical Report


48. An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs


49. RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF


50. Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models


51. Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures


52. Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering


53. Just-In-Time Objectives: A General Approach for Specialized AI Interactions


54. Selective Labeling with False Discovery Rate Control


55. State Your Intention to Steer Your Attention: An AI Assistant for Intentional Digital Living


56. E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task


57. Stealthy Dual-Trigger Backdoors: Attacking Prompt Tuning in LM-Empowered Graph Foundation Models


58. LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models


59. Holdout-Loss-Based Data Selection for LLM Finetuning via In-Context Learning


60. A Free Lunch in LLM Compression: Revisiting Retraining after Pruning


61. Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following


62. The Role of Social Learning and Collective Norm Formation in Fostering Cooperation in LLM Multi-Agent Systems


63. MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering


64. FairBatching: Fairness-Aware Batch Formation for LLM Inference


65. Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers


66. From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program


67. CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering


68. Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts


69. Stop-RAG: Value-Based Retrieval Control for Iterative RAG


70. Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL


71. MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking


72. Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding


73. PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering


74. Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation


75. CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions


76. Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models


77. Large Scale Retrieval for the LinkedIn Feed using Causal Language Models


78. LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning


79. DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans


80. FinAI Data Assistant: LLM-based Financial Database Query Processing with the OpenAI Function Calling API


81. Inferred global dense residue transition graphs from primary structure sequences enable protein interaction prediction via directed graph convolutional neural networks


82. Toward Cybersecurity-Expert Small Language Models


83. Unlocking Out-of-Distribution Generalization in Transformers via Recursive Latent Space Reasoning


84. Every Language Model Has a Forgery-Resistant Signature


85. One Bug, Hundreds Behind: LLMs for Large-Scale Bug Discovery


86. Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games


87. Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models


88. Static Sandboxes Are Inadequate: Modeling Societal Complexity Requires Open-Ended Co-Evolution in LLM-Based Multi-Agent Simulations


89. Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention


90. Big Reasoning with Small Models: Instruction Retrieval at Inference Time


91. LLMs Can Get “Brain Rot”!


92. Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models


93. Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms


94. AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs


95. Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning


96. Schema for In-Context Learning


97. Benefits and Limitations of Communication in Multi-Agent Reasoning


98. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences


99. GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents


100. Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection


101. K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding


102. A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness


103. Reliable Fine-Grained Evaluation of Natural Language Math Proofs


104. Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization


105. Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production


106. Unlocking the Potential of Diffusion Language Models through Template Infilling


107. Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues


108. ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing


109. Benchmarking Correctness and Security in Multi-Turn Code Generation


110. From Craft to Constitution: A Governance-First Paradigm for Principled Agent Engineering


111. Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA


112. Harnessing Consistency for Robust Test-Time LLM Ensemble


113. BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation


114. ConsistencyAI: A Benchmark to Assess LLMs’ Factual Consistency When Responding to Different Demographic Groups


115. Revisiting the UID Hypothesis in LLM Reasoning Traces


116. On-device System of Compositional Multi-tasking in Large Language Models


117. DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models


118. Serialized EHR make for good text representations


119. ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking


120. Meronymic Ontology Extraction via Large Language Models


121. SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models


122. ConDABench: Interactive Evaluation of Language Models for Data Analysis


123. Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference


124. Users as Annotators: LLM Preference Learning from Comparison Mode


125. A Linguistics-Aware LLM Watermarking via Syntactic Predictability


126. A2AS: Agentic AI Runtime Security and Self-Defense


127. Generative AI in Heritage Practice: Improving the Accessibility of Heritage Guidance