Key LLM-Related Papers - 2025-10-01

1. Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models


2. Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark


3. Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework


4. OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!


5. TVS Sidekick: Challenges and Practical Insights from Deploying Large Language Models in the Enterprise


6. Extreme Self-Preference in Language Models


7. Zero-Shot Decentralized Federated Learning


8. OntoAligner Meets Knowledge Graph Embedding Aligners


9. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents


10. SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models


11. AI Playing Business Games: Benchmarking Large Language Models on Managerial Decision-Making in Dynamic Simulations


12. Interactive Learning for LLM Reasoning


13. ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning


14. SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training


15. Diversity-Incentivized Exploration for Versatile Reasoning


16. Human-Centered Evaluation of RAG outputs: a framework and questionnaire for human-AI collaboration


17. LLM Agents for Knowledge Discovery in Atomic Layer Processing


18. ‘Too much alignment; not enough culture’: Re-balancing cultural alignment practices in LLMs


19. 90% Faster, 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development


20. Beyond the Algorithm: A Field Guide to Deploying AI Agents in Clinical Practice


21. MEDAKA: Construction of Biomedical Knowledge Graphs Using Large Language Models


22. SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs


23. Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research



25. Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline


26. Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions


27. RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning


28. NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving


29. Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA


30. DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models


31. SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents


32. Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs


33. ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack


34. HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis



36. Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs


37. Galton’s Law of Mediocrity: Why Large Language Models Regress to the Mean and Fail at Creativity in Advertising


38. NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language


39. Collaborative Compression for Large-Scale MoE Deployment on Edge


40. SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation


41. GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination


42. SOCK: A Benchmark for Measuring Self-Replication in Large Language Models


43. A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments


44. Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks


45. Causal Autoencoder-like Generation of Feedback Fuzzy Cognitive Maps with an LLM Agent


46. Building the EHR Foundation Model via Next Event Prediction


47. ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning


48. Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models


49. Radiology’s Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology


50. A(I)nimism: Re-enchanting the World Through AI-Mediated Object Interaction


51. RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale


52. Beyond Static Retrieval: Opportunities and Pitfalls of Iterative Retrieval in GraphRAG


53. Understanding Generative Recommendation with Semantic IDs from a Model-scaling View


54. TDHook: A Lightweight Framework for Interpretability


55. Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition


56. RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs



58. From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models


59. Where LLM Agents Fail and How They can Learn From Failures


60. Structural Reward Model: Enhancing Interpretability, Efficiency, and Scalability in Reward Modeling


61. SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction


62. Spontaneous High-Order Generalization in Neural Theory-of-Mind Networks


63. Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents


64. Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution


65. ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents


66. Toward Causal-Visual Programming: Enhancing Agentic Reasoning in Low-Code Environments


67. RL in the Wild: Characterizing RLVR Training in LLM Deployment


68. RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration


69. Language Model Planning from an Information Theoretic Perspective


70. Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration


71. A Formal Comparison Between Chain-of-Thought and Latent Thought


72. Blueprint-Bench: Comparing spatial intelligence of LLMs, agents and image models


73. Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training


74. MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages


75. Deconstructing Self-Bias in LLM-generated Translation Benchmarks


76. Are Robust LLM Fingerprints Adversarially Robust?


77. OceanGym: A Benchmark Environment for Underwater Embodied Agents



79. VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications


80. Regression Language Models for Code



82. ACT: Agentic Classification Tree


83. AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size


84. SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From


85. Game-Time: Evaluating Temporal Dynamics in Spoken Language Models


86. Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning


87. SDA-PLANNER: State-Dependency Aware Adaptive Planner for Embodied Task Planning



89. QUARTZ : QA-based Unsupervised Abstractive Refinement for Task-oriented Dialogue Summarization


90. Finetune Once: Decoupling General & Domain Learning with Dynamic Boosted Annealing



92. Toward an Unbiased Collective Memory for Efficient LLM-Based Agentic 6G Cross-Domain Management


93. Auto-ARGUE: LLM-Based Report Generation Evaluation


94. Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis


95. OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models


96. End-to-End Aspect-Guided Review Summarization at Scale


97. Muon Outperforms Adam in Tail-End Associative Memory Learning


98. Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations


99. MHINDR - a DSM5 based mental health diagnosis and recommendation framework using LLM


100. R-Log: Incentivizing Log Analysis Capability in LLMs via Reasoning-based Reinforcement Learning


101. Accelerating LLM Inference with Precomputed Query Storage


102. PerQ: Efficient Evaluation of Multilingual Text Personalization Quality


103. RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs’ Contextual Sensitivity


104. Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation


105. More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models


106. Distillation of Large Language Models via Concrete Score Matching


107. VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions


108. Learning to Reason as Action Abstractions with Scalable Mid-Training RL


109. Better with Less: Small Proprietary Models Surpass Large Language Models in Financial Transaction Understanding


110. Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding


111. V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs


112. Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs


113. TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning


114. Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications


115. The AI Productivity Index (APEX)


116. HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling


117. LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts


118. STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents


119. Probing the Limits of Stylistic Alignment in Vision-Language Models


120. Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model


121. Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play


122. Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions


123. Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning


124. VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models


125. Calibrating Verbalized Confidence with Self-Generated Distractors


126. LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models


127. Not Wrong, But Untrue: LLM Overconfidence in Document-Based Queries


128. EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition


129. PIPer: On-Device Environment Setup via Online Reinforcement Learning


130. Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs


131. From Faithfulness to Correctness: Generative Reward Models that Think Critically


132. A Cartography of Open Collaboration in Open Source AI: Mapping Practices, Motivations, and Governance in 14 Open Large Language Model Projects


133. SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs


134. Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs


135. Generative Value Conflicts Reveal LLM Priorities


136. From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation


137. Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning


138. Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development


139. A Measurement Study of Model Context Protocol


140. Artificial Authority: From Machine Minds to Political Alignments. An Experimental Analysis of Democratic and Autocratic Biases in Large-Language Models


141. Effectiveness of Large Language Models in Simulating Regional Psychological Structures: An Empirical Examination of Personality and Subjective Well-being


142. DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification


143. Dynamic Policy Induction for Adaptive Prompt Optimization: Bridging the Efficiency-Accuracy Gap via Lightweight Reinforcement Learning


144. From NL2SQL to NL2GeoSQL: GeoSQL-Eval for automated evaluation of LLMs on PostGIS queries


145. Knowledge distillation through geometry-aware representational alignment


146. BEV-VLM: Trajectory Planning via Unified BEV Abstraction


147. BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software


148. Protocode: Prototype-Driven Interpretability for Code Generation in LLMs


149. Reinforcement Learning-Guided Chain-of-Draft for Token-Efficient Code Generation


150. A Benchmark for Localizing Code and Non-Code Issues in Software Projects


151. HAMMER: Hamiltonian Curiosity Augmented Large Language Model Reinforcement


152. PALADIN: Self-Correcting Language Model Agents to Cure Tool-Failure Cases


153. Multi-level Diagnosis and Evaluation for Robust Tabular Feature Engineering with Large Language Models


154. Spectral Logit Sculpting: Adaptive Low-Rank Logit Transformation for Controlled Text Generation


155. Generating High-Quality Datasets for Code Editing via Open-Source Language Models


156. Towards Repository-Level Program Verification with Large Language Models


157. APRIL: API Synthesis with Automatic Prompt Optimization and Reinforcement Learning


158. Devstral: Fine-tuning Language Models for Coding Agent Applications


159. UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning


160. AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving


161. Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking