Key LLM-Related Papers - 2026-01-08

1. MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents


2. InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents


3. Rationale-Grounded In-Context Learning for Time Series Reasoning with Multimodal Large Language Models


4. Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning


5. Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning


6. ReTreVal: Reasoning Tree with Validation – A Hybrid Framework for Enhanced LLM Multi-Step Reasoning


7. HAL: Inducing Human-likeness in LLMs with Alignment


8. LLM Agent Framework for Intelligent Change Analysis in Urban Environment using Remote Sensing Imagery


9. The Path Ahead for Agentic AI: Challenges and Opportunities


10. Time-Scaling Is What Agents Need Now


11. Learning from Prompt itself: the Hierarchical Attribution Prompt Optimization


12. AWARE-US: Benchmark for Preference-Aware Resolution in Tool-Calling Agents


13. Orchestral AI: A Framework for Agent Orchestration


14. SimpleMem: Efficient Lifelong Memory for LLM Agents


15. Textual Explanations and Their Evaluations for Reinforcement Learning Policy


16. Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models


17. The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization


18. Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers


19. UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward


20. DIP: Dynamic In-Context Planner For Diffusion Language Models


21. AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation


22. Decentralized Autoregressive Generation


23. Prompt-Counterfactual Explanations for Generative AI System Behavior


24. Self-Verification is All You Need To Pass The Japanese Bar Examination


25. ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation


26. Who Laughs with Whom? Disentangling Influential Factors in Humor Preferences across User Clusters and LLMs


27. Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs


28. Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs


29. Joint Encoding of KV-Cache Blocks for Scalable LLM Serving


30. Do LLMs Encode Functional Importance of Reasoning Tokens?


31. Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage


32. Dementia-R1: Reinforced Pretraining and Reasoning from Unstructured Clinical Notes for Real-World Dementia Prognosis


33. SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering


34. JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification


35. Towards Faithful Reasoning in Comics for Small MLLMs


36. Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning


37. Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders


38. Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning


39. MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free


40. The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models


41. SastBench: A Benchmark for Testing Agentic SAST Triage


42. PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding


43. RAL2M: Retrieval Augmented Learning-To-Match Against Hallucination in Compliance-Guaranteed Service Systems


44. TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors


45. LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark


46. TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents


47. Netflix Artwork Personalization via LLM Post-training


48. Window-based Membership Inference Attacks Against Fine-tuned Large Language Models


49. Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism


50. Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices


51. Extracting books from production language models


52. When Do Tools and Planning Help LLMs Think? A Cost- and Latency-Aware Benchmark


53. Prioritized Replay for RL Post-training


54. TAAF: A Trace Abstraction and Analysis Framework Synergizing Knowledge Graphs and LLMs


55. Improved Evidence Extraction for Document Inconsistency Detection with LLMs


56. LAsset: An LLM-assisted Security Asset Identification Framework for System-on-Chip (SoC) Verification


57. Chronicals: A High-Performance Framework for LLM Fine-Tuning with 3.51x Speedup over Unsloth


58. LongDA: Benchmarking LLM Agents for Long-Document Data Analysis


59. FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions


60. Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models


61. Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency


62. LendNova: Towards Automated Credit Risk Assessment with Language Models


63. AI-exposed jobs deteriorated before ChatGPT


64. ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation


65. Enhancing Debugging Skills with AI-Powered Assistance: A Real-Time Tool for Debugging Support


66. GEM-Style Constraints for PEFT with Dual Gradient Projection in LoRA


67. Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative


68. Focus on What Matters: Fisher-Guided Adaptive Multimodal Fusion for Vulnerability Detection


69. WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics


70. A large-scale nanocrystal database with aligned synthesis and properties enabling generative inverse design


71. The Vibe-Check Protocol: Quantifying Cognitive Offloading in AI Programming


72. PCEval: A Benchmark for Evaluating Physical Computing Capabilities of Large Language Models


73. Tree of Preferences for Diversified Recommendation


74. How to Discover Knowledge for FutureG: Contextual RAG and LLM Prompting for O-RAN


75. The Refutability Gap: Challenges in Validating Reasoning by Large Language Models


76. LeafTutor: An AI Agent for Programming Assignment Tutoring


77. Permission Manifests for Web Agents


78. TextBridgeGNN: Pre-training Graph Neural Network for Cross-Domain Recommendation via Text-Guided Transfer


79. Towards Trustworthy LLM-Based Recommendation via Rationale Integration


80. The Impact of LLM-Generated Reviews on Recommender Systems: Textual Shifts, Performance Effects, and Strategic Platform Control


81. TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs