Key LLM-Related Papers - 2026-01-09

1. Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions


2. Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification


3. Current Agents Fail to Leverage World Model as Tool for Foresight


4. ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition


5. EntroCoT: Enhancing Chain-of-Thought via Adaptive Entropy-Guided Segmentation


6. Personalized Medication Planning via Direct Domain Modeling and LLM-Generated Heuristics


7. Architecting Agentic Communities using Design Patterns


8. Interleaved Tool-Call Reasoning for Protein Function Understanding


9. Controllable LLM Reasoning via Sparse Autoencoder-Based Steering


10. SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models


11. ReEfBench: Quantifying the Reasoning Efficiency of LLMs


12. STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules


13. Evolving Programmatic Skill Networks


14. CPGPrompt: Translating Clinical Guidelines into LLM-Executable Decision Support


15. Enhancing LLM Instruction Following: An Evaluation-Driven Multi-Agentic Workflow for Prompt Instructions Optimization


16. Digital Red Queen: Adversarial Program Evolution in Core War with LLMs


17. ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models


18. Layer-wise Positional Bias in Short-Context Language Modeling


19. HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense


20. FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning


21. FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection


22. Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training


23. What Matters For Safety Alignment?


24. When Numbers Start Talking: Implicit Numerical Coordination Among LLM-Based Agents


25. AI Generated Text Detection


26. Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models


27. Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework


28. Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents


29. Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms


30. O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL


31. RadDiff: Describing Differences in Radiology Image Sets with Natural Language


32. From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level


33. CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval


34. R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification


35. MHRC-Bench: A Multilingual Hardware Repository-Level Code Completion benchmark


36. TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL


37. ADEPT: Adaptive Dynamic Early-Exit Process for Transformers


38. From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs


39. Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis


40. e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings


41. AMIR-GRPO: Inducing Implicit Preference Signals into GRPO


42. Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines


43. Policy-Guided Search on Tree-of-Thoughts for Efficient Problem Solving with Bounded Language Model Queries


44. ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification


45. From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs


46. Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions


47. Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios


48. Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict


49. Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models



51. IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation


52. Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning


53. SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models


54. Efficient Sequential Recommendation for Long Term User Interest Via Personalization


55. EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning


56. Content vs. Form: What Drives the Writing Score Gap Across Socioeconomic Backgrounds? A Generated Panel Approach


57. FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder


58. Automated Feedback Generation for Undergraduate Mathematics: Development and Evaluation of an AI Teaching Assistant


59. Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale


60. MARVEL: A Multi Agent-based Research Validator and Enabler using Large Language Models


61. Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models


62. Jailbreaking LLMs Without Gradients or Priors: Effective and Transferable Attacks


63. Tigrinya Number Verbalization: Rules, Algorithm, and Implementation


64. Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning


65. Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models


66. MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models


67. Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64


68. Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting


69. Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning


70. Why LLMs Aren’t Scientists Yet: Lessons from Four Autonomous Research Attempts


71. VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models


72. 130k Lines of Formal Topology in Two Weeks: Simple and Cheap Autoformalization for Everyone?


73. AgentMark: Utility-Preserving Behavioral Watermarking for Agents


74. Automated Post-Incident Policy Gap Analysis via Threat-Informed Evidence Mapping using Large Language Models


75. HyperCLOVA X 32B Think


76. Feedback Indices to Evaluate LLM Responses to Rebuttals for Multiple Choice Type Questions


77. $α^3$-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks


78. MixRx: Predicting Drug Combination Interactions with LLMs


79. Topic Segmentation Using Generative Language Models


80. LLM_annotate: A Python package for annotating and analyzing fiction characters


81. GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators


82. Less is more: Not all samples are effective for evaluation


83. The Instruction Gap: LLMs get lost in Following Instruction


84. Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support


85. Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models