Key LLM-Related Papers - 2026-01-30

1. World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems


2. The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR


3. Optimizing Agentic Workflows using Meta-tools


4. CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty


5. Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models


6. Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic


7. ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models


8. Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities


9. AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making


10. From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning


11. astra-langchain4j: Experiences Combining LLMs and Agent Programming


12. KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement


13. CORE: Toward Ubiquitous 6G Intelligence Through Collaborative Orchestration of Large Language Model Agents Over Hierarchical Edge


14. A Unified XAI-LLM Approach for Endotracheal Suctioning Activity Recognition


15. BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics


16. Language-based Trial and Error Falls Behind in the Era of Experience


17. Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems


18. E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory


19. FBS: Modeling Native Parallel Reading inside a Transformer


20. TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning


21. SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding


22. ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval


23. Semantic Content Determines Algorithmic Performance


24. RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems


25. CORE: Collaborative Reasoning via Cross Teaching


26. Beyond Imitation: Reinforcement Learning for Active Latent Planning


27. Chain Of Thought Compression: A Theoretical Analysis


28. EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots


29. Meta Context Engineering via Agentic Skill Evolution


30. ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory


31. ARGORA: Orchestrated Argumentation for Causally Grounded LLM Reasoning and Decision Making


32. LLaMEA-SAGE: Guiding Automated Algorithm Design with Structural Feedback from Explainable AI


33. The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation


34. MAR: Efficient Large Language Models via Module-aware Architecture Refinement


35. The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus


36. ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management


37. ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design


38. The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making


39. When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models


40. System 1&2 Synergy via Dynamic Model Interpolation


41. TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models


42. NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents


43. Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization


44. Dynamic Framework for Collaborative Learning: Leveraging Advanced LLM with Adaptive Feedback Mechanisms


45. Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores


46. EHR-RAG: Bridging Long-Horizon Structured Electronic Health Records and Large Language Models via Enhanced Retrieval-Augmented Generation


47. Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks


48. TIDE: Tuning-Integrated Dynamic Evolution for LLM-Based Automated Heuristic Design


49. Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs


50. Intelli-Planner: Towards Customized Urban Planning via Large Language Model Empowered Reinforcement Learning


51. Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification


52. Do Reasoning Models Enhance Embedding Models?


53. MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models


54. FrontierScience: Evaluating AI’s Ability to Perform Expert-Level Scientific Tasks


55. Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving


56. Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning


57. Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation


58. Planner-Auditor Twin: Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement


59. How does information access affect LLM monitors’ ability to detect sabotage?


60. Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve


61. OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence


62. Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models


63. Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review


64. RedSage: A Cybersecurity Generalist LLM


65. DynaWeb: Model-Based Reinforcement Learning of Web Agents


66. Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers


67. StepShield: When, Not Whether to Intervene on Rogue Agents


68. SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents


69. SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence


70. Value-Based Pre-Training with Downstream Feedback


71. ECO: Quantized Training without Full-Precision Master Weights


72. Latent Adversarial Regularization for Offline Preference Optimization


73. Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models


74. MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources


75. A Separable Architecture for Continuous Token Representation in Language Models


76. Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models


77. When “Better” Prompts Hurt: Evaluation-Driven Iteration for LLM Applications


78. Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units


79. Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding


80. Industrialized Deception: The Collateral Effects of LLM-Generated Misinformation on Digital Ecosystems


81. Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text


82. Test-Time Compute Games


83. Moral Outrage Shapes Commitments Beyond Attention: Multimodal Moral Emotions on YouTube in Korea and the US


84. Effective LoRA Adapter Routing using Task Representations


85. Assessing the Business Process Modeling Competences of Large Language Models


86. EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference


87. Enhancing Language Models for Robust Greenwashing Detection


88. TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning


89. Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning


90. Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics


91. FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning


92. SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning


93. ILRR: Inference-Time Steering Method for Masked Diffusion Language Models


94. HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning


95. Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking


96. Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance


97. Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening


98. Shaping capabilities with token-level data filtering


99. On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression


100. More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)


101. Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation


102. Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation


103. Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs


104. L$^3$: Large Lookup Layers


105. HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing


106. SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation


107. From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning


108. The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation


109. Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving


110. Self-Improving Pretraining: using post-trained models to pretrain better models


111. DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher


112. GeoRC: A Benchmark for Geolocation Reasoning Chains


113. More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests


114. Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification


115. PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models


116. SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models


117. MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation


118. Scaling Embeddings Outperforms Scaling Experts in Language Models


119. Thinker: A vision-language foundation model for embodied intelligence


120. ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling


121. Adaptive and Robust Cost-Aware Proof of Quality for Decentralized LLM Inference Networks


122. Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space


123. Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement


124. PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs


125. AI-Assisted Engineering Should Track the Epistemic Status and Temporal Validity of Architectural Decisions


126. LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection


127. Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering


128. Textual Equilibrium Propagation for Deep Compound AI Systems


129. Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning


130. Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research


131. UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop


132. The Depth Delusion: Why Transformers Should Be Wider, Not Deeper


133. Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges


134. Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs


135. ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack


136. Rethinking LLM-Driven Heuristic Design: Generating Efficient and Specialized Solvers via Dynamics-Aware Optimization


137. Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion