Major LLM-Related Papers - 2026-01-29

1. SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models


2. Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)


3. MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents


4. Investigating the Development of Task-Oriented Communication in Vision-Language Models


5. Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies


6. PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs


7. CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning


8. Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution


9. AMA: Adaptive Memory via Multi-Agent Collaboration


10. ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue


11. Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning


12. Towards Intelligent Urban Park Development Monitoring: LLM Agents for Multi-Modal Information Fusion and Analysis


13. Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control


14. Insight Agents: An LLM-Based Multi-Agent System for Data Insights


15. Fuzzy Categorical Planning: Autonomous Goal Satisfaction with Graded Semantic Constraints


16. Teaching LLMs to Ask: Self-Querying Category-Theoretic Planning for Under-Specified Reasoning


17. Reward Models Inherit Value Biases from Pretraining


18. Open-Vocabulary Functional 3D Human-Scene Interaction Generation


19. Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning


20. Reinforcement Learning via Self-Distillation


21. HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs


22. QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks


23. Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling


24. LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?


25. Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework


26. Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science


27. GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection


28. Agent Benchmarks Fail Public Sector Requirements


29. Interpreting Emergent Extreme Events in Multi-Agent Systems


30. Audio Deepfake Detection in the Age of Advanced Text-to-Speech models


31. Let’s Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models


32. Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT


33. GuideAI: A Real-time Personalized Learning Solution with Adaptive Interventions


34. LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning


35. Multimodal Multi-Agent Ransomware Analysis Using AutoGen


36. Demonstration-Free Robotic Control via LLM Agents


37. Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning



39. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips


40. Physically Guided Visual Mass Estimation from a Single RGB Image


41. Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction


42. Beyond the Needle’s Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale


43. Eliciting Least-to-Most Reasoning for Phishing URL Detection


44. Automated Benchmark Generation from Domain Guidelines Informed by Bloom’s Taxonomy


45. MALLOC: Benchmarking the Memory-aware Long Sequence Compression for Large Sequential Recommendation


46. What’s the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering


47. Large language models accurately predict public perceptions of support for climate action worldwide


48. Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models


49. Membership Inference Attacks Against Fine-tuned Diffusion Language Models


50. Dynamics of Human-AI Collective Knowledge on the Web: A Scalable Model and Insights for Sustainable Growth


51. LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation


52. VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning


53. CiMRAG: Cim-Aware Domain-Adaptive and Noise-Resilient Retrieval-Augmented Generation for Edge-Based LLMs


54. LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?


55. On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text


56. VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models


57. LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning


58. Bench4HLS: End-to-End Evaluation of LLMs in High-Level Synthesis Code Generation


59. Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data


60. Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents


61. Quantifying non deterministic drift in large language models


62. Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle


63. SDUs DAISY: A Benchmark for Danish Culture


64. Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding


65. The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models


66. Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study


67. OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling


68. Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation


69. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue


70. Demystifying Multi-Agent Debate: The Role of Confidence and Diversity


71. Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication


72. Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments


73. From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text


74. Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study


75. DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs


76. STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification