Key LLM-Related Papers - 2026-01-01

1. Web World Models


2. Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation


3. The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction


4. Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following


5. AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis


6. CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations


7. Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control


8. TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI


9. From Model Choice to Model Belief: Establishing a New Measure for LLM-Based Research



11. InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization


12. Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients


13. Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K–12 Education


14. Multimodal Fact-Checking: An Agent-based Approach


15. HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery


16. Memento-II: Learning by Stateful Reflective Memory


17. TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning


18. DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation


19. The Wisdom of Deliberating AI Crowds: Does Deliberation Improve LLM-Based Forecasting?


20. LLM Agents as VC investors: Predicting Startup Success via RolePlay-Based Collective Simulation


21. Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation


22. Lessons from Neuroscience for AI: How integrating Actions, Compositional Structure and Episodic Memory could enable Safe, Interpretable and Human-Like AI


23. DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior


24. Monadic Context Engineering


25. HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification


26. Logic Sketch Prompting (LSP): A Deterministic and Interpretable Prompting Method


27. Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks


28. GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks


29. Emergent Persuasion: Will LLMs Persuade Without Being Prompted?


30. Bidirectional RAG: Safe Self-Improving Retrieval-Augmented Generation Through Multi-Stage Validation


31. Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing


32. BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization


33. RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature


34. VL-RouterBench: A Benchmark for Vision-Language Model Routing


35. Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks


36. Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs


37. PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis


38. Alpha-R1: Alpha Screening with LLM Reasoning via Reinforcement Learning


39. UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?


40. Agentic AI for Autonomous Defense in Software Supply Chain Security: Beyond Provenance to Vulnerability Mitigation


41. Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings


42. Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance


43. CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models


44. Theoretical Foundations of Scaling Law in Familial Models


45. Securing the AI Supply Chain: What Can We Learn From Developer-Reported Security Issues and Solutions of AI Projects?


46. Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2


47. AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents


48. The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models


49. Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL


50. MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images


51. Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation


52. ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing


53. Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process


54. Not too long do read: Evaluating LLM-generated extreme scientific summaries


55. EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion


56. Reservoir Computing inspired Matrix Multiplication-free Language Model


57. It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents


58. How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure


59. A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms


60. Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning


61. Trust Region Masking for Long-Horizon LLM Reinforcement Learning


62. Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware


63. An Architecture-Led Hybrid Report on Body Language Detection Project


64. LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models


65. OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding


66. Agentic AI for Cyber Resilience: A New Security Paradigm and Its System-Theoretic Foundations


67. FasterPy: An LLM-based Code Execution Efficiency Optimization Framework


68. CNSight: Evaluation of Clinical Note Segmentation Tools


69. Understanding the Mechanisms of Fast Hyperparameter Transfer


70. Robust LLM-based Column Type Annotation via Prompt Augmentation with LoRA Tuning


71. Harnessing Large Language Models for Biomedical Named Entity Recognition


72. FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents


73. Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency


74. Scaling Unverifiable Rewards: A Case Study on Visual Insights


75. Learning When Not to Attend Globally


76. RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure


77. Predicting LLM Correctness in Prosthodontics Using Metadata and Hallucination Signals


78. Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring


79. Role-Based Fault Tolerance System for LLM RL Post-Training


80. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds


81. The Bayesian Geometry of Transformer Attention


82. Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving


83. Emergence of Human to Robot Transfer in Vision-Language-Action Models


84. Efficient Multi-Model Orchestration for Self-Hosted Large Language Models


85. AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents


86. LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition


87. Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration


88. Towards Efficient Post-Training via Fourier-Driven Adapter Architectures


89. Cost-Aware Text-to-SQL: An Empirical Study of Cloud Compute Costs for LLM-Generated Queries


90. VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement


91. The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models


92. SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents


93. VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning


94. LLMBoost: Make Large Language Models Stronger with Boosting


95. Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection


96. The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency


97. LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs


98. Agentic Software Issue Resolution with Large Language Models: A Survey


99. Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation


100. Masking Teacher and Reinforcing Student for Distilling Vision-Language Models


101. DiRL: An Efficient Post-Training Framework for Diffusion Language Models


102. VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs


103. Literature Mining System for Nutraceutical Biosynthesis: From AI Framework to Biological Insight


104. ReGAIN: Retrieval-Grounded AI Framework for Network Traffic Analysis


105. VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition


106. MatKV: Trading Compute for Flash Storage in LLM Inference


107. Unbiased Visual Reasoning with Controlled Visual Inputs


108. Wireless Traffic Prediction with Large Language Model


109. BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs


110. Adaptive GPU Resource Allocation for Multi-Agent Collaborative Reasoning in Serverless Environments


111. GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs


112. Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models


113. ReCollab: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling


114. GPU-Virt-Bench: A Comprehensive Benchmarking Framework for Software-Based GPU Virtualization Systems