Major LLM-Related Papers - 2025-12-31

1. Web World Models


2. Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation


3. The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction


4. Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following


5. AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis


6. CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations


7. Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control


8. TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI


9. From Model Choice to Model Belief: Establishing a New Measure for LLM-Based Research



11. InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization


12. Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients


13. Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K–12 Education


14. Multimodal Fact-Checking: An Agent-based Approach


15. HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery


16. Memento-II: Learning by Stateful Reflective Memory


17. TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning


18. DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation


19. The Wisdom of Deliberating AI Crowds: Does Deliberation Improve LLM-Based Forecasting?


20. LLM Agents as VC investors: Predicting Startup Success via RolePlay-Based Collective Simulation


21. Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation


22. Lessons from Neuroscience for AI: How integrating Actions, Compositional Structure and Episodic Memory could enable Safe, Interpretable and Human-Like AI


23. DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior


24. Monadic Context Engineering


25. HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification


26. Logic Sketch Prompting (LSP): A Deterministic and Interpretable Prompting Method


27. Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks


28. GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks


29. Emergent Persuasion: Will LLMs Persuade Without Being Prompted?


30. Bidirectional RAG: Safe Self-Improving Retrieval-Augmented Generation Through Multi-Stage Validation


31. Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing


32. BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization


33. RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature


34. VL-RouterBench: A Benchmark for Vision-Language Model Routing


35. Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks


36. Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs


37. PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis


38. Alpha-R1: Alpha Screening with LLM Reasoning via Reinforcement Learning


39. Agentic AI for Autonomous Defense in Software Supply Chain Security: Beyond Provenance to Vulnerability Mitigation


40. Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings


41. Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance


42. CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models


43. Theoretical Foundations of Scaling Law in Familial Models


44. Securing the AI Supply Chain: What Can We Learn From Developer-Reported Security Issues and Solutions of AI Projects?


45. Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2


46. AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents


47. The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models


48. Splitwise: Collaborative Edge-Cloud Inference for LLMs via Lyapunov-Assisted DRL


49. MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images


50. Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation


51. ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing


52. Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process


53. Not too long do read: Evaluating LLM-generated extreme scientific summaries


54. EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion


55. Reservoir Computing inspired Matrix Multiplication-free Language Model


56. It’s a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents


57. How Much Data Is Enough? Uniform Convergence Bounds for Generative & Vision-Language Models under Low-Dimensional Structure


58. A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms


59. Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning


60. Trust Region Masking for Long-Horizon LLM Reinforcement Learning


61. Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen3-30B on Consumer-Grade Hardware


62. An Architecture-Led Hybrid Report on Body Language Detection Project


63. LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models


64. OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding


65. Agentic AI for Cyber Resilience: A New Security Paradigm and Its System-Theoretic Foundations


66. FasterPy: An LLM-based Code Execution Efficiency Optimization Framework


67. CNSight: Evaluation of Clinical Note Segmentation Tools


68. Understanding the Mechanisms of Fast Hyperparameter Transfer


69. Robust LLM-based Column Type Annotation via Prompt Augmentation with LoRA Tuning


70. Harnessing Large Language Models for Biomedical Named Entity Recognition


71. FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents


72. Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency


73. Scaling Unverifiable Rewards: A Case Study on Visual Insights


74. Learning When Not to Attend Globally


75. RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure


76. Predicting LLM Correctness in Prosthodontics Using Metadata and Hallucination Signals


77. Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring


78. Role-Based Fault Tolerance System for LLM RL Post-Training


79. Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds


80. The Bayesian Geometry of Transformer Attention


81. Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving


82. Emergence of Human to Robot Transfer in Vision-Language-Action Models


83. Efficient Multi-Model Orchestration for Self-Hosted Large Language Models


84. AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents


85. LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition


86. Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration


87. Towards Efficient Post-Training via Fourier-Driven Adapter Architectures


88. Cost-Aware Text-to-SQL: An Empirical Study of Cloud Compute Costs for LLM-Generated Queries


89. VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement


90. The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models


91. SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents


92. VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning


93. LLMBoost: Make Large Language Models Stronger with Boosting


94. Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection


95. The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency


96. LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs


97. Agentic Software Issue Resolution with Large Language Models: A Survey


98. Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation


99. Masking Teacher and Reinforcing Student for Distilling Vision-Language Models


100. DiRL: An Efficient Post-Training Framework for Diffusion Language Models


101. VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs


102. Literature Mining System for Nutraceutical Biosynthesis: From AI Framework to Biological Insight


103. ReGAIN: Retrieval-Grounded AI Framework for Network Traffic Analysis


104. VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition


105. MatKV: Trading Compute for Flash Storage in LLM Inference


106. Unbiased Visual Reasoning with Controlled Visual Inputs


107. Wireless Traffic Prediction with Large Language Model


108. BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs


109. Adaptive GPU Resource Allocation for Multi-Agent Collaborative Reasoning in Serverless Environments


110. GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs


111. Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models


112. ReCollab: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling


113. GPU-Virt-Bench: A Comprehensive Benchmarking Framework for Software-Based GPU Virtualization Systems