Major LLM-Related Papers - 2026-05-26

1. From Model Scaling to System Scaling: Scaling the Harness in Agentic AI


2. Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User’s Digital World


3. VeriTrace: Evolving Mental Models for Deep Research Agents


4. L2IR: Revealing Latent Intent in Graph Fraud Detection


5. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists


6. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing


7. MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning


8. A Deep Dive into Axiomatic Design – Part I: Problem Formulation


9. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions


10. CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents


11. Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy


12. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis


13. Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching


14. Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents


15. StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs


16. Credit Assignment with Resets in Language Model Reasoning


17. ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows


18. Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures


19. CODESKILL: Learning Self-Evolving Skills for Coding Agents


20. Towards end-to-end LLM-based censoring-aware survival analysis


21. Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models


22. AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems


23. Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts


24. LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design


25. FrontierOR: Benchmarking LLMs’ Capacity for Efficient Algorithm Design in Large-Scale Optimization


26. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs


27. SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation


28. SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking


29. Representation Without Control: Testing the Realization Effect in Language Models


30. Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling


31. Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction


32. Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients


33. Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration


34. TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps


35. Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications


36. Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning


37. Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning


38. Test-Time Deep Thinking to Explore Implicit Rules


39. CoRe-Code: Collaborative Reinforcement Learning for Code Generation


40. GRAIL: AI translation for scientists application workflow on satellite data


41. PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback



43. MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional


44. Emotional intelligence in large language models is fragmented across perception, cognition, and interaction


45. When Mean CE Fails: Median CE Can Better Track Language Model Quality


46. Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction


47. GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration


48. Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis


49. Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents


50. Learning to Reason Efficiently with A* Post-Training


51. Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models


52. Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models


53. PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models


54. Beyond Control-Flow: Integrating the Resource Perspective into Multi-Collaborative Process Modeling from Text


55. Hypothesis Generation and Inductive Inference in Children and Language Models


56. Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems


57. SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent


58. JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data


59. The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching


60. Advancing Graph Few-Shot Learning via In-Context Learning


61. Understanding and Mitigating Premature Confidence for Better LLM Reasoning


62. Distilling Game Code World Model Generation into Lightweight Large Language Models


63. When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification


64. Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts


65. Toward Enactive Artificial Intelligence


66. Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows


67. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks


68. When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs


69. EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages


70. Inference Time Context Sparsity: Illusion or Opportunity?


71. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models


72. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills


73. Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models


74. EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery


75. LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition


76. Reason–Imagine–Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving


77. Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security


78. Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform


79. LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs


80. QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems


81. From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems


82. Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof


83. Stop Comparing LLM Agents Without Disclosing the Harness


84. MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer’s Assistive Robotics


85. Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors


86. Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model


87. When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure


88. BODHI: Precise OS Kernel Specification Inference


89. Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs


90. Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction


91. How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning


92. Confidence Calibration in Large Language Models


93. In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models


94. Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation


95. Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models


96. Language Models Need Sleep


97. OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization


98. When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges


99. Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals


100. DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models


101. Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service


102. SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation


103. Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning


104. QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability


105. Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization


106. EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory



108. Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express


109. Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition


110. TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning


111. TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification


112. Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation


113. When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills


114. Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation


115. Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution


116. Multi-Agent Coordination Adaptation via Structure-Guided Orchestration


117. How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws


118. Simulating Human Memory with Language Models


119. AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization


120. Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines


121. Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models


122. Extreme Region Policy Distillation


123. PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation


124. BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data



126. Generative AI impacts on intra-urban inequality and skill premium in Beijing


127. IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference


128. From Simulation to Enaction: Post-trained language models recognize and react to their own generations


129. AI Content Moderation in Therapy Conversations


130. A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback


131. Binding Visual Features Point by Point


132. SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning


133. A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration


134. SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models


135. Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS


136. Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation


137. A general tensor-structured compression scheme for efficient large language models


138. CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures


139. Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction


140. READER: Reasoning-Enhanced AI-Generated Text Detection


141. Mimir: Large-scale Multilingual Concept Modeling


142. First, do no harm: Breaking suicidogenic echo chambers in media recommendation


143. Guess the Unified Model: How Much Can We Recover from Generated Images?


144. Quantifying Empirical Compute-Supervision Tradeoffs in RLVR


145. JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment


146. Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution


147. Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization


148. Hide to Guide: Learning via Semantic Masking


149. By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode


150. Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience


151. Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation


152. STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media


153. LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data: A Review of Solar, Wind, Weather, and Grid-Aware Decision Support


154. Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations


155. Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression


156. Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions


157. Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation


158. Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration


159. Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization


160. APT-Agent: Automated Penetration Testing using Large Language Models


161. Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering


162. When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation


163. The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth


164. Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts


165. Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection


166. Zero-Shot Parkinson’s Disease Detection from Speech: Comparing Large Audio and Language Models


167. Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models


168. Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering


169. CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM


170. From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks


171. Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems


172. Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning


173. Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring


174. World-State Transformations for Neuro-symbolic Interactive Storytelling


175. TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering


176. The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models


177. Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs


178. VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation


179. How Many Tools Should an LLM Agent See? A Chance-Corrected Answer


180. Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput


181. Measuring the Depth of LLM Unlearning via Activation Patching


182. Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning


183. Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory


184. PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction


185. SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors


186. Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers


187. FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis


188. Code2UML: Agentic LLMs with context engineering for scalable software visualization


189. VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation


190. Side-by-side Comparison Amplifies Dialect Bias in Language Models


191. ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training


192. ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale


193. Enhancing Reliability in LLM-Based Secure Code Generation


194. An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods


195. An Interactive Paradigm for Deep Research


196. Attested Tool-Server Admission: A Security Extension to the Model Context Protocol


197. Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation


198. Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning


199. Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation


200. Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment


201. Extracting Training Data from Diffusion Language Models via Infilling


202. PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection


203. Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development


204. Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries


205. The Time is Here for Just-in-Time Systems: Challenges and Opportunities


206. TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs


207. When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents


208. Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion


209. Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning


210. Feature Lottery? A Bifurcation Theory of Concept Emergence


211. Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing


212. More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries


213. Mixture of Complementary Agents for Robust LLM Ensemble


214. LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs


215. Harnessing AtomisticSkills for Agentic Atomistic Research


216. IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning


217. MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing


218. TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling


219. EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs


220. SODE: Analyzing Social Dynamics in LLM Agents


221. KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing


222. Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning


223. Artificial Effort


224. Agent-Facing Information Design in LLM Tool Registries


225. VineLM: Trie-Based Fine-Grained Control for Agentic Workflows


226. Raon-Speech Technical Report


227. Check Your LLM’s Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn’t Have)