Notable LLM Papers - 2026-05-19

1. Actionable World Representation


2. What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models


3. SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents


4. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches


5. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment


6. GIM: Evaluating models via tasks that integrate multiple cognitive domains


7. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science


8. Latent Action Reparameterization for Efficient Agent Inference


9. VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation


10. AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment


11. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi


12. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks


13. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction


14. Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine


15. Generative AI and the Productivity Divide: Human-AI Complementarities in Education


16. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction


17. LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning


18. TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?


19. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery


20. Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective


21. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain


22. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition


23. Evaluating Cognitive Age Alignment in Interactive AI Agents


24. PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization


25. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents


26. Interactive Evaluation Requires a Design Science


27. Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling


28. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery


29. Harnessing LLM Agents with Skill Programs


30. EXG: Self-Evolving Agents with Experience Graphs


31. Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models


32. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents


33. Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents


34. GraphMind: From Operational Traces to Self-Evolving Workflow Automation


35. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment


36. NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents


37. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models


38. Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis


39. RAG-based EEG-to-Text Translation Using Deep Learning and LLMs


40. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure


41. Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design


42. QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI


43. ADR: An Agentic Detection System for Enterprise Agentic AI Security


44. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings


45. Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification


46. CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models


47. MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation


48. CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean


49. CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials


50. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding


51. MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop


52. Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design


53. Capturing LLM Capabilities via Evidence-Calibrated Query Clustering


54. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics


55. RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation


56. AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation


57. Towards Human-Level Book-Writing Capability


58. PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models


59. Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management


60. How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study


61. Reasoning Can Be Restored by Correcting a Few Decision Tokens


62. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models


63. Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework


64. NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning


65. State Contamination in Memory-Augmented LLM Agents


66. PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play


67. GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction


68. Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment


69. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning


70. TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens


71. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation


72. Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators


73. From Prompts to Protocols: An AI Agent for Laboratory Automation


74. ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning


75. DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention


76. Code as Agent Harness


77. Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation


78. Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency


79. Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents


80. Post-Trained MoE Can Skip Half Experts via Self-Distillation


81. CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark


82. CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic


83. Not What You Asked For: Typographic Attacks in Household Robot Manipulation


84. Estimating Item Difficulty with Large Language Models as Experts


85. STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics


86. Continuous Diffusion Scales Competitively with Discrete Diffusion for Language


87. AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers


88. GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets


89. Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation


90. What is Holding Back Latent Visual Reasoning?


91. EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective


92. Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology


93. Prompts Don’t Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control


94. Qumus: Realization of An Embodied AI Quantum Material Experimentalist


95. SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution


96. Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation


97. Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers


98. The Hidden Cost of Contextual Sycophancy: an AI Literacy Intervention in Human-AI Collaboration


99. Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents


100. Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering


101. Alignment Dynamics in LLM Fine-Tuning


102. CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories


103. From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG


104. CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook


105. Machine Unlearning for Masked Diffusion Language Models


106. Multilingual jailbreaking of LLMs using low-resource languages


107. Are Sparse Autoencoder Benchmarks Reliable?


108. Context Memorization for Efficient Long Context Generation


109. SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning


110. PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries


111. Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency


112. Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models


113. An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments


114. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers


115. A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback


116. PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows


117. FedSDR: Federated Self-Distillation with Rectification


118. MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization


119. Predictive Prefetching for Retrieval-Augmented Generation


120. Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling


121. BLAgent: Agentic RAG for File-Level Bug Localization


122. A More Word-like Image Tokenization for MLLMs


123. BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting


124. Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA


125. Multi-agent AI systems outperform human teams in creativity


126. HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents


127. f-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control


128. Generating Pretraining Tokens from Organic Data for Data-Bound Scaling


129. CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models


130. Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding


131. TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training


132. Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale


133. OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization


134. Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification


135. PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship


136. LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models


137. Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports


138. Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review


139. CasualSynth: Generating Structurally Sound Synthetic Data


140. ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse


141. MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair


142. Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP


143. DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents


144. Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications


145. Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization


146. MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer


147. Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment


148. Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment


149. ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents


150. Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models


151. StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting


152. ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation


153. LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models


154. When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack


155. ContractBench: Can LLM Agents Preserve Observation Contracts?


156. Rover: Context-aware Conflict Resolution with LLM


157. Fidelity Probes for Specification–Code Alignment


158. Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies


159. PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media


160. Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation


161. Why Do Safety Guardrails Degrade Across Languages?


162. OpenJarvis: Personal AI, On Personal Devices


163. Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference


164. STRIDE-AI: A Threat Modeling Framework for Generative AI Security Assessment


165. Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States


166. UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation


167. The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning


168. DynMuon: A Dynamic Spectral Shaping View of Muon


169. SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning


170. S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination


171. D²Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning


172. Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation


173. Task Abstention for Large Language Models in Code Generation


174. PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts


175. Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training


176. BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks


177. Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents


178. WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI


179. Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps


180. The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence


181. DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios


182. Some[Body] Must Receive That Pain for Agent Accountability


183. Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles


184. Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction


185. Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation


186. AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents


187. TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition


188. Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning


189. Exploring Lightweight Large Language Models for Court View Generation


190. EmoMind: Decoding Affective Captions from Human Brain fMRI


191. GeoWorld-VLM: Geometry from World Models for Vision-Language Models


192. A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research


193. SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs


194. PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems


195. To Trust or Not to Trust: Authors’ Response to AI-based Reviews


196. PromptDecipher: Supporting AI Tutor Authoring Through Editable Simulated Interactions


197. GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering


198. RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification



200. Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework


201. The Scaling Laws of Skills in LLM Agent Systems


202. MoleCode unlocks structural intelligence in large language models



204. LERA: LLM-Enhanced RAG for Ad Auction in Generative Chatbots


205. Strategic Over-Parameterization for Generalizable Low-Rank Adaptation


206. Asking Back: Interaction-Layer Antidistillation Watermarks


207. Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign


208. Membership Inference Attacks on Discrete Diffusion Language Models


209. KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy


210. A Theory of Training Profit-Optimal LLMs


211. Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments


212. CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning


213. Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift