Key LLM-Related Papers - 2026-05-12

1. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD


2. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World


3. The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning


4. Probing Cross-modal Information Hubs in Audio-Visual LLMs


5. NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation


6. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge


7. New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach


8. PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering


9. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox


10. MATRA: Modeling the Attack Surface of Agentic AI Systems – OpenClaw Case Study


11. The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents


12. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents


13. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks


14. Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies


15. PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines


16. Budget-Efficient Automatic Algorithm Design via Code Graph


17. LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation


18. LLM Jaggedness Unlocks Scientific Creativity


19. A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives


20. PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs


21. SLASH the Sink: Sharpening Structural Attention Inside LLMs


22. ASIA: an Autonomous System Identification Agent


23. LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs


24. GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic


25. Agent-X: Full Pipeline Acceleration of On-device AI Agents


26. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values


27. How Mobile World Model Guides GUI Agents?


28. TMAS: Scaling Test-Time Compute via Multi-Agent Synergy


29. Verifiable Process Rewards for Agentic Reasoning


30. Positive Alignment: Artificial Intelligence for Human Flourishing


31. AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks


32. IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs


33. Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery


34. Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution


35. Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing


36. Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring


37. Active Testing of Large Language Models via Approximate Neyman Allocation


38. Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust


39. Route by State, Recover from Trace: STAR with Failure-Aware Markov Routing for Multi-Agent Spatiotemporal Reasoning


40. Prospective Compression in Human Abstraction Learning


41. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution


42. expo: Exploration-prioritized policy optimization via adaptive KL regulation and Gaussian curriculum sampling


43. RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation


44. Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought


45. The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark


46. M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models


47. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations


48. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning


49. The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs


50. Medical Model Synthesis Architectures: A Case Study


51. Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents


52. Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities


53. CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents


54. Workspace Optimization: How to Train Your Agent


55. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning


56. LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs


57. A Game Theoretic Free Energy Analysis of Higher Order Synergy in Attention Heads of Large Language Models


58. Don’t Click That: Teaching Web Agents to Resist Deceptive Interfaces


59. VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection


60. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning


61. From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay


62. Do Linear Probes Generalize Better in Persona Coordinates?


63. NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning


64. Position: Avoid Overstretching LLMs for every Enterprise Task


65. CHAINTRIX: A multi-pipeline LLM-augmented framework for automated smart-contract security auditing


66. Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation


67. How LLMs Are Persuaded: A Few Attention Heads, Rerouted


68. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning


69. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning


70. A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web


71. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium


72. Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding


73. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations


74. Emergent Semantic Role Understanding in Language Models


75. Agentic MIP Research: Accelerated Constraint Handler Generation


76. Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment


77. CIVeX: Causal Intervention Verification for Language Agents


78. FORTIS: Benchmarking Over-Privilege in Agent Skills


79. Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas


80. MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments


81. Data-driven Circuit Discovery for Interpretability of Language Models


82. Token Economics for LLM Agents: A Dual-View Study from Computing and Economics


83. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators


84. Containment Verification: AI Safety Guarantees Independent of Alignment


85. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks


86. Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics


87. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization


88. Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery


89. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs


90. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces


91. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents


92. How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors


93. Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?


94. Reasoning Compression with Mixed-Policy Distillation


95. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems


96. AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design


97. Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation


98. AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization


99. SkillMaster: Toward Autonomous Skill Mastery in LLM Agents


100. Iterative Critique-and-Routing Controller for Multi-Agent Systems with Heterogeneous LLMs


101. MIND-Skill: Quality-Guaranteed Skill Generation via Multi-Agent Induction and Deduction


102. DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules


103. The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection


104. Why Retrying Fails: Context Contamination in LLM Agent Pipelines


105. Evaluating Developmental Cognition Capabilities of LLMs


106. Human-Inspired Memory Architecture for LLM Agents


107. Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care


108. OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control


109. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms


110. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models


111. Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification


112. LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification


113. Political Plasticity: An Analysis of Ideological Adaptability in Large Language Models


114. CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents


115. SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents


116. MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs


117. On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective


118. Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction


119. Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits


120. ELF: Embedded Language Flows


121. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale


122. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents


123. Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?


124. Training-Free Cultural Alignment of Large Language Models via Persona Disagreement


125. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing


126. Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights


127. Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing


128. Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning


129. Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization


130. Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model


131. Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish


132. The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions


133. Step Rejection Fine-Tuning: A Practical Distillation Recipe


134. Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions


135. When Can Digital Personas Reliably Approximate Human Survey Findings?


136. LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models


137. Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm


138. Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs


139. Re-Triggering Safeguards within LLMs for Jailbreak Detection


140. Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings


141. An agentic framework for gravitational-wave counterpart association in the multi-messenger era


142. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing


143. SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models


144. ThreatCore: A Benchmark for Explicit and Implicit Threat Detection


145. DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning


146. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs


147. Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets


148. AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation


149. RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild


150. EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant


151. SCALAR: A Neurosymbolic Framework for Automated Conjecture and Reasoning in Quantum Circuit Analysis


152. DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models


153. To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification


154. ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design



156. When Prompts Become Payloads: A Framework for Mitigating SQL Injection Attacks in Large Language Model-Driven Applications


157. When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews


158. MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph


159. ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models


160. Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization


161. NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding


162. Not-So-Strange Love: Language Models and Generative Linguistic Theories are More Compatible than They Appear


163. Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework


164. Speech-based Psychological Crisis Assessment using LLMs


165. Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning


166. Attention Drift: What Autoregressive Speculative Decoding Models Learn


167. G-Zero: Self-Play for Open-Ended Generation from Zero Data


168. PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning


169. Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs


170. Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward


171. NaiAD: Initiate Data-Driven Research for LLM Advertising


172. Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions


173. The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space


174. Continuous Latent Contexts Enable Efficient Online Learning in Transformers


175. Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents


176. Flag Varieties: A Geometric Framework for Deep Network Alignment


177. Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction


178. Pretraining large language models with MXFP4


179. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models


180. Insight: Enhancing Mobile Accessibility for Blind and Visually Impaired Users with LLMs


181. CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection


182. Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution


183. EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent


184. Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models



186. Entropy-informed Decoding: Adaptive Information-Driven Branching


187. The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods


188. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving


189. Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT


190. Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon


191. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents


192. Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies


193. MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies


194. SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications


195. Efficient Ensemble Selection from Binary and Pairwise Feedback


196. CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics


197. TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM


198. Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications


199. Position: AI Security Policy Should Target Systems, Not Models


200. LASSA Architecture-Based Autonomous Fault-Tolerant Control of Unmanned Underwater Vehicles


201. APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation


202. Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models


203. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering


204. From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs


205. Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code


206. Skill-R1: Agent Skill Evolution via Reinforcement Learning


207. HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities


208. RuPLaR: Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step


209. Towards Effective Theory of LLMs: A Representation Learning Approach


210. Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs


211. ProactBench: Beyond What The User Asked For


212. The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring


213. Detect, Localize, and Explain: Interactive Hierarchical Log Anomaly Analytics with LLM Augmentation


214. Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models


215. Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets


216. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation


217. Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers


218. From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages


219. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability


220. Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity