Major LLM-Related Papers - 2026-04-14

1. Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games


2. Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems


3. DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness


4. Why Do Large Language Models Generate Harmful Content?


5. Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models


6. UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents


7. A collaborative agent with two lightweight synergistic models for autonomous crystal materials research


8. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model


9. OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems


10. Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents


11. Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning


12. Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval


13. From Agent Loops to Structured Graphs: A Scheduler-Theoretic Framework for LLM Agent Execution


14. The Missing Knowledge Layer in Cognitive Architectures for AI Agents


15. Dynamic Summary Generation for Interpretable Multimodal Depression Detection


16. PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers


17. BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows


18. Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model


19. Inspectable AI for Science: A Research Object Approach to Generative AI Governance


20. Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization


21. Environmental Footprint of GenAI Research: Insights from the Moshi Foundation Model


22. From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning


23. Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs


24. Hodoscope: Unsupervised Monitoring for AI Misbehaviors


25. From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience


26. Introspective Diffusion Language Models


27. Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics


28. Diffusion-CAM: Faithful Visual Explanations for dMLLMs


29. MAFIG: Multi-agent Driven Formal Instruction Generation Framework


30. Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models


31. ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks


32. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning


33. RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation


34. CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation


35. CASK: Core-Aware Selective KV Compression for Reasoning Traces


36. ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval


37. Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering


38. A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness


39. Your Model Diversity, Not Method, Determines Reasoning Strategy


40. CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms


41. Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making


42. When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling


43. Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation


44. FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning


45. Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks


46. Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?


47. From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning


48. Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis


49. Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation


50. A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning


51. CARO: Chain-of-Analogy Reasoning Optimization for Robust Content Moderation


52. CHAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs


53. Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs


54. PEMANT: Persona-Enriched Multi-Agent Negotiation for Travel


55. VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise


56. CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation


57. TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection


58. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents


59. VeriTrans: Fine-Tuned LLM-Assisted NL-to-PL Translation via a Deterministic Neuro-Symbolic Pipeline


60. From GPT-3 to GPT-5: Mapping their capabilities, scope, limitations, and consequences


61. Gypscie: A Cross-Platform AI Artifact Management System


62. TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale


63. The Amazing Agent Race: Strong Tool Users, Weak Navigators


64. SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning


65. Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts


66. SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding


67. Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards


68. Ontological Trajectory Forecasting via Finite Semigroup Iteration and Lie Algebra Approximation in Geopolitical Knowledge Graphs


69. AI Achieves a Perfect LSAT Score


70. FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks


71. New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework


72. HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks


73. In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach


74. What do your logits know? (The answer may surprise you!)


75. Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards


76. Steered LLM Activations are Non-Surjective


77. COMPOSITE-Stem


78. EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning


79. Pioneer Agent: Continual Improvement of Small Language Models in Production


80. The Myth of Expert Specialization in MoEs: Why Routing Reflects Geometry, Not Necessarily Domain Expertise


81. Belief-Aware VLM Model for Human-like Reasoning


82. How LLMs Might Think


83. General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging


84. Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling


85. LLMs for Text-Based Exploration and Navigation Under Partial Observability


86. From Scalars to Tensors: Declared Losses Recover Epistemic Distinctions That Neutrosophic Scalars Cannot Express


87. Hubble: An LLM-Driven Agentic Framework for Safe and Automated Alpha Factor Discovery


88. DERM-3R: A Resource-Efficient Multimodal Agents Framework for Dermatologic Diagnosis and Treatment in Real-World Clinical Settings


89. Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations


90. OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling


91. Help Without Being Asked: A Deployed Proactive Agent System for On-Call Support with Continuous Self-Improvement


92. Solving Physics Olympiad via Reinforcement Learning on Physics Simulators


93. C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts


94. A Mechanistic Analysis of Looped Reasoning Language Models


95. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection


96. General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks


97. Discourse Diversity in Multi-Turn Empathic Dialogue


98. Evaluating Cooperation in LLM Social Groups through Elected Leadership



100. Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind


101. Towards Autonomous Mechanistic Reasoning in Virtual Cells


102. RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents


103. A Triadic Suffix Tokenization Scheme for Numerical Reasoning


104. Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo


105. FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning


106. Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory


107. NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment


108. CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space


109. SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models


110. From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python


111. EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models


112. Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization


113. METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models


114. SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation


115. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration


116. METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues


117. Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations


118. Network Effects and Agreement Drift in LLM Debates


119. The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems


120. Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning


121. The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping


122. RECIPER: A Dual-View Retrieval Pipeline for Procedure-Oriented Materials Question Answering


123. Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method


124. Designing Adaptive Digital Nudging Systems with LLM-Driven Reasoning


125. CocoaBench: Evaluating Unified Digital Agents in the Wild


126. BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning


127. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding



129. Efficient Training for Cross-lingual Speech Language Models


130. Bottleneck Tokens for Unified Multimodal Retrieval


131. E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning


132. ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation


133. Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis


134. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds


135. A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities


136. Panoptic Pairwise Distortion Graph


137. When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies


138. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models


139. A molecular clock for writing systems reveals the quantitative impact of imperial power on cultural evolution


140. Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models


141. Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation


142. ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding


143. Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music


144. Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models


145. Ambiguity Detection and Elimination in Automated Executable Process Modeling


146. AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis


147. Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents


148. LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments


149. Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis


150. Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series


151. TInR: Exploring Tool-Internalized Reasoning in Large Language Models


152. Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction


153. Generating Multiple-Choice Knowledge Questions with Interpretable Difficulty Estimation using Knowledge Graphs and Large Language Models


154. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models


155. Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization


156. Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game


157. Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing


158. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning


159. SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting


160. Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models


161. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents


162. Learning and Enforcing Context-Sensitive Control for LLMs


163. DynamicsLLM: a Dynamic Analysis-based Tool for Generating Intelligent Execution Traces Using LLMs to Detect Android Behavioural Code Smells


164. Efficient Process Reward Modeling via Contrastive Mutual Information


165. Vibe-driven model-based engineering


166. Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment


167. MoEITS: A Green AI approach for simplifying MoE-LLMs


168. Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance


169. Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs


170. Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models


171. LLMs Should Incorporate Explicit Mechanisms for Human Empathy


172. IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs


173. Machine Learning-Based Detection of MCP Attacks


174. Towards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering


175. ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization


176. How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks


177. CodaRAG: Connecting the Dots with Associativity Inspired by Complementary Learning


178. Intent-aligned Formal Specification Synthesis via Traceable Refinement


179. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion


180. From Helpful to Trustworthy: LLM Agents for Pair Programming


181. FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data


182. Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis


183. Think in Sentences: Explicit Sentence Boundaries Enhance Language Model’s Capabilities


184. MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis


185. CircuitSynth: Reliable Synthetic Data Generation


186. ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models


187. CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models


188. LVSum: A Benchmark for Timestamp-Aware Long Video Summarization


189. Demographic and Linguistic Bias Evaluation in Omnimodal Language Models


190. Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit


191. Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction


192. Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems


193. Cross-Cultural Value Awareness in Large Vision-Language Models


194. The Rise and Fall of $G$ in AGI


195. From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping


196. Exploring Structural Complexity in Normative RAG with Graph-based approaches: A case study on the ETSI Standards


197. Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models


198. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies


199. Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models


200. GIANTS: Generative Insight Anticipation from Scientific Literature


201. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering


202. A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs


203. Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward


204. ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying


205. CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation


206. ExecTune: Effective Steering of Black-Box LLMs with Guide Models


207. LOLGORITHM: Funny Comment Generation Agent For Short Videos


208. ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge–Cloud Speculative LLM Serving


209. LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models


210. Evaluating Scene-based In-Situ Item Labeling for Immersive Conversational Recommendation


211. Assessing Privacy Preservation and Utility in Online Vision-Language Models


212. CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement


213. Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models


214. Decision-Theoretic Safety Assessment of Persona-Driven Multi-Agent Systems in O-RAN


215. NetAgentBench: A State-Centric Benchmark for Evaluating Agentic Network Configuration


216. A Comparative Theoretical Analysis of Entropy Control Methods in Reinforcement Learning


217. Human-like Working Memory Interference in Large Language Models


218. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems


219. Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model


220. Generating High Quality Synthetic Data for Dutch Medical Conversations


221. Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection


222. LLM Nepotism in Organizational Governance


223. Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal’s K-10 Curriculum


224. HearthNet: Edge Multi-Agent Orchestration for Smart Homes


225. Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference


226. Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows


227. ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios


228. Why Smaller Is Slower? Dimensional Misalignment in Compressed LLMs


229. Evaluating Visual Prompts with Eye-Tracking Data for MLLM-Based Human Activity Recognition


230. Generative UI: LLMs are Effective UI Generators


231. ACE-TA: An Agentic Teaching Assistant for Grounded Q&A, Quiz Generation, and Code Tutoring


232. Tuning Qwen2.5-VL to Improve Its Web Interaction Skills


233. LETGAMES: An LLM-Powered Gamified Approach to Cognitive Training for Patients with Cognitive Impairment


234. ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness


235. StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving


236. SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding


237. SRBench: A Comprehensive Benchmark for Sequential Recommendation with Large Language Models


238. MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval


239. SemaCDR: LLM-Powered Transferable Semantics for Cross-Domain Sequential Recommendation


240. Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation


241. Retrieval-Augmented Large Language Models for Evidence-Informed Guidance on Cannabidiol Use in Older Adults