Key LLM-Related Papers - 2026-04-28

1. Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters


2. The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications


3. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation


4. Evaluating whether AI models would sabotage AI safety research


5. A systematic evaluation of vision-language models for observational astronomical reasoning tasks


6. Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations


7. STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator


8. Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols


9. Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus


10. PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model


11. Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs


12. Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU


13. Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop


14. An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress


15. Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer


16. A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees


17. Representational Curvature Modulates Behavioral Uncertainty in Large Language Models


18. LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People


19. Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs


20. LLM-Augmented Traffic Signal Control with LSTM-Based Traffic State Prediction and Safety-Constrained Decision Support


21. ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems


22. ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation


23. Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features


24. FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment



26. Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work


27. Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning


28. Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate


29. When AI reviews science: Can we trust the referee?


30. MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation


31. Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines


32. Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models


33. IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance


34. When Corrective Hints Hurt: Prompt Design in Reasoner-Guided Repair of LLM Overcaution on Entailed Negations under OWL 2 DL


35. GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs


36. LEGO: An LLM Skill-Based Front-End Design Generation Platform


37. CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning


38. Discovering Agentic Safety Specifications from 1-Bit Danger Signals


39. From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents


40. Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines


41. PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks


42. Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach


43. Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis


44. Don’t Make the LLM Read the Graph: Make the Graph Think


45. A Systematic Approach for Large Language Models Debugging


46. FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean


47. PExA: Parallel Exploration Agent for Complex Text-to-SQL


48. An Intelligent Fault Diagnosis Method for General Aviation Aircraft Based on Multi-Fidelity Digital Twin and FMEA Knowledge Enhancement


49. Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis


50. Green Shielding: A User-Centric Approach Towards Trustworthy AI


51. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study


52. Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation


53. AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents


54. DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference


55. K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology


56. Less Is More: Engineering Challenges of On-Device Small Language Model Integration in a Mobile Application


57. Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models


58. Skill Retrieval Augmentation for Agentic AI


59. Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models


60. Understanding the Limits of Automated Evaluation for Code Review Bots in Practice


61. Why AI Harms Can’t Be Fixed One Identity at a Time: What 5300 Incident Reports Reveal About Intersectionality


62. GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems


63. Measuring Successful Cooperation in Human-AI Teamwork: Development and Validation of the Perceived Cooperativity and Teaming Perception Scales


64. Kwai Summary Attention Technical Report


65. Scaling Properties of Continuous Diffusion Spoken Language Models


66. All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation


67. Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation


68. SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution


69. DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models


70. SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters


71. See Further, Think Deeper: Advancing VLM’s Reasoning Ability with Low-level Visual Cues and Reflection


72. MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation


73. RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation


74. Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing


75. Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis


76. MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning


77. MemeScouts@LT-EDI 2026: Asking the Right Questions – Prompted Weak Supervision for Meme Hate Speech Detection


78. Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment


79. AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models


80. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing


81. Strategic Bidding in 6G Spectrum Auctions with Large Language Models


82. Latency and Cost of Multi-Agent Intelligent Tutoring at Scale


83. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training


84. Jailbreaking Frontier Foundation Models Through Intention Deception


85. The Pragmatic Persona: Discovering LLM Persona through Bridging Inference


86. QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering


87. AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents


88. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills


89. IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models


90. EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce


91. Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction


92. Hindsight Preference Optimization for Financial Time Series Advisory


93. Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity


94. KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters


95. What Did They Mean? How LLMs Resolve Ambiguous Social Situations across Perspectives and Roles


96. Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery


97. SMSI: System Model Security Inference: Automated Threat Modeling for Cyber-Physical Systems


98. Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities


99. Evaluation of Prompt Injection Defenses in Large Language Models


100. Inverting Foundation Models of Brain Function with Simulation-Based Inference


101. Graph Memory Transformer (GMT)


102. Exploring Audio Hallucination in Egocentric Video Understanding


103. S2G-RAG: Structured Sufficiency and Gap Judging for Iterative Retrieval-Augmented QA


104. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation


105. SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning


106. Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference


107. AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models


108. Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge


109. PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement


110. LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation


111. CyberCane: Neuro-Symbolic RAG for Privacy-Preserving Phishing Detection with Formal Ontology Reasoning


112. DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making


113. Pref-CTRL: Preference Driven LLM Alignment using Representation Editing


114. MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings


115. Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong


116. Uncertainty Propagation in LLM-Based Systems


117. Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference


118. Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs


119. AI Safety Training Can be Clinically Harmful


120. Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models


121. PushupBench: Your VLM is not good at counting pushups


122. An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code


123. EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs


124. Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards


125. EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence


126. $\mathcal{S}^2$IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction


127. Au-M-ol: A Unified Model for Medical Audio and Language Understanding


128. From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors


129. Lightweight and Production-Ready PDF Visual Element Parsing


130. Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt


131. Knowledge Lever Risk Management for Software Engineering: A Stochastic Framework for Mitigating Knowledge Loss


132. Scalable LLM-based Coding of Dialogue in Healthcare Simulation: Balancing Coding Performance, Processing Time, and Environmental Impact


133. AI-Assisted Code Review as a Scaffold for Code Quality and Self-Regulated Learning: An Experience Report


134. AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval


135. Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns


136. UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks


137. Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings


138. MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration


139. ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation


140. Mixture of Heterogeneous Grouped Experts for Language Modeling


141. No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows


142. Code Broker: A Multi-Agent System for Automated Code Quality Assessment


143. From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models


144. C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs


145. DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining


146. AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI


147. CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging


148. Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline


149. Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge


150. Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs


151. Quantifying and Mitigating Self-Preference Bias of LLM Judges


152. RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents


153. Can Multimodal Large Language Models Truly Understand Small Objects?


154. SketchVLM: Vision language models can annotate images to explain thoughts and guide users


155. AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models


156. IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review


157. SwarmDrive: Semantic V2V Coordination for Latency-Constrained Cooperative Autonomous Driving


158. Structure Guided Retrieval-Augmented Generation for Factual Queries


159. PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging


160. DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models


161. Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows


162. See No Evil: Semantic Context-Aware Privacy Risk Detection for AR


163. RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support


164. Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation


165. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing


166. KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning


167. Epicure: Multidimensional Flavor Structure in Food Ingredient Embeddings


168. When VLMs ‘Fix’ Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR


169. The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions


170. Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning


171. Artificial General Intelligence Forecasting and Scenario Analysis: State of the Field, Methodological Gaps, and Strategic Implications


172. Implicit Humanization in Everyday LLM Moral Judgments


173. Behavioral Intelligence Platforms: From Event Streams to Autonomous Insight via Probabilistic Journey Graphs, Behavioral Knowledge Extraction, and Grounded Language Generation


174. Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking


175. RedParrot: Accelerating NL-to-DSL for Business Analytics via Query Semantic Caching


176. Your Reviews Replicate You: LLM-Based Agents as Customer Digital Twins for Conjoint Analysis


177. RADIANT-LLM: an Agentic Retrieval Augmented Generation Framework for Reliable Decision Support in Safety-Critical Nuclear Engineering


178. How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks