Key LLM Papers - 2026-05-08

1. GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation


2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key


3. MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems


4. SkillOS: Learning Skill Curation for Self-Evolving Agents


5. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research


6. Process Matters more than Output for Distinguishing Humans from Machines


7. Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors


8. ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning


9. Patch-Effect Graph Kernels for LLM Interpretability


10. Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems


11. PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors


12. SCRuB: Social Concept Reasoning under Rubric-Based Evaluation



14. Rethinking Vacuity for OOD Detection in Evidential Deep Learning


15. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work


16. A Regime Theory of Controller Class Selection for LLM Action Decisions


17. Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs


18. Data Language Models: A New Foundation Model Class for Tabular Data


19. Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization


20. Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models


21. Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric


22. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models


23. Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction


24. Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios


25. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost


26. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges


27. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning


28. Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs


29. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning


30. CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs


31. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility


32. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?


33. Visual Fingerprints for LLM Generation Comparison


34. Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning


35. Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals


36. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering


37. TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning


38. ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models


39. Wisteria: A Unified Multi-Scale Feature Learning Framework for DNA Language Model


40. Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning


41. Taklif.AI: LLM-Powered Platform for Interest-Based Personalized College Assignments


42. On the Role of Language Representations in Auto-Bidding: Findings and Implications


43. AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD


44. CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt


45. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory


46. ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning


47. Knee Osteoarthritis Severity Grading Using Optimized Deep Learning and LLM-Driven Intelligent AI on Computationally Limited Systems


48. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents


49. Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers


50. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding


51. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes


52. Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs


53. Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine


54. Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents


55. Inference-Time Budget Control for LLM Search Agents


56. Saliency-Aware Regularized Quantization Calibration for Large Language Models


57. DataDignity: Training Data Attribution for Large Language Models


58. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination


59. Large Vision-Language Models Get Lost in Attention


60. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation


61. Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG


62. Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development


63. Causal Probing for Internal Visual Representations in Multimodal Large Language Models


64. Belief Memory: Agent Memory Under Partial Observability


65. AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading


66. Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration


67. SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs


68. AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases


69. FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis


70. FinRAG-12B: A Production-Validated Recipe for Grounded Question Answering in Banking


71. LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks


72. Agentic Discovery of Exchange-Correlation Density Functionals


73. The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias


74. From History to State: Constant-Context Skill Learning for LLM Agents


75. LaTA: A Drop-in, FERPA-Compliant Local-LLM Autograder for Upper-Division STEM Coursework


76. Agentic Retrieval-Augmented Generation for Financial Document Question Answering


77. PRISM: Perception Reasoning Interleaved for Sequential Decision Making


78. When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models


79. BALAR: A Bayesian Agentic Loop for Active Reasoning


80. Understanding Annotator Safety Policy with Interpretability


81. Verifier-Backed Hard Problem Generation for Mathematical Reasoning


82. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less


83. When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels


84. Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval


85. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction


86. The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity


87. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents


88. UniSD: Towards a Unified Self-Distillation Framework for Large Language Models


89. Continuous Latent Diffusion Language Model


90. Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models


91. PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization


92. Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks


93. Constraint Decay: The Fragility of LLM Agents in Backend Code Generation


94. Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades


95. Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Systems Perspective


96. CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models


97. Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis


98. Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity


99. Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs


100. Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization


101. Attributions All the Way Down? The Metagame of Interpretability


102. Log-Likelihood, Simpson’s Paradox, and the Detection of Machine-Generated Text


103. Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions


104. Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs


105. TIDE: Every Layer Knows the Token Beneath the Context


106. Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation


107. IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences


108. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex


109. Normalized Architectures are Natively 4-Bit


110. Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark


111. Optimal Transport for LLM Reward Modeling from Noisy Preference


112. Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters


113. Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks


114. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts


115. Towards Reliable LLM Evaluation: Correcting the Winner’s Curse in Adaptive Benchmarking


116. Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR


117. Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits


118. LLM-Driven Design Space Exploration of FPGA-based Accelerators


119. Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters


120. CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency


121. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding


122. LoopTrap: Termination Poisoning Attacks on LLM Agents


123. LeakDojo: Decoding the Leakage Threats of RAG Systems


124. LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution


125. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio


126. Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback


127. CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning


128. SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety


129. Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems


130. An Empirical Study of Proactive Coding Assistants in Real-World Software Development


131. Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving


132. Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning


133. XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity


134. One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue


135. Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping


136. When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models


137. AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification


138. ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis


139. Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs


140. SLAM: Structural Linguistic Activation Marking for Language Models


141. Information Theoretic Adversarial Training of Large Language Models


142. Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery


143. Feature Starvation as Geometric Instability in Sparse Autoencoders


144. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study


145. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code


146. Decision-aware User Simulation Agent for Evaluating Conversational Recommender Systems


147. Towards Dependable Retrieval-Augmented Generation Using Factual Confidence Prediction


148. Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods


149. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference


150. MidSteer: Optimal Affine Framework for Steering Generative Models


151. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving


152. A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective