Major LLM-Related Papers - 2026-05-21

1. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation


2. PALS: Power-Aware LLM Serving for Mixture-of-Experts Models


3. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents


4. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions


5. Governance by Construction for Generalist Agents


6. PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models


7. VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals


8. Declarative Data Services: Structured Agentic Discovery for Composing Data Systems


9. Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines


10. AgentAtlas: Beyond Outcome Leaderboards for LLM Agents


11. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind


12. Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration


13. SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation


14. Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate


15. WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata


16. Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling


17. Mem-$π$: Adaptive Memory through Learning When and What to Generate


18. TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos


19. Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries


20. Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment


21. SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence


22. TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization


23. Frontier: Towards Comprehensive and Accurate LLM Inference Simulation


24. Tracing the ongoing emergence of human-like reasoning in Large Language Models


25. TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs – A Case Study in Mental Health


26. MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset


27. How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR


28. APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents


29. PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment


30. Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models


31. ACL-Verbatim: hallucination-free question answering for research


32. Fine-grained Claim-level RAG Benchmark for Law


33. Grounding Driving VLA via Inverse Kinematics


34. Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs


35. Towards Context-Invariant Safety Alignment for Large Language Models


36. Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy


37. Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models


38. DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU


39. Strategy-Induct: Task-Level Strategy Induction for Instruction Generation


40. Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows


41. Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures


42. Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models


43. GenAI-Driven Threat Detection with Microsoft Security Copilot


44. Terminal-World: Scaling Terminal-Agent Environments via Agent Skills


45. Runtime-Certified Bounded-Error Quantized Attention


46. Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards


47. ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models


48. GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval


49. Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers


50. The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?


51. Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale


52. Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression


53. An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress


54. Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning


55. AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback


56. SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR


57. Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU


58. Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms


59. Interpretable Discriminative Text Representations via Agreement and Label Disentanglement


60. DIVE: Embedding Compression via Self-Limiting Gradient Updates


61. REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak


62. AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals


63. Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs


64. Self-Training Doesn’t Flatten Language – It Restructures It: Surface Markers Amplify While Deep Syntax Dies


65. NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding


66. Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models


67. Codec-Robust Attacks on Audio LLMs


68. Code Generation by Differential Test Time Scaling


69. LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series


70. Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs


71. Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor


72. Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs


73. DEL: Digit Entropy Loss for Numerical Learning of Large Language Models


74. Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System


75. FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation


76. Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining


77. Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization


78. Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers


79. Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages


80. Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis


81. Modality-Decoupled Online Recursive Editing


82. Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs


83. Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding


84. It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs


85. Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting


86. ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents


87. Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization


88. CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning


89. GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents



91. LEAP: A closed-loop framework for perovskite precursor additive discovery


92. Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry


93. AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education


94. Leveraging Vision-Language Models to Detect Attention in Educational Videos


95. PrivacyAkinator: Articulating Key Privacy Design Decisions by Answering LLM-Generated Multiple-choice Questions


96. RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation


97. GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety


98. Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models


99. Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning


100. FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation


101. Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues


102. Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction


103. Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification


104. Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning