Key LLM-Related Papers - 2026-03-04

1. Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals


2. Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals


3. AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework


4. No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models


5. Agentic AI-based Coverage Closure for Formal Verification


6. Beyond Task Completion: Revealing Corrupt Success in LLM Agents through Procedure-Aware Evaluation


7. Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation


8. RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization


9. REGAL: A Registry-Driven Architecture for Deterministic Grounding of Agentic AI in Enterprise Telemetry


10. OrchMAS: Orchestrated Reasoning with Multi Collaborative Heterogeneous Scientific Expert Structured Agents


11. SpatialText: A Pure-Text Cognitive Benchmark for Spatial Understanding in Large Language Models


12. Architecting Trust in Artificial Epistemic Agents


13. ShipTraj-R1: Reinforcing Ship Trajectory Prediction in Large Language Models via Group Relative Policy Optimization


14. SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training


15. LLM-based Argument Mining meets Argumentation and Description Logics: a Unified Framework for Reasoning about Debates


16. Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification


17. Rethinking Code Similarity for Automated Algorithm Design with LLMs


18. A Natural Language Agentic Approach to Study Affective Polarization


19. FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing


20. LLMs for High-Frequency Decision-Making: Normalized Action Reward-Guided Consistency Policy Optimization


21. SorryDB: Can AI Provers Complete Real-World Lean Theorems?


22. See and Remember: A Multimodal Agent for Web Traversal


23. SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving


24. LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges


25. AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion Regeneration for Safety-Critical Scenario Generation


26. A Neuropsychologically Grounded Evaluation of LLM Cognitive Abilities


27. LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model


28. NeuroProlog: Multi-Task Fine-Tuning for Neurosymbolic Mathematical Reasoning via the Cocktail Effect


29. Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory


30. VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings


31. SuperLocalMemory: Privacy-Preserving Multi-Agent Memory with Bayesian Trust Defense Against Memory Poisoning


32. Engineering Reasoning and Instruction (ERI) Benchmark: A Large Taxonomy-driven Dataset for Foundation Models and Agents


33. Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping


34. UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?


35. Understanding and Mitigating Dataset Corruption in LLM Steering


36. APRES: An Agentic Paper Revision and Evaluation System


37. Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection


38. TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health


39. Why Does RLAIF Work At All?


40. Contextualized Privacy Defense for LLM Agents


41. SEALing the Gap: A Reference Framework for LLM Inference Carbon Estimation via Multi-Benchmark Driven Embodiment


42. Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models


43. Eliciting Numerical Predictive Distributions of LLMs Without Autoregression


44. CoFL: Continuous Flow Fields for Language-Conditioned Navigation


45. Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs


46. OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets


47. Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration


48. iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding


49. Sensory-Aware Sequential Recommendation via Review-Distilled Representations


50. ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs


51. Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches


52. AlphaFree: Recommendation Free from Users, IDs, and GNNs


53. Robust Heterogeneous Analog-Digital Computing for Mixture-of-Experts Models with Theoretical Generalization Guarantees


54. MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks


55. GPUTOK: GPU Accelerated Byte Level BPE Tokenization


56. How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities


57. CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment


58. Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs


59. CoDAR: Continuous Diffusion Language Models are More Powerful Than You Think


60. Human-Certified Module Repositories for the AI Age


61. Slurry-as-a-Service: A Modest Proposal on Scalable Pluralistic Alignment for Nutrient Optimization


62. PlayWrite: A Multimodal System for AI Supported Narrative Co-Authoring Through Play in XR


63. RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection


64. ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense


65. Quantifying Frontier LLM Capabilities for Container Sandbox Escape


66. Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response


67. When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning


68. Silent Sabotage During Fine-Tuning: Few-Shot Rationale Poisoning of Compact Medical LLMs


69. Concept Heterogeneity-aware Representation Steering


70. CUDABench: Benchmarking LLMs for Text-to-CUDA Generation


71. Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback


72. Neural Paging: Learning Context Management Policies for Turing-Complete Agents


73. MedCalc-Bench Doesn’t Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation


74. MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction


75. NExT-Guard: Training-Free Streaming Safeguard without Token-Level Labels


76. Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain


77. ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue


78. ParamΔ for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost