Major LLM-Related Papers - 2026-03-25

1. Bilevel Autoresearch: Meta-Autoresearching Itself


2. Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies


3. RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue


4. LLM Olympiad: Why Model Evaluation Needs a Sealed Exam


5. MemCollab: Cross-Agent Memory Collaboration via Contrastive Trajectory Distillation


6. PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments


7. Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models


8. Between Rules and Reality: On the Context Sensitivity of LLM Moral Judgment


9. MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models


10. Can Large Language Models Reason and Optimize Under Constraints?


11. JFTA-Bench: Evaluate LLM’s Ability of Tracking and Analyzing Malfunctions Using Fault Trees


12. PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference


13. Optimizing Small Language Models for NL2SQL via Chain-of-Thought Fine-Tuning


14. Ran Score: a LLM-based Evaluation Score for Radiology Report Generation


15. ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning


16. Separating Diagnosis from Control: Auditable Policy Adaptation in Agent-Based Simulations with LLM-Based Diagnostics


17. Dynamical Systems Theory Behind a Hierarchical Reasoning Model


18. Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories


19. Improving Safety Alignment via Balanced Direct Preference Optimization


20. AgriPestDatabase-v1.0: A Structured Insect Dataset for Training Agricultural Large Language Model


21. Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases


22. Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks


23. Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies


24. Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length


25. From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents


26. MedObvious: Exposing the Medical Moravec’s Paradox in VLMs via Clinical Triage


27. VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions


28. Failure of contextual invariance in gender inference with large language models


29. ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains


30. 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding


31. Evaluating LLM-Based Test Generation Under Software Evolution


32. SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling


33. Leveraging LLMs and Social Media to Understand User Perception of Smartphone-Based Earthquake Early Warnings


34. Curriculum-Driven 3D CT Report Generation via Language-Free Visual Grafting and Zone-Constrained Compression


35. Designing Agentic AI-Based Screening for Portfolio Investment


36. Emergence of Fragility in LLM-based Social Networks: the Case of Moltbook


37. A Multimodal Framework for Human-Multi-Agent Interaction


38. Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs


39. SafeSeek: Universal Attribution of Safety Circuits in Language Models


40. ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment


41. Reasoning over Semantic IDs Enhances Generative Recommendation


42. Robust Safety Monitoring of Language Models via Activation Watermarking


43. Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy


44. Can an LLM Detect Instances of Microservice Infrastructure Patterns?


45. DBAutoDoc: Automated Discovery and Documentation of Undocumented Database Schemas via Statistical Analysis and Iterative LLM Refinement


46. Set-Valued Prediction for Large Language Models with Feasibility-Aware Coverage Guarantees


47. EVA: Efficient Reinforcement Learning for End-to-End Video Agent


48. ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling


49. Agent-Sentry: Bounding LLM Agents via Execution Provenance


50. Agent Audit: A Security Analysis System for LLM Agent Applications


51. When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning


52. Focus, Don’t Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding


53. KARMA: Knowledge-Action Regularized Multimodal Alignment for Personalized Search at Taobao



55. KALAVAI: Predicting When Independent Specialist Fusion Works – A Quantitative Model for Post-Hoc Cooperative LLM Training


56. PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset


57. WiFi2Cap: Semantic Action Captioning from Wi-Fi CSI via Limb-Level Semantic Alignment


58. Generalizing Dynamics Modeling More Easily from Representation Perspective


59. AwesomeLit: Towards Hypothesis Generation with Agent-Supported Literature Research


60. LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation


61. To Agree or To Be Right? The Grounding-Sycophancy Tradeoff in Medical Vision-Language Models


62. Language Models Can Explain Visual Features via Steering


63. Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?


64. STRIATUM-CTF: A Protocol-Driven Agentic Framework for General-Purpose CTF Solving


65. Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos


66. GraphRAG for Engineering Diagrams: ChatP&ID Enables LLM Interaction with P&IDs


67. LLMON: An LLM-native Markup Language to Leverage Structure and Semantics at the LLM Interface


68. Do Large Language Models Reduce Research Novelty? Evidence from Information Systems Journals


69. Tiny Inference-Time Scaling with Latent Verifiers


70. Cognitive Training for Language Models: Towards General Capabilities via Cross-Entropy Games


71. Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures


72. LLM-guided headline rewriting for clickability enhancement without clickbait


73. Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs


74. CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation


75. AI Co-Scientist for Ranking: Discovering Novel Search Ranking Models alongside LLM-based AI Agents with Cloud Computing Access


76. FAAR: Format-Aware Adaptive Rounding for NVFP4


77. When Visuals Aren’t the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations


78. Reasoner-Executor-Synthesizer: Scalable Agentic Architecture with Static O(1) Context Window


79. Early Discoveries of Algorithmist I: Promise of Provable Algorithm Synthesis at Scale


80. WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement



82. Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation


83. Graph Signal Processing Meets Mamba2: Adaptive Filter Bank via Delta Modulation


84. Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms


85. Trained Persistent Memory for Frozen Decoder-Only LLMs


86. AgentSLR: Automating Systematic Literature Reviews in Epidemiology with Agentic AI


87. DAQ: Delta-Aware Quantization for Post-Training LLM Weight Compression


88. From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs



90. Sample Transform Cost-Based Training-Free Hallucination Detector for Large Language Models


91. Latent Semantic Manifolds in Large Language Models


92. Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores


93. Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs


94. Efficient Embedding-based Synthetic Data Generation for Complex Reasoning Tasks


95. TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs


96. MERIT: Memory-Enhanced Retrieval for Interpretable Knowledge Tracing


97. Evaluating Prompting Strategies for Chart Question Answering with Large Language Models


98. Founder effects shape the evolutionary dynamics of multimodality in open LLM families


99. Automated Microservice Pattern Instance Detection Using Infrastructure-as-Code Artifacts and Large Language Models