Key LLM Papers - 2026-03-23

1. Learning Dynamic Belief Graphs for Theory-of-mind Reasoning


2. Pitfalls in Evaluating Interpretability Agents


3. Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs


4. Utility-Guided Agent Orchestration for Efficient LLM Tool Use


5. FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization


6. Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification


7. A Subgoal-driven Framework for Improving Long-Horizon LLM Agents


8. HyEvo: Self-Evolving Hybrid Agentic Workflows for Efficient Reasoning


9. PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management


10. ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models


11. Learning to Disprove: Formal Counterexample Generation with Large Language Models


12. Teaching an Agent to Sketch One Part at a Time


13. LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation


14. Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning


15. Adaptive Greedy Frame Selection for Long Video Understanding


16. AI Agents Can Already Autonomously Perform Experimental High Energy Physics


17. Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation


18. The Robot’s Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning


19. Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models


20. Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models


21. The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus


22. An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models


23. LLM-Enhanced Semantic Data Integration of Electronic Component Qualifications in the Aerospace Domain


24. Agentic Harness for Real-World Compilers


25. LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families



27. Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States


28. Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance


29. HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction


30. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time


31. Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue


32. Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?


33. GoAgent: Group-of-Agents Communication Topology Generation for LLM-based Multi-Agent Systems


34. PolicySim: An LLM-Based Agent Social Simulation Sandbox for Proactive Policy Optimization


35. CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation


36. FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement


37. Skilled AI Agents for Embedded and IoT Systems Development


38. Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition


39. FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment


40. dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3


41. Inducing Sustained Creativity and Diversity in Large Language Models


42. Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis


43. Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL


44. A Framework for Formalizing LLM Agent Security


45. LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray


46. Vocabulary shapes cross-lingual variation of word-order learnability in language models


47. Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure


48. The Autonomy Tax: Defense Training Breaks LLM Agents


49. Scalable Prompt Routing via Fine-Grained Latent Task Discovery


50. POET: Power-Oriented Evolutionary Tuning for LLM-Based RTL PPA Optimization


51. Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification


52. Memory-Driven Role-Playing: Evaluation and Enhancement of Persona Knowledge Utilization in LLMs


53. MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels


54. Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology


55. Parameter-Efficient Token Embedding Editing for Clinical Class-Level Unlearning


56. Maximizing mutual information between user-contexts and responses improve LLM personalization with no additional data


57. LLM-MRD: LLM-Guided Multi-View Reasoning Distillation for Fake News Detection


58. Speculating Experts Accelerates Inference for Mixture-of-Experts


59. Generalized Stock Price Prediction for Multiple Stocks Combined with News Fusion


60. CDEoH: Category-Driven Automatic Algorithm Design With Large Language Models


61. Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis


62. URAG: A Benchmark for Uncertainty Quantification in Retrieval-Augmented Large Language Models


63. From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring


64. HypeLoRA: Hyper-Network-Generated LoRA Adapters for Calibrated Language Model Fine-Tuning


65. From Flat to Structural: Enhancing Automated Short Answer Grading with GraphRAG


66. Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models


67. CURE: A Multimodal Benchmark for Clinical Understanding and Retrieval Evaluation


68. LSR: Linguistic Safety Robustness Benchmark for Low-Resource West African Languages


69. Transformers are Stateless Differentiable Neural Computers


70. A Human-Centered Workflow for Using Large Language Models in Content Analysis


71. Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization


72. Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion


73. When the Pure Reasoner Meets the Impossible Object: Analytic vs. Synthetic Fine-Tuning and the Suppression of Genesis in Language Models


74. Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation


75. The α-Law of Observable Belief Revision in Large Language Model Inference


76. MAPLE: Metadata Augmented Private Language Evolution


77. LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models


78. A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2


79. GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams


80. When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models


81. L-PRISMA: An Extension of PRISMA in the Era of Generative Artificial Intelligence (GenAI)