Key LLM Papers - 2026-03-05

1. A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development


2. Agentics 2.0: Logical Transduction Algebra for Agentic Data Workflows


3. Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions


4. BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning


5. Generative AI in Managerial Decision-Making: Redefining Boundaries through Ambiguity Resolution and Sycophancy Analysis


6. In-Context Environments Induce Evaluation-Awareness in Language Models


7. Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism


8. AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation


9. AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment


10. MAGE: Meta-Reinforcement Learning for Language Agents toward Strategic Exploration and Exploitation


11. Mozi: Governed Autonomy for Drug Discovery LLM Agents


12. Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants


13. Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization


14. Efficient Refusal Ablation in LLM through Optimal Transport


15. SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints


16. World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings


17. LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance


18. VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments


19. Causality Elicitation from Large Language Models


20. When AI Fails, What Works? A Data-Driven Taxonomy of Real-World AI Risk Mitigation Strategies


21. FeedAIde: Guiding App Users to Submit Rich Feedback Reports by Asking Context-Aware Follow-Up Questions


22. PRAM-R: A Perception-Reasoning-Action-Memory Framework with LLM-Guided Modality Routing for Adaptive Autonomous Driving


23. CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts


24. CodeTaste: Can LLMs Generate Human-Level Code Refactorings?


25. Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model


26. Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization


27. Crab+: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation


28. Monitoring Emergent Reward Hacking During Generation via Internal Activations


29. Inference-Time Toxicity Mitigation in Protein Language Models


30. A Multi-Dimensional Quality Scoring Framework for Decentralized LLM Inference with Proof of Quality


31. Discriminative Perception via Anchored Description for Reasoning Segmentation


32. When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models


33. Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects


34. IROSA: Interactive Robot Skill Adaptation using Natural Language


35. CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents


36. On the Suitability of LLM-Driven Agents for Dark Pattern Audits


37. SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration


38. T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning


39. MACC: Multi-Agent Collaborative Competition for Scientific Exploration


40. Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning


41. PROSPECT: Unified Streaming Vision-Language Navigation via Semantic–Spatial Fusion and Latent Predictive Representation


42. Understanding Parents’ Desires in Moderating Children’s Interactions with GenAI Chatbots through LLM-Generated Probes


43. Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning


44. EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs


45. MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation


46. Bridging Pedagogy and Play: Introducing a Language Mapping Interface for Human-AI Co-Creation in Educational Game Design


47. Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions


48. Goal-Driven Risk Assessment for LLM-Powered Systems: A Healthcare Case Study


49. Social Norm Reasoning in Multimodal Language Models: An Evaluation


50. Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility


51. Molt Dynamics: Emergent Social Phenomena in Autonomous AI Agent Populations


52. Tucano 2 Cool: Better Open Source LLMs for Portuguese


53. RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering


54. SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems


55. Test-Time Meta-Adaptation with Self-Synthesis


56. MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery


57. Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi


58. PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation


59. Parallel Test-Time Scaling with Multi-Sequence Verifiers


60. Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs


61. On Google’s SynthID-Text LLM Watermarking System: Theoretical Analysis and Empirical Validation


62. MemSifter: Offloading LLM Memory Retrieval via Outcome-Driven Proxy Reasoning


63. AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis


64. Sleeper Cell: Injecting Latent Malice Temporal Backdoors into Tool-Using LLMs


65. Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations


66. PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning


67. Certainty robustness: Evaluating LLM stability under self-challenging prompts


68. AutoHarness: improving LLM agents by automatically synthesizing a code harness


69. StructLens: A Structural Lens for Language Models via Maximum Spanning Trees


70. Controllable and explainable personality sliders for LLMs at inference time


71. IntPro: A Proxy Agent for Context-Aware Intent Understanding via Retrieval-conditioned Inference


72. Controlling Chat Style in Language Models via Single-Direction Editing


73. Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement


74. Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery


75. DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following


76. From We to Me: Theory Informed Narrative Shift with Abductive Reasoning


77. Automated Concept Discovery for LLM-as-a-Judge Preference Analysis


78. Quantum-Inspired Self-Attention in a Large Language Model


79. M-QUEST – Meme Question-Understanding Evaluation on Semantics and Toxicity


80. Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO


81. How does fine-tuning improve sensorimotor representations in large language models?


82. Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding


83. Old Habits Die Hard: How Conversational History Geometrically Traps LLMs


84. Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation


85. Draft-Conditioned Constrained Decoding for Structured Generation in LLMs


86. HumanLM: Simulating Users with State Alignment Beats Response Imitation


87. Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs


88. From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings


89. TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation


90. TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement


91. PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents


92. Language Model Goal Selection Differs from Humans’ in an Open-Ended Task


93. Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory


94. From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG


95. One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models


96. AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents