Key LLM-Related Papers - 2026-04-21

1. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs


2. Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion


3. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning


4. LLM Safety From Within: Detecting Harmful Content with Internal Representations


5. Using large language models for embodied planning introduces systematic safety risks


6. Six Llamas: Comparative Religious Ethics Through LoRA-Adapted Language Models


7. Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes


8. Training and Agentic Inference Strategies for LLM-based Manim Animation Generation


9. PARM: Pipeline-Adapted Reward Model


10. Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support


11. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence


12. LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? Insights from Text-to-SQL


13. AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation


14. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning


15. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration


16. Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling


17. Architectural Design Decisions in AI Agent Harnesses


18. SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression


19. From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?


20. TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering


21. LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent


22. SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning


23. Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs


24. WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent


25. Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition


26. Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents


27. When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias


28. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks


29. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization


30. Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play


31. Poly-EPO: Training Exploratory Reasoning Models


32. KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models


33. Characterizing Model-Native Skills


34. Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier


35. SafeAgent: A Runtime Protection Architecture for Agentic Systems


36. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology


37. Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs


38. Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception


39. Language models recognize dropout and Gaussian noise applied to their activations


40. TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling


41. Compiling Deterministic Structure into SLM Harnesses


42. EvoMaster: A Foundational Agent Framework for Building Evolving Autonomous Scientific Agents at Scale


43. Phase-Scheduled Multi-Agent Systems for Token-Efficient Coordination


44. Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning


45. LLM-Guided Strategy Synthesis for Scalable Equality Saturation


46. Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling


47. SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization


48. AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning


49. Knows: Agent-Native Structured Research Representations


50. Efficient Test-Time Scaling via Temporal Reasoning Aggregation


51. LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics


52. Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys


53. Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition


54. Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration


55. If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data


56. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification


57. Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)


58. A phenotype-driven and evidence-governed framework for knowledge graph enrichment and hypotheses discovery in population data


59. MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models


60. AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction


61. LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies


62. Playing Psychic: Using Thought Trees to Predict Reasoning Models Accuracy on Coding Tasks


63. Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy


64. ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis


65. The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus


66. Skilldex: A Package Manager and Registry for Agent Skill Packages with Hierarchical Scope-Based Distribution


67. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models


68. GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning


69. Introspection Adapters: Training LLMs to Report Their Learned Behaviors


70. Machine individuality: Separating genuine idiosyncrasy from response bias in large language models


71. Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs


72. Don’t Start What You Can’t Finish: A Counterfactual Audit of Support-State Triage in LLM Agents


73. CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction


74. When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis


75. Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training


76. Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench


77. The Query Channel: Information-Theoretic Limits of Masking-Based Explanations


78. Agentic Risk-Aware Set-Based Engineering Design


79. From Subsumption to Satisfiability: LLM-Assisted Active Learning for OWL Ontologies


80. Healthcare AI for Automation or Allocation? A Transaction Cost Economics Framework


81. Semantic Consensus: Process-Aware Conflict Detection and Resolution for Enterprise Multi-Agent LLM Systems


82. Bounded Ratio Reinforcement Learning


83. When Can LLMs Learn to Reason with Weak Supervision?


84. Back into Plato’s Cave: Examining Cross-modal Representational Convergence at Scale


85. Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering


86. Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks


87. ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification


88. Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models


89. AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment


90. Dissecting AI Trading: Behavioral Finance and Market Bubbles


91. Multilingual Training and Evaluation Resources for Vision-Language Models


92. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations


93. DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion


94. Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement


95. Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies


96. WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models


97. Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs


98. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs


99. Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing


100. Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation


101. MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge


102. Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations


103. Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework


104. AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization


105. Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition


106. TLoRA: Task-aware Low Rank Adaptation of Large Language Models


107. The Collaboration Gap in Human-AI Work


108. Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation


109. ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks


110. First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows


111. Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation


112. RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs


113. Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?


114. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment


115. Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval


116. LoReC: Rethinking Large Language Models for Graph Data Analysis


117. LEPO: Latent Reasoning Policy Optimization for Large Language Models


118. Latent Preference Modeling for Cross-Session Personalized Tool Calling


119. Latent Abstraction for Retrieval-Augmented Generation


120. PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking


121. Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots


122. Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective


123. Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling


124. DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization


125. Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens


126. SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks


127. Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF


128. MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models


129. RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models


130. Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction


131. Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals


132. Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report


133. WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference


134. CAPO: Counterfactual Credit Assignment in Sequential Cooperative Teams


135. SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models


136. ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data


137. Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy


138. Provable Coordination for LLM Agents via Message Sequence Charts


139. Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories


140. PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation


141. OPSDL: On-Policy Self-Distillation for Long-Context Language Models


142. RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding


143. Generative AI Technologies, Techniques & Tensions: A Primer


144. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation


145. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning


146. Jupiter-N Technical Report


147. DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs


148. Speculative Decoding for Autoregressive Video Generation


149. When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models


150. ArgBench: Benchmarking LLMs on Computational Argumentation Tasks


151. PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations


152. Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions


153. Signal or Noise in Multi-Agent LLM-based Stock Recommendations?


154. SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention


155. Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA


156. A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions


157. RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation


158. Cat-DPO: Category-Adaptive Safety Alignment


159. Probabilistic Programs of Thought


160. Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair


161. Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models


162. HORIZON: A Benchmark for In-the-wild User Behaviour Modeling


163. Seeing Isn’t Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents


164. DORA Explorer: Improving the Exploration Ability of LLMs Without Training


165. HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads


166. Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM


167. Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation


168. Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability


169. DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation


170. Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy


171. RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design


172. CCCL: In-GPU Compression-Coupled Collective Communication


173. Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks



175. The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration


176. CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems


177. HiveMind: OS-Inspired Scheduling for Concurrent LLM Agent Workloads


178. Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation


179. Comparing Human and Large Language Model Interpretation of Implicit Information


180. Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL


181. RLM-on-KG: Heuristics First, LLMs When Needed: Adaptive Retrieval Control over Mention Graphs for Scattered Evidence


182. mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval


183. Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization


184. Where is the Mind? Persona Vectors and LLM Individuation


185. Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation


186. Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification


187. Bolzano: Case Studies in LLM-Assisted Mathematical Research


188. Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models


189. MEMRES: A Memory-Augmented Resolver with Confidence Cascade for Agentic Python Dependency Resolution


190. D-QRELO: Training- and Data-Free Delta Compression for Large Language Models via Quantization and Residual Low-Rank Approximation


191. PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations


192. ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design


193. Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation


194. Applications of deep generative models to DNA reaction kinetics and to cryogenic electron microscopy


195. The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation


196. SafeDream: Safety World Model for Proactive Early Jailbreak Detection


197. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering


198. Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games


199. Federation over Text: Insight Sharing for Multi-Agent Reasoning


200. StageMem: Lifecycle-Managed Memory for Language Models


201. The Reliance Negotiation Framework: A Dynamic Process Model of Student LLM Engagement in Academic Writing


202. Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines


203. Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis


204. No-Worse Context-Aware Decoding: Preventing Neutral Regression in Context-Conditioned Generation


205. Rewind-IL: Online Failure Detection and State Respawning for Imitation Learning


206. KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving


207. Cross-Modal Bayesian Low-Rank Adaptation for Uncertainty-Aware Multimodal Learning


208. AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation


209. Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning


210. Spotlights and Blindspots: Evaluating Machine-Generated Text Detection


211. Randomized Antipodal Search Done Right for Data Pareto Improvement of LLM Unlearning


212. Certified Program Synthesis with a Multi-Modal Verifier


213. POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving


214. Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models


215. SpecPylot: Python Specification Generation using Large Language Models



217. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion


218. A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty


219. Conjunctive Prompt Attacks in Multi-Agent LLM Systems


220. SCATR: Simple Calibrated Test-Time Ranking


221. Scaling Test-Time Compute for Agentic Coding


222. Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF


223. CAMP: Cumulative Agentic Masking and Pruning for Privacy Protection in Multi-Turn LLM Conversations


224. NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions


225. Spike-driven Large Language Model


226. Training Language Models for Bilateral Trade with Private Information


227. B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents


228. From Inheritance to Saturation: Disentangling the Evolution of Visual Redundancy for Architecture-Aware MLLM Inference Acceleration


229. Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo


230. iPhoneme: Brain-to-Text Communication for ALS Using ConformerXL Decoding


231. HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders


232. Shifting the Gradient: Understanding How Defensive Training Methods Protect Language Model Integrity


233. Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG


234. Measuring Representation Robustness in Large Language Models for Geometry


235. Breaking Validity-Induced Boundaries to Expand Algorithm Search Space: A Two-Stage AST-Based Operator for LLM-Driven Automated Heuristic Evolution


236. What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling


237. GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning


238. CoLLM: A Unified Framework for Co-execution of LLMs Federated Fine-tuning and Inference


239. IACDM: Interactive Adversarial Convergence Development Methodology – A Structured Framework for AI-Assisted Software Development


240. A Framework for Human-AI Q-Matrix Refinement: A NeuralCDM Evaluation


241. Instructor-Created Custom GPTs as Pedagogical Partners Fostering Immersion in Online Higher Education: Two Case Studies


242. Stream2LLM: Overlap Context Streaming and Prefill for Reduced TTFT



244. RoMathExam: A Longitudinal Dataset of Romanian Math Exams (1895-2025) with a Seven-Decade Core (1957-2025)


245. Large language models for post-publication research evaluation: Evidence from expert recommendations and citation indicators


246. StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability


247. Same Verdict, Different Reasons: LLM-as-a-Judge and Clinician Disagreement on Medical Chatbot Completeness


248. Brain-CLIPLM: Decoding Compressed Semantic Representations in EEG for Language Reconstruction


249. Talk, Walk, and Market Response: Multimodal Measurement of AI Washing and Its Capital Market Consequences in China


250. Clinical Note Bloat Reduction for Efficient LLM Use