Major LLM-Related Papers - 2026-02-25

1. A Benchmark for Deep Information Synthesis


2. Tool Building as a Path to “Superintelligence”


3. LogicGraph: Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification


4. Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence


5. HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG


6. Predicting Sentence Acceptability Judgments in Multimodal Contexts


7. Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs


8. Pressure Reveals Character: Behavioural Alignment Evaluation at Depth


9. Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset


10. Pipeline for Verifying LLM-Generated Mathematical Solutions


11. CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference


12. Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback


13. Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning


14. Counterfactual Simulation Training for Chain-of-Thought Faithfulness


15. ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction


16. PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding


17. How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective


18. Recursive Belief Vision Language Model


19. Grounding LLMs in Scientific Discovery via Embodied Actions


20. Physics-based phenomenological characterization of cross-modal bias in multimodal models


21. CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation


22. From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production


23. Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination


24. ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory


25. PreScience: A Benchmark for Forecasting Scientific Contributions


26. Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use


27. Implicit Intelligence – Evaluating Agents on What Users Don’t Say


28. DMCD: Semantic-Statistical Framework for Causal Discovery


29. An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models


30. Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training


31. XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence


32. SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery


33. “Are You Sure?”: An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems


34. VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation


35. CrystaL: Spontaneous Emergence of Visual Latents in MLLMs


36. The Art of Efficient Reasoning: Data, Reward, and Optimization


37. SoK: Agentic Skills – Beyond Tool Use in LLM Agents


38. AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs


39. PRECTR-V2: Unified Relevance-CTR Framework with Cross-User Preference Mining, Exposure Bias Correction, and LLM-Distilled Encoder Optimization


40. CAMEL: Confidence-Gated Reflection for Reward Modeling


41. Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video


42. OptiLeak: Efficient Prompt Reconstruction via Reinforcement Learning in Multi-tenant LLM Services


43. Personal Information Parroting in Language Models


44. Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training


45. How Do Inpainting Artifacts Propagate to Language?


46. Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA


47. Hybrid LLM-Embedded Dialogue Agents for Learner Reflection: Designing Responsive and Theory-Driven Interactions


48. Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference


49. Examining and Addressing Barriers to Diversity in LLM-Generated Ideas


50. Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation


51. Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems


52. No One Size Fits All: QueryBandits for Hallucination Mitigation


53. Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking


54. Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory


55. What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance


56. InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation


57. Exploring Anti-Aging Literature via ConvexTopics and Large Language Models


58. An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction


59. KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem


60. CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions


61. Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis


62. Mitigating “Epistemic Debt” in Generative AI-Scaffolded Novice Programming using Metacognitive Scripts


63. Evaluating the Reliability of Digital Forensic Evidence Discovered by Large Language Model: A Case Study


64. Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning


65. MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Elastic LLMs


66. Closing the Expertise Gap in Residential Building Energy Retrofits: A Domain-Specific LLM for Informed Decision-Making


67. CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation


68. ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling


69. Talking to Yourself: Defying Forgetting in Large Language Models