Major LLM-Related Papers - 2025-12-01

1. On the Limits of Innate Planning in Large Language Models


2. Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit


3. Pessimistic Verification for Open Ended Math Questions


4. SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition


5. MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning


6. Prune4Web: DOM Tree Pruning Programming for Web Agent


7. OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection



9. ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning


10. Improving Procedural Skill Explanations via Constrained Generation: A Symbolic-LLM Hybrid Architecture


11. ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction


12. Representation Interventions Enable Lifelong Unstructured Knowledge Control


13. Learning Multi-Access Point Coordination in Agentic AI Wi-Fi with Large Language Models


14. Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework


15. Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning


16. $A^2$Flow: Automating Agentic Workflow Generation via Self-Adaptive Abstraction Operators


17. Minimizing Hyperbolic Embedding Distortion with LLM-Guided Hierarchy Restructuring


18. Revisiting Generalization Across Difficulty Levels: It’s Not So Easy


19. ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration


20. G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning


21. Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework


22. Escaping the Verifier: Learning to Reason via Demonstrations


23. Qwen3-VL Technical Report


24. Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining


25. BAMAS: Structuring Budget-Aware Multi-Agent Systems


26. Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation


27. Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation


28. Constructing and Benchmarking: a Labeled Email Dataset for Text-Based Phishing and Spam Detection Framework


29. From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings


30. Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model


31. Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis


32. Monet: Reasoning in Latent Visual Space Beyond Images and Language


33. FITRep: Attention-Guided Item Representation via MLLMs


34. SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding


35. TALES: A Taxonomy and Analysis of Cultural Representations in LLM-generated Stories


36. Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation


37. LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs


38. Beyond Patch Aggregation: 3-Pass Pyramid Indexing for Vision-Enhanced Document Retrieval


39. From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models


40. MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts


41. Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning


42. Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection


43. Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs


44. Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning


45. Even with AI, Bijection Discovery is Still Hard: The Opportunities and Challenges of OpenEvolve for Novel Bijection Construction


46. Towards Audio Token Compression in Large Audio Language Models


47. BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model


48. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory


49. Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries


50. Length-MAX Tokenizer for Language Models


51. Structured Prompting Enables More Robust, Holistic Evaluation of Language Models


52. SPHINX: A Synthetic Environment for Visual Perception and Reasoning


53. Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models


54. Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model


55. Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models


56. CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design



58. Spatio-Temporal Trajectory Foundation Model - Recent Advances and Future Directions


59. Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge


60. ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training


61. Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation


62. Active Slice Discovery in Large Language Models


63. Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?


64. DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation


65. PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach


66. Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores


67. Morality in AI. A plea to embed morality in LLM architectures and frameworks


68. Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes


69. MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data



71. Context-Aware Visual Prompting: Automating Geospatial Web Dashboards with Large Language Models and Agent Self-Validation for Decision Support


72. Domain-Grounded Evaluation of LLMs in International Student Knowledge


73. When LLMs Can’t Help: Real-World Evaluation of LLMs in Nutrition