Key LLM-Related Papers - 2025-10-16

1. Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math


2. From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails


3. Training LLM Agents to Empower Humans


4. Tandem Training for Language Models


5. Confidence as a Reward: Transforming LLMs into Reward Models


6. Assessing LLM Reasoning Through Implicit Causal Chain Discovery in Climate Discourse


7. Personalized Learning Path Planning with Goal-Driven Learner State Modeling


8. Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning


9. Emotional Cognitive Modeling Framework with Desire-Driven Objective Optimization for LLM-empowered Agent in Social Simulation


10. Toward Reasoning-Centric Time-Series Analysis


11. From Narratives to Probabilistic Reasoning: Predicting and Interpreting Drivers’ Hazardous Actions in Crashes Using Large Language Model


12. SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of LLM-based Embodied Agents


13. DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping


14. From Literal to Liberal: A Meta-Prompting Framework for Eliciting Human-Aligned Exception Handling in Large Language Models


15. Generative Universal Verifier as Multimodal Meta-Reasoner


16. Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs


17. The Art of Scaling Reinforcement Learning Compute for LLMs


18. RECODE: Reasoning Through Code Generation for Visual Question Answering


19. FIRST: Federated Inference Resource Scheduling Toolkit for Scientific AI Model Access


20. Time Series Foundation Models: Benchmarking Challenges and Requirements


21. Closing the Gap Between Text and Speech Understanding in LLMs


22. Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs


23. In-Browser LLM-Guided Fuzzing for Real-Time Prompt Injection Testing in Agentic AI Browsers


24. K-Merge: Online Continual Merging of Adapters for On-device Large Language Models


25. Offline and Online KL-Regularized RLHF under Differential Privacy


26. ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding


27. LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA


28. MADREC: A Multi-Aspect Driven LLM Agent for Explainable and Adaptive Recommendation


29. Document Intelligence in the Era of Large Language Models: A Survey


30. Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity


31. Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems


32. Thompson Sampling via Fine-Tuning of LLMs


33. Self-Augmented Visual Contrastive Decoding


34. LLM one-shot style transfer for Authorship Attribution and Verification


35. Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan’s Intelligent Interaction Systems


36. To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models


37. What “Not” to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging


38. LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems


39. StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation


40. Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval


41. Stable LLM Ensemble: Interaction between Example Representativeness and Diversity


42. On the Reasoning Abilities of Masked Diffusion Language Models


43. Multi-Label Clinical Text Eligibility Classification and Summarization System


44. DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models


45. TRUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models


46. ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models


47. VLA-0: Building State-of-the-Art VLAs with Zero Modification


48. Deliberate Lab: A Platform for Real-Time Human-AI Social Experiments


49. Developing and Validating the Arabic Version of the Attitudes Toward Large Language Models Scale


50. CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models


51. Max It or Miss It: Benchmarking LLM On Solving Extremal Problems


52. Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation


53. SpareCodeSearch: Searching for Code Context When You Have No Spare GPU


54. InferA: A Smart Assistant for Cosmological Ensemble Data


55. KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems


56. Adaptive Generation of Bias-Eliciting Questions for LLMs


57. VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages


58. FaStFACT: Faster, Stronger Long-Form Factuality Evaluations in LLMs


59. A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning


60. Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study


61. Gobernanza y trazabilidad “a prueba de AI Act” para casos de uso legales: un marco técnico-jurídico, métricas forenses y evidencias auditables


62. Mathematics with large language models as provers and verifiers


63. Scheming Ability in LLM-to-LLM Strategic Interactions


64. MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning


65. From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP


66. Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study


67. Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning


68. AutoCode: LLMs as Problem Setters for Competitive Programming