Key LLM-Related Papers - 2025-10-09

1. NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents


2. Integrating Domain Knowledge into Process Discovery Using Large Language Models


3. VRPAgent: LLM-Driven Discovery of Heuristic Operators for Vehicle Routing Problems


4. Prompt Optimization Across Multiple Agents for Representing Diverse Human Populations


5. Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning


6. Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces


7. LLM-Assisted Modeling of Semantic Web-Enabled Multi-Agents Systems with AJAN


8. TGPR: Tree-Guided Policy Refinement for Robust Self-Debugging of LLMs


9. Autoformalizer with Tool Feedback


10. Verifying Memoryless Sequential Decision-making of Large Language Models


11. MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models


12. Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support


13. WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks


14. Auto-Prompt Ensemble for LLM Judge


15. Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them


16. Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization


17. Vibe Checker: Aligning Code Evaluation with Human Preference


18. h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning


19. MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline


20. AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs


21. Evolutionary Profiles for Protein Fitness Prediction


22. Online Rubrics Elicitation from Pairwise Comparisons



24. Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships


25. Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation


26. Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models


27. TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics



29. Comparing human and language models sentence processing difficulties on complex structures


30. Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning


31. Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications


32. LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish


33. Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models


34. Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration


35. Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge


36. Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages


37. The Limits of Goal-Setting Theory in LLM-Driven Assessment


38. VelLMes: A high-interaction AI-based deception framework


39. EDUMATH: Generating Standards-aligned Educational Math Word Problems


40. Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation


41. LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling


42. OpenJAI-v1.0: An Open Thai Large Language Model


43. SID: Multi-LLM Debate Driven by Self Signals


44. Recurrence-Complete Frame-based Action Models


45. FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline


46. Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness


47. Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities


48. Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization


49. Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management


50. LLM Company Policies and Policy Implications in Software Organizations


51. AISysRev - LLM-based Tool for Title-abstract Screening


52. Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks


53. Distilling Lightweight Language Models for C/C++ Vulnerabilities


54. StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering


55. Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation


56. The Algebra of Meaning: Why Machines Need Montague More Than Moore’s Law


57. LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval


58. Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels


59. Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift


60. Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin


61. Evaluating Node-tree Interfaces for AI Explainability


62. A Survey on Agentic Security: Applications, Threats and Defenses


63. Reward Model Perspectives: Whose Opinions Do Reward Models Reward?


64. Protecting De-identified Documents from Search-based Linkage Attacks


65. Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data


66. Constrained Natural Language Action Planning for Resilient Embodied Systems


67. Leveraging Large Language Models for Cybersecurity Risk Assessment – A Case from Forestry Cyber-Physical Systems


68. VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code


69. ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations


70. Surgeons Are Indian Males and Speech Therapists Are White Females: Auditing Biases in Vision-Language Models for Healthcare Professionals


71. Reproducibility Study of “XRec: Large Language Models for Explainable Recommendation”


72. MCCE: A Framework for Multi-LLM Collaborative Co-Evolution


73. Language models for longitudinal analysis of abusive content in Billboard Music Charts


74. Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians


75. Ensemble Deep Learning and LLM-Assisted Reporting for Automated Skin Lesion Diagnosis


76. LLM-Driven Rubric-Based Assessment of Algebraic Competence in Multi-Stage Block Coding Tasks with Design and Field Evaluation


77. Scalable multilingual PII annotation for responsible AI in LLMs


78. TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B


79. CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning


80. Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses


81. Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets


82. Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)


83. Generalized Multi-agent Social Simulation Framework


84. A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants


85. WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives