Major LLM-Related Papers - 2025-11-04

1. Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training


2. Validity Is What You Need


3. Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning


4. VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation


5. InnovatorBench: Evaluating Agents’ Ability to Conduct Innovative LLM Research


6. Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance


7. GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language


8. ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use


9. An In-depth Study of LLM Contributions to the Bin Packing Problem


10. GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation


11. Fints: Efficient Inference-Time Personalization for LLMs with Fine-Grained Instance-Tailored Steering


12. Glia: A Human-Inspired AI for Automated Systems Design and Optimization


13. Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models


14. Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations


15. Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base


16. CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions


17. Continuous Autoregressive Language Models


18. PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting


19. Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning


20. CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments


21. DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models


22. TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control


23. Thought Branches: Interpreting LLM Reasoning Requires Resampling


24. VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision


25. Atlas-Alignment: Making Interpretability Transferable Across Language Models


26. Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs


27. Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments


28. FOCUS: Efficient Keyframe Selection for Long Video Understanding


29. Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?


30. MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models


31. Higher-order Linear Attention


32. Languages are Modalities: Cross-Lingual Alignment via Encoder Injection


33. Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs


34. Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes


35. MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models


36. Unvalidated Trust: Cross-Stage Vulnerabilities in Large Language Model Architectures


37. Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler


38. Generating Accurate and Detailed Captions for High-Resolution Images


39. Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation


40. Towards a Measure of Algorithm Similarity


41. Consistency Training Helps Stop Sycophancy and Jailbreaks


42. Detecting Data Contamination in LLMs via In-Context Learning


43. Dataset Creation and Baseline Models for Sexism Detection in Hausa


44. Elastic Architecture Search for Efficient Language Models


45. LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval


46. Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations


47. LLM-based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks


48. RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification


49. Heterogeneous Robot Collaboration in Unstructured Environments with Grounded Generative Intelligence


50. How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison


51. Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench


52. Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token


53. CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs


54. Category-Aware Semantic Caching for Heterogeneous LLM Workloads


55. LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature


56. Detecting Prefix Bias in LLM-based Reward Models