All AI Papers - 2025-10-10

1. How to Teach Large Multimodal Models New Skills

Authors: Zhen Zhu , Yiming Gong , Yao Xiao , Yaoyao Liu , Derek Hoiem
URL: https://arxiv.org/abs/2510.08564
Abstract:

How can we teach large multimodal models (LMMs) new skills without erasing prior abilities? We study sequential fine-tuning on five target skills while monitoring general ability on eight held-out benchmarks across three model families. We observe that apparent “forgetting” on held-out tasks after narrow fine-tuning can partly recover at later stages. We trace this behavior to a measurable shift in the output token distribution, manifested through a simple counting-bias probe that co-varies with forgetting. Guided by this picture, we identify two simple, robust tuning recipes that learn strongly while limiting drift: (i) updating only the self-attention projection layers, and (ii) updating only the MLP Gate&Up while freezing the Down projection. Across models and tasks, these choices deliver strong target gains while largely preserving held-out performance. Code is available at this https URL
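The two tuning recipes in this abstract can be illustrated as a name-based filter that decides which parameters stay trainable. This is a minimal sketch, not the authors' code: the parameter names (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) follow a LLaMA-style naming convention and are an assumption, as is the helper `trainable_names`.

```python
# Hypothetical LLaMA-style parameter name fragments; real checkpoints may differ.
ATTN_PROJ = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_GATE_UP = ("gate_proj", "up_proj")


def trainable_names(param_names, recipe):
    """Return the parameter names that stay trainable under a recipe.

    recipe: "attn"        -> only self-attention projection layers
            "mlp_gate_up" -> only MLP gate/up, with the down projection frozen
    """
    if recipe == "attn":
        keys = ATTN_PROJ
    elif recipe == "mlp_gate_up":
        keys = MLP_GATE_UP
    else:
        raise ValueError(f"unknown recipe: {recipe}")
    # Keep a name only if it contains one of the selected module fragments.
    return [n for n in param_names if any(f".{k}." in n for k in keys)]


# Toy parameter list for illustration only.
names = [
    "layers.0.self_attn.q_proj.weight",
    "layers.0.self_attn.o_proj.weight",
    "layers.0.mlp.gate_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.down_proj.weight",
    "lm_head.weight",
]
print(trainable_names(names, "attn"))
print(trainable_names(names, "mlp_gate_up"))
```

In a PyTorch training script, one would typically set `requires_grad = False` on every parameter whose name is not in the returned list.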

2. Agent Learning via Early Experience

Authors: Kai Zhang , Xiangchao Chen , Bo Liu , Tianci Xue , Zeyi Liao , Zhihan Liu , Xiyao Wang , Yuting Ning , Zhaorun Chen , Xiaohan Fu , Jian Xie , Yuxuan Sun , Boyu Gou , Qi Qi , Zihang Meng , Jianwei Yang , Ning Zhang , Xian Li , Ashish Shah , Dat Huynh , Hengduo Li , Zi Yang , Sara Cao , Lawrence Jang , Shuyan Zhou , Jiacheng Zhu , Huan Sun , Jason Weston , Yu Su , Yifan Wu
URL: https://arxiv.org/abs/2510.08558
Abstract:

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent’s own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.

3. FlowSearch: Advancing deep research with dynamic structured knowledge flow

Authors: Yusong Hu , Runmin Ma , Yue Fan , Jinxin Shi , Zongsheng Cao , Yuhao Zhou , Jiakang Yuan , Xiangchao Yan , Wenlong Zhang , Lei Bai , Bo Zhang
URL: https://arxiv.org/abs/2510.08521
Abstract:

Deep research is an inherently challenging task that demands both breadth and depth of thinking. It involves navigating diverse knowledge spaces and reasoning over complex, multi-step dependencies, which presents substantial challenges for agentic systems. To address this, we propose FlowSearch, a multi-agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning. FlowSearch is capable of strategically planning and expanding the knowledge flow to enable parallel exploration and hierarchical task decomposition, while also adjusting the knowledge flow in real time based on feedback from intermediate reasoning outcomes and insights. FlowSearch achieves state-of-the-art performance on both general and scientific benchmarks, including GAIA, HLE, GPQA and TRQA, demonstrating its effectiveness in multi-disciplinary research scenarios and its potential to advance scientific discovery. The code is available at this https URL .

4. CaRT: Teaching LLM Agents to Know When They Know Enough

Authors: Grace Liu , Yuxiao Qu , Jeff Schneider , Aarti Singh , Aviral Kumar
URL: https://arxiv.org/abs/2510.08517
Abstract:

Many tasks require learned models to strategically gather relevant information over multiple rounds of interaction before actually acting on a task. Strategic information gathering requires models to know not only how to effectively acquire information, but also when to stop gathering information and make a decision, in order to avoid overthinking or getting derailed when acting. In this paper, we formalize this problem and introduce Counterfactuals and Reasoning for Termination (CaRT), an approach for teaching LLMs when to stop seeking information. To appropriately learn when to terminate, CaRT fine-tunes LLMs using counterfactual pairs of trajectories, one where termination is appropriate and a minimally modified version of the same trajectory where it is not. It trains the LLM to explain the rationale for the termination decision in either case via verbal reasoning, and imbues this capability into the base LLM via fine-tuning. We instantiate CaRT in two domains: interactive medical diagnosis and math problem solving. In both domains, we find that CaRT improves the efficiency of information gathering and task success rate compared to other fine-tuning methods.

5. AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents

Authors: Shangheng Du , Xiangchao Yan , Dengyang Jiang , Jiakang Yuan , Yusong Hu , Xin Li , Liang He , Bo Zhang , Lei Bai
URL: https://arxiv.org/abs/2510.08511
Abstract:

Large language models (LLMs) have shown impressive performance in general programming tasks. However, in Machine Learning Engineering (MLE) scenarios such as AutoML and Kaggle competitions, achieving high performance depends heavily on expert intervention and repeated adjustments rather than simply generating correct code. When applied directly to these tasks, LLMs often lack fine-grained domain priors, and existing MLE approaches that use linear or tree-structured searches limit knowledge transfer to adjacent hierarchical links. As a result, they cannot leverage past full trajectories or share information across branches, limiting self-evolving ability and search space diversity. To address these limitations, we introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance and Monte Carlo Graph Search (MCGS) for efficient exploration. MCGS retains the tree-guided exploration of MCTS while embedding a graph structure into the expansion stage to enable dynamic path reorganization, historical trajectory reuse, and multi-solution fusion to support both self-evolution and collaborative learning. Combined with fine-grained operator sets, this design improves stability and accelerates convergence. Evaluation on the MLE-Bench shows that AutoMLGen achieves state-of-the-art performance in numerous dimensions, such as the average medal rate and the valid submission rate, under a 12-hour budget (half the standard runtime). The code is available at this https URL .

6. Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Authors: Bianca-Mihaela Ganescu , Suchir Salhan , Andrew Caines , Paula Buttery
URL: https://arxiv.org/abs/2510.08470
Abstract:

Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
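One common way to realize token-wise gating is a sigmoid gate computed from the concatenated text and image features, followed by a convex combination. The sketch below assumes that form; the abstract does not give the exact gate equation, and the names `gated_fusion`, `W` and `b` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(text_h, vis_h, W, b):
    """Fuse per-token text states with one global image embedding.

    text_h: (T, d) token hidden states
    vis_h:  (d,)   global image embedding, shared by all tokens
    W:      (2d, d) gate weights, b: (d,) gate bias (assumed parameterization)
    """
    vis = np.broadcast_to(vis_h, text_h.shape)                      # (T, d)
    gate = sigmoid(np.concatenate([text_h, vis], axis=-1) @ W + b)  # (T, d)
    # gate near 1 keeps the linguistic feature, gate near 0 keeps the visual one.
    fused = gate * text_h + (1.0 - gate) * vis
    return fused, gate


rng = np.random.default_rng(0)
T, d = 3, 4
fused, gate = gated_fusion(
    rng.normal(size=(T, d)), rng.normal(size=d),
    rng.normal(size=(2 * d, d)), np.zeros(d),
)
print(fused.shape, gate.min() > 0.0, gate.max() < 1.0)
```

Because the gate is computed per token, its values can be inspected afterwards, for example averaged by part of speech, which is the kind of analysis the abstract describes for content and function words.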

7. Revisiting Hallucination Detection with Effective Rank-based Uncertainty

Authors: Rui Wang , Zeming Wei , Guanzhang Yue , Meng Sun
URL: https://arxiv.org/abs/2510.08389
Abstract:

Detecting hallucinations in large language models (LLMs) remains a fundamental challenge for their trustworthy deployment. Going beyond basic uncertainty-driven hallucination detection frameworks, we propose a simple yet powerful method that quantifies uncertainty by measuring the effective rank of hidden states derived from multiple model outputs and different layers. Grounded in the spectral analysis of representations, our approach provides interpretable insights into the model’s internal reasoning process through semantic variations, while requiring no extra knowledge or additional modules, thus offering a combination of theoretical elegance and practical efficiency. Meanwhile, we theoretically demonstrate the necessity of quantifying uncertainty both internally (representations of a single response) and externally (different responses), providing a justification for using representations among different layers and responses from LLMs to detect hallucinations. Extensive experiments demonstrate that our method effectively detects hallucinations and generalizes robustly across various scenarios, contributing to a new paradigm of hallucination detection for LLM truthfulness.
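A widely used definition of effective rank is the exponential of the Shannon entropy of the normalized singular values. The sketch below applies that definition to a matrix whose rows stand in for hidden states from several sampled responses; whether the paper uses exactly this definition, and how it stacks layers and responses, are assumptions here.

```python
import numpy as np


def effective_rank(H, eps=1e-12):
    """Effective rank of H: exp of the entropy of normalized singular values."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                     # drop numerically zero directions
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Rows are toy "hidden states" of sampled responses (illustrative data only).
agreeing = np.outer(np.ones(5), np.array([1.0, 2.0, 3.0]))  # all rows identical
diverse = np.eye(4)                                         # mutually orthogonal rows
print(effective_rank(agreeing))  # close to 1: responses agree
print(effective_rank(diverse))   # close to 4: responses spread out
```

Under this reading, a higher effective rank across sampled responses would indicate more semantic variation, and hence more uncertainty.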

8. QAgent: A modular Search Agent with Interactive Query Understanding

Authors: Yi Jiang , Lei Shen , Lujie Niu , Sendong Zhao , Wenbo Su , Bo Zheng
URL: https://arxiv.org/abs/2510.08383
Abstract:

Large language models (LLMs) excel at natural language tasks but are limited by their static parametric knowledge, especially in knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this by integrating external information. However, (1) traditional RAG struggles with complex query understanding, and (2) even search agents trained with reinforcement learning (RL), despite their promise, still face generalization and deployment challenges. To address these limitations, we propose QAgent, a unified agentic RAG framework that employs a search agent for adaptive retrieval. This agent optimizes its understanding of the query through interactive reasoning and retrieval. To facilitate real-world application, we focus on modular search agents for query understanding that are plug-and-play in complex systems. Specifically, the agent follows a multi-step decision process trained with RL to maximize retrieval quality and support accurate downstream answers. We further analyze the strengths and weaknesses of end-to-end RL and propose a strategy that focuses on effective retrieval, thereby enhancing generalization in LLM applications. Experiments show QAgent excels at QA and serves as a plug-and-play module for real-world deployment.

9. LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

Authors: Benjamin F. Maier , Ulf Aslak , Luca Fiaschi , Nina Rismal , Kemble Fletcher , Christian C. Luhmann , Robbie Dow , Kli Pappas , Thomas V. Wiecki
URL: https://arxiv.org/abs/2510.08338
Abstract:

Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
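The mapping from a textual response to a Likert distribution can be sketched as cosine similarity against one reference statement per scale point, followed by normalization. The softmax normalization and the `temperature` parameter below are assumptions, not details taken from the abstract, and the embeddings are toy vectors rather than outputs of a real embedding model.

```python
import numpy as np


def ssr_distribution(response_vec, anchor_vecs, temperature=1.0):
    """Map one response embedding to a distribution over Likert points.

    anchor_vecs: (K, d) embeddings of K reference statements, one per point.
    The softmax over cosine similarities is an assumed normalization.
    """
    r = response_vec / np.linalg.norm(response_vec)
    A = anchor_vecs / np.linalg.norm(anchor_vecs, axis=1, keepdims=True)
    sims = A @ r                       # cosine similarity to each anchor
    z = sims / temperature
    z = z - z.max()                    # stabilize the exponentials
    p = np.exp(z)
    return p / p.sum()


anchors = np.eye(5)                                  # toy anchors for points 1..5
response = np.array([0.1, 0.1, 0.2, 0.9, 0.3])       # toy response embedding
p = ssr_distribution(response, anchors, temperature=0.2)
expected_rating = float(p @ np.arange(1, 6))
print(p.round(3), round(expected_rating, 2))
```

Keeping the full distribution, instead of only the most similar point, is what allows comparison with human response distributions.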

10. Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

Authors: Marius Dragoi , Ioana Pintilie , Florin Gogianu , Florin Brad
URL: https://arxiv.org/abs/2510.08325
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
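The contrast between the two metrics can be made concrete with a small sketch. `pass_at_k` is the standard unbiased estimator computed from n samples with c correct; `cover_at_tau` is a straightforward empirical reading of the definition in the abstract, and the per-problem counts are invented for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def cover_at_tau(correct_counts, n, tau):
    """Fraction of problems whose empirical success rate is at least tau."""
    return sum(c / n >= tau for c in correct_counts) / len(correct_counts)


n = 100
counts = [100, 50, 1, 0]   # correct completions per problem (toy data)

mean_pass_100 = sum(pass_at_k(n, c, 100) for c in counts) / len(counts)
print(mean_pass_100)                 # a single lucky sample counts as solved
print(cover_at_tau(counts, n, 0.5))  # only reliably solved problems count
```

In this toy data the third problem is solved once in 100 samples, so it contributes fully to Pass@100 but drops out of Cover@tau as soon as tau exceeds 0.01.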

11. First Try Matters: Revisiting the Role of Reflection in Reasoning Models

Authors: Liwei Kang , Yue Deng , Yao Xiao , Zhanfeng Mo , Wee Sun Lee , Lidong Bing
URL: https://arxiv.org/abs/2510.08308
Abstract:

Large language models have recently demonstrated significant gains in reasoning ability, often attributed to their capacity to generate longer chains of thought and engage in reflective reasoning. However, the contribution of reflections to performance improvement remains unclear. In this paper, we systematically analyze the rollouts of eight reasoning models on five mathematical datasets. We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model’s initial answer, a pattern consistent across models and datasets. To understand the role of reflections in training, we construct supervised fine-tuning (SFT) datasets with varying amounts of reflection steps. We observe that training models on rollouts with more reflection steps primarily enhances first-answer correctness rather than the ability to correct initially wrong answers through reflections. This motivates us to propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated, thereby reducing unnecessary reflection steps. Motivated by this, we further propose to dynamically truncate the reflections after a candidate answer has appeared during generation, which reduces reasoning tokens by 24.5% across five mathematical datasets, within a 2.9% drop in accuracy.

12. Symmetry-Aware Fully-Amortized Optimization with Scale Equivariant Graph Metanetworks

Authors: Bart Kuipers , Freek Byrman , Daniel Uyterlinde , Alejandro García-Castellanos
URL: https://arxiv.org/abs/2510.08300
Abstract:

Amortized optimization accelerates the solution of related optimization problems by learning mappings that exploit shared structure across problem instances. We explore the use of Scale Equivariant Graph Metanetworks (ScaleGMNs) for this purpose. By operating directly in weight space, ScaleGMNs enable single-shot fine-tuning of existing models, reducing the need for iterative optimization. We demonstrate the effectiveness of this approach empirically and provide a theoretical result: the gauge freedom induced by scaling symmetries is strictly smaller in convolutional neural networks than in multi-layer perceptrons. This insight helps explain the performance differences observed between architectures in both our work and that of Kalogeropoulos et al. (2024). Overall, our findings underscore the potential of symmetry-aware metanetworks as a powerful approach for efficient and generalizable neural network optimization. Open-source code: this https URL

13. Co-TAP: Three-Layer Agent Interaction Protocol Technical Report

Authors: Shunyu An , Miao Wang , Yongchao Li , Dong Wan , Lina Wang , Ling Qin , Liqin Gao , Congyao Fan , Zhiyong Mao , Jiange Pu , Wenji Xia , Dong Zhao , Rui Hu , Ji Lu , Guiyue Zhou , Baoyu Tang , Yanqin Gao , Yongsheng Du , Daigang Xu , Lingjun Huang , Baoli Wang , Xiwen Zhang , Luyao Wang , Shilong Liu
URL: https://arxiv.org/abs/2510.08263
Abstract:

This paper proposes Co-TAP (T: Triple, A: Agent, P: Protocol), a three-layer agent interaction protocol designed to address the challenges faced by multi-agent systems across the three core dimensions of Interoperability, Interaction and Collaboration, and Knowledge Sharing. We have designed and proposed a layered solution composed of three core protocols: the Human-Agent Interaction Protocol (HAI), the Unified Agent Protocol (UAP), and the Memory-Extraction-Knowledge Protocol (MEK). HAI focuses on the interaction layer, standardizing the flow of information between users, interfaces, and agents by defining a standardized, event-driven communication paradigm. This ensures the real-time performance, reliability, and synergy of interactions. As the core of the infrastructure layer, UAP is designed to break down communication barriers among heterogeneous agents through unified service discovery and protocol conversion mechanisms, thereby enabling seamless interconnection and interoperability of the underlying network. MEK, in turn, operates at the cognitive layer. By establishing a standardized “Memory (M) - Extraction (E) - Knowledge (K)” cognitive chain, it empowers agents with the ability to learn from individual experiences and form shareable knowledge, thereby laying the foundation for the realization of true collective intelligence. We believe this protocol framework will provide a solid engineering foundation and theoretical guidance for building the next generation of efficient, scalable, and intelligent multi-agent applications.

14. Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness

Authors: Jiyang Qiu , Xinbei Ma , Yunqing Xu , Zhuosheng Zhang , Hai Zhao
URL: https://arxiv.org/abs/2510.08238
Abstract:

The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered sequence. It starts with an initial trigger, and subsequent ones are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Due to training data modeling the stochastic nature of the environment, the implantation of CoTri paradoxically enhances the agent’s performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents, improving their inherent robustness and task capabilities, which ultimately makes the attack more stealthy and raises potential safety risks.

Authors: Yunlong Deng , Boyang Sun , Yan Li , Lingjing Kong , Zeyu Tang , Kun Zhang , Guangyi Chen
URL: https://arxiv.org/abs/2510.08222
Abstract:

Due to their inherent complexity, reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models, especially large language models (LLMs). Although humans can solve these tasks with ease, existing models, even after extensive pre-training and post-training at scale, still fail to perform reasoning reliably. In this paper, we revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space and to offer insights for addressing their challenges. Specifically, we cast reasoning tasks as a selection mechanism, in which high-level logical concepts function as selection operators on the given observations, such as identifying the correct answer in a math problem or filling the appropriate entry in Sudoku. We emphasize two key properties of this formulation that shed light on the difficulty of reasoning tasks. First, the latent space exceeds the observation space in complexity, even when the correct answer is fully determined by the observed input. Second, the latent variables, corresponding to logical thought, are densely structured and exhibit strong dependencies. Building on this formulation, we introduce a framework, called SR$^2$, that incorporates the estimated latent variables as feedback into the selection mechanism, thereby facilitating the learning of dense dependencies among latent representations. The framework consists of three key modules: reflective representation learning, dependency self-refinement, and periodic intermediate alignment. Experimentally, we show that our approach yields significant gains in reasoning accuracy, for example, attaining over 10% improvement in performance with 8× fewer parameters on the Sudoku and Maze tasks over recent advances.

16. DODO: Causal Structure Learning with Budgeted Interventions

Authors: Matteo Gregorini , Chiara Boldrini , Lorenzo Valerio
URL: https://arxiv.org/abs/2510.08207
Abstract:

Artificial Intelligence has achieved remarkable advancements in recent years, yet much of its progress relies on identifying increasingly complex correlations. Enabling causality awareness in AI has the potential to enhance its performance by enabling a deeper understanding of the underlying mechanisms of the environment. In this paper, we introduce DODO, an algorithm defining how an Agent can autonomously learn the causal structure of its environment through repeated interventions. We assume a scenario where an Agent interacts with a world governed by a causal Directed Acyclic Graph (DAG), which dictates the system’s dynamics but remains hidden from the Agent. The Agent’s task is to accurately infer the causal DAG, even in the presence of noise. To achieve this, the Agent performs interventions, leveraging causal inference techniques to analyze the statistical significance of observed changes. Results show better performance for DODO, compared to observational approaches, in all but the most limited resource conditions. DODO can often reconstruct the structure of the causal graph with zero errors. In the most challenging configuration, DODO outperforms the best baseline by 0.25 F1 points.

17. The Tournament Tree Method for preference elicitation in Multi-criteria decision-making

Authors: Diego García-Zamora , Álvaro Labella , José Rui Figueira
URL: https://arxiv.org/abs/2510.08197
Abstract:

Pairwise comparison methods, such as Fuzzy Preference Relations and Saaty’s Multiplicative Preference Relations, are widely used to model expert judgments in multi-criteria decision-making. However, their application is limited by the high cognitive load required to complete $m(m-1)/2$ comparisons, the risk of inconsistency, and the computational complexity of deriving consistent value scales. This paper proposes the Tournament Tree Method (TTM), a novel elicitation and evaluation framework that overcomes these limitations. The TTM requires only $m-1$ pairwise comparisons to obtain a complete, reciprocal, and consistent comparison matrix. The method consists of three phases: (i) elicitation of expert judgments using a reduced set of targeted comparisons, (ii) construction of the consistent pairwise comparison matrix, and (iii) derivation of a global value scale from the resulting matrix. The proposed approach ensures consistency by design, minimizes cognitive effort, and reduces the dimensionality of preference modeling from $m(m-1)/2$ to $m$ parameters. Furthermore, it is compatible with the classical Deck of Cards method, and thus it can handle interval and ratio scales. We have also developed a web-based tool that demonstrates its practical applicability in real decision-making scenarios.
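Why $m-1$ comparisons can suffice is easy to see on a multiplicative scale: if the comparisons form a spanning tree over the alternatives, weights propagate along the tree and every remaining entry follows as a ratio of weights. The sketch below illustrates only that fact. It is not the TTM elicitation procedure, whose choice of pairs the abstract does not describe, and the function name and input format are hypothetical.

```python
from collections import deque


def consistent_matrix(m, comparisons):
    """Build a consistent reciprocal matrix from m-1 ratio judgments.

    comparisons: list of (i, j, ratio) meaning w_i / w_j = ratio; the pairs
    must connect all alternatives 0..m-1 (a spanning tree).
    """
    adj = {i: [] for i in range(m)}
    for i, j, r in comparisons:
        adj[i].append((j, 1.0 / r))   # w_j = w_i / r
        adj[j].append((i, r))         # w_i = w_j * r
    w = {0: 1.0}                      # fix the scale at alternative 0
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v, factor in adj[u]:
            if v not in w:
                w[v] = w[u] * factor
                queue.append(v)
    if len(w) != m:
        raise ValueError("comparisons do not connect all alternatives")
    total = sum(w.values())
    weights = [w[i] / total for i in range(m)]
    matrix = [[weights[i] / weights[j] for j in range(m)] for i in range(m)]
    return matrix, weights


# Three judgments for four alternatives (toy values).
M, weights = consistent_matrix(4, [(0, 1, 2.0), (1, 2, 3.0), (2, 3, 0.5)])
print([round(x, 3) for x in weights])
print(round(M[0][2], 3))  # implied by the chain 0 -> 1 -> 2
```

Every entry of the resulting matrix satisfies $a_{ik} = a_{ij} a_{jk}$ by construction, so no consistency check or repair step is needed afterwards.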

18. Measuring What Matters: The AI Pluralism Index

Authors: Rashid Mushkani
URL: https://arxiv.org/abs/2510.08193
Abstract:

Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling “Unknown” evidence to report both lower-bound (“evidence”) and known-only scores with coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.

19. R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Authors: Yi Lu , Jianing Wang , Linsen Guo , Wei He , Hongyin Tang , Tao Gui , Xuanjing Huang , Xuezhi Cao , Wei Wang , Xunliang Cai
URL: https://arxiv.org/abs/2510.08189
Abstract:

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models’ ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark, comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate thinking budget across multiple problems appropriately. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on the multi-horizon reasoning tasks, but also promotes accuracy on standard reasoning tasks, with an increase of 7.5 on AIME2024. These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

20. Prepared mind, fast response: A temporal decoupling framework for adaptive knowledge orchestration in open-domain dialogue

Authors: Jinling Gan , Churong Liang , Runnan Li
URL: https://arxiv.org/abs/2510.08175
Abstract:

The latency-quality tradeoff is a fundamental constraint in open-domain dialogue AI systems, since comprehensive knowledge access necessitates prohibitive response delays. Contemporary approaches offer two inadequate solutions: lightweight instruct models achieve sub-second latency but lack reasoning depth, while tool-augmented ReAct agents enhance factuality through external knowledge at the cost of synchronous execution that blocks interaction during retrieval processes. PMFR is thus proposed, with a temporal decoupling framework that fundamentally resolves the contradiction through asynchronous knowledge orchestration. PMFR employs three coordinated components: (1) a Knowledge Adequacy Evaluator for real-time sufficiency assessment, (2) a Lightweight Response Generator for immediate user interaction, and (3) an Asynchronous Knowledge Refinement Agent for background knowledge enhancement. This architecture maintains continuous conversational flow while progressively enriching knowledge coverage through intelligent triggering mechanisms. Evaluation results on TopiOCQA demonstrate PMFR outperforms brute-force scaling: PMFR achieves 95.3% latency reduction (23.38s -> 1.09s) while preserving response quality comparable to heavyweight synchronous baselines (GEval-C: 0.613 vs. 0.620).

21. Can Risk-taking AI-Assistants suitably represent entities

Authors: Ali Mazyaki , Mohammad Naghizadeh , Samaneh Ranjkhah Zonouzaghi , Amirhossein Farshi Sotoudeh
URL: https://arxiv.org/abs/2510.08114
Abstract:

Abstract not available

22. From Ethical Declarations to Provable Independence: An Ontology-Driven Optimal-Transport Framework for Certifiably Fair AI Systems

Authors: Sukriti Bhattacharya , Chitro Majumdar
URL: https://arxiv.org/abs/2510.08086
Abstract:

This paper presents a framework for provably fair AI that overcomes the limits of current bias mitigation methods by systematically removing all sensitive information and its proxies. Using ontology engineering in OWL 2 QL, it formally defines sensitive attributes and infers their proxies through logical reasoning, constructing a sigma algebra G that captures the full structure of biased patterns. Fair representations are then obtained via Delbaen Majumdar optimal transport, which generates variables independent of G while minimizing L2 distance to preserve accuracy. This guarantees true independence rather than mere decorrelation. By modeling bias as dependence between sigma algebras, compiling ontological knowledge into measurable structures, and using optimal transport as the unique fair transformation, the approach ensures complete fairness in tasks like loan approval, where proxies such as ZIP code reveal race. The result is a certifiable and mathematically grounded method for trustworthy AI.

23. AutoQual: An LLM Agent for Automated Discovery of Interpretable Features for Review Quality Assessment

Authors: Xiaochong Lan , Jie Feng , Yinxing Liu , Xinlei Shi , Yong Li
URL: https://arxiv.org/abs/2510.08081
Abstract:

Ranking online reviews by their intrinsic quality is a critical task for e-commerce platforms and information services, impacting user experience and business outcomes. However, quality is a domain-dependent and dynamic concept, making its assessment a formidable challenge. Traditional methods relying on hand-crafted features are unscalable across domains and fail to adapt to evolving content patterns, while modern deep learning approaches often produce black-box models that lack interpretability and may prioritize semantics over quality. To address these challenges, we propose AutoQual, an LLM-based agent framework that automates the discovery of interpretable features. While demonstrated on review quality assessment, AutoQual is designed as a general framework for transforming tacit knowledge embedded in data into explicit, computable features. It mimics a human research process, iteratively generating feature hypotheses through reflection, operationalizing them via autonomous tool implementation, and accumulating experience in a persistent memory. We deploy our method on a large-scale online platform with a billion-level user base. Large-scale A/B testing confirms its effectiveness, increasing average reviews viewed per user by 0.79% and the conversion rate of review readers by 0.27%.

24. Multi-Condition Conformal Selection

Authors: Qingyang Hao , Wenbo Liao , Bingyi Jing , Hongxin Wei
URL: https://arxiv.org/abs/2510.08075
Abstract:

Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y > c) and overlooks practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini-Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.
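
For readers unfamiliar with the Benjamini-Hochberg (BH) procedure mentioned above, here is a minimal sketch of the standard step-up rule applied to a list of p-values. It is not the MCCS algorithm: the nonconformity score, the conformal p-value construction, and the way MCCS combines conditions are not described in the abstract and are not modeled here. The example p-values are made up.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.1) -> List[int]:
    """Return the indices selected by the standard BH step-up procedure.

    Finds the largest rank k such that the k-th smallest p-value is
    <= k * alpha / m, then selects the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


# Made-up p-values: the smallest one misses its own threshold (0.03 > 0.025),
# but the step-up rule still selects it because a later rank passes.
example = [0.03, 0.04, 0.045, 0.9]
selected = benjamini_hochberg(example, alpha=0.1)
```

The step-up behaviour is the design point worth noticing: selection is decided by the largest passing rank, not by each p-value's individual threshold.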

25. LinguaSim: Interactive Multi-Vehicle Testing Scenario Generation via Natural Language Instruction Based on Large Language Models

Authors: Qingyuan Shi , Qingwen Meng , Hao Cheng , Qing Xu , Jianqiang Wang
URL: https://arxiv.org/abs/2510.08046
Abstract:

The generation of testing and training scenarios for autonomous vehicles has drawn significant attention. While Large Language Models (LLMs) have enabled new scenario generation methods, current methods struggle to balance command adherence accuracy with the realism of real-world driving environments. To reduce scenario description complexity, these methods often compromise realism by limiting scenarios to 2D, or open-loop simulations where background vehicles follow predefined, non-interactive behaviors. We propose LinguaSim, an LLM-based framework that converts natural language into realistic, interactive 3D scenarios, ensuring both dynamic vehicle interactions and faithful alignment between the input descriptions and the generated scenarios. A feedback calibration module further refines the generation precision, improving fidelity to user intent. By bridging the gap between natural language and closed-loop, interactive simulations, LinguaSim constrains adversarial vehicle behaviors using both the scenario description and the autonomous driving model guiding them. This framework facilitates the creation of high-fidelity scenarios that enhance safety testing and training. Experiments show LinguaSim can generate scenarios with varying criticality aligned with different natural language descriptions (ACT: 0.072 s for dangerous vs. 3.532 s for safe descriptions; comfortability: 0.654 vs. 0.764), and its refinement module effectively reduces excessive aggressiveness in LinguaSim’s initial outputs, lowering the crash rate from 46.9% to 6.3% to better match user intentions.

26. AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models

Authors: Xiaoshuang Ji , Zhendong Zhao , Xiaoyan Gu , Xiaojun Chen , Xin Zhao , Zeyao Liu
URL: https://arxiv.org/abs/2510.08034
Abstract:

Parameter-efficient finetuning (PEFT) aims to mitigate the substantial computational and memory overhead involved in adapting large-scale pretrained models to diverse downstream tasks. Among numerous PEFT strategies, Low-Rank Adaptation (LoRA) has emerged as one of the most widely adopted approaches due to its robust empirical performance and low implementation complexity. In practical deployment, LoRA is typically applied to the $W^Q$ and $W^V$ projection matrices of self-attention modules, enabling an effective trade-off between model performance and parameter efficiency. While LoRA has achieved considerable empirical success, it still encounters challenges such as suboptimal performance and slow convergence. To address these limitations, we introduce AILoRA, a novel parameter-efficient method that incorporates function-aware asymmetric low-rank priors. Our empirical analysis reveals that the projection matrices $W^Q$ and $W^V$ in the self-attention mechanism exhibit distinct parameter characteristics, stemming from their functional differences. Specifically, $W^Q$ captures task-specific semantic space knowledge essential for computing attention distributions, making its parameters highly sensitive to downstream task variations. In contrast, $W^V$ encodes token-level feature representations that tend to remain stable across tasks and layers. Leveraging these insights, AILoRA performs a function-aware initialization by injecting the principal components of $W^Q$ to retain task-adaptive capacity, and the minor components of $W^V$ to preserve generalizable feature representations. This asymmetric initialization strategy enables LoRA modules to better capture the specialized roles of attention parameters, thereby enhancing both finetuning performance and convergence efficiency.

27. PEAR: Phase Entropy Aware Reward for Efficient Reasoning

Authors: Chen Huang , Wei Lu , Wenxuan Zhang
URL: https://arxiv.org/abs/2510.08026
Abstract:

Large Reasoning Models (LRMs) have achieved impressive performance on complex reasoning tasks by generating detailed chain-of-thought (CoT) explanations. However, these responses are often excessively long, containing redundant reasoning steps that inflate inference cost and reduce usability. Controlling the length of generated reasoning without sacrificing accuracy remains an open challenge. Through a systematic empirical analysis, we reveal a consistent positive correlation between model entropy and response length at different reasoning stages across diverse LRMs: the thinking phase exhibits higher entropy, reflecting exploratory behavior of longer responses, while the final answer phase shows lower entropy, indicating a more deterministic solution. This observation suggests that entropy at different reasoning stages can serve as a control knob for balancing conciseness and performance. Based on this insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase and allows moderate exploration at the final answer phase, which encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly. This enables adaptive control of response length without relying on explicit length targets or rigid truncation rules. Extensive experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales. In addition, PEAR demonstrates strong out-of-distribution (OOD) robustness beyond the training distribution. Our code is available at: this https URL .
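
To make the idea of a phase-dependent entropy penalty concrete, the sketch below computes the mean per-token Shannon entropy of a response phase and subtracts a weighted thinking-phase entropy from a correctness reward. The abstract does not give PEAR's actual reward formula, so `toy_reward`, its `weight` parameter, and the example distributions are all hypothetical illustrations rather than the authors' method.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_phase_entropy(dists: List[Sequence[float]]) -> float:
    """Average per-token entropy over the tokens of one phase (0.0 if empty)."""
    if not dists:
        return 0.0
    return sum(token_entropy(d) for d in dists) / len(dists)


def toy_reward(correct: bool, thinking_dists: List[Sequence[float]],
               weight: float = 0.5) -> float:
    """HYPOTHETICAL reward: correctness minus a penalty on thinking-phase entropy.

    The final-answer phase is deliberately left unpenalized here, loosely
    mirroring the abstract's description; the real PEAR reward may differ.
    """
    task_reward = 1.0 if correct else 0.0
    return task_reward - weight * mean_phase_entropy(thinking_dists)


# Made-up next-token distributions over a 4-token vocabulary.
diffuse_thinking = [[0.25, 0.25, 0.25, 0.25]] * 3   # high entropy at every step
peaked_thinking = [[0.97, 0.01, 0.01, 0.01]] * 3    # low entropy at every step

# With equal correctness, the lower-entropy thinking trace gets the higher reward.
print(toy_reward(True, peaked_thinking) > toy_reward(True, diffuse_thinking))
```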

28. Language Models Do Not Embed Numbers Continuously

Authors: Alex O. Davies , Roussel Nzoyem , Nirav Ajmeri , Telmo M. Silva Filho
URL: https://arxiv.org/abs/2510.08009
Abstract:

Recent research has extensively studied how large language models manipulate integers in specific arithmetic tasks, and on a more fundamental level, how they represent numeric values. These previous works have found that language model embeddings can be used to reconstruct the original values; however, they do not evaluate whether language models actually model continuous values as continuous. Using expected properties of the embedding space, including linear reconstruction and principal component analysis, we show that language models not only represent numeric spaces as non-continuous but also introduce significant noise. Using models from three major providers (OpenAI, Google Gemini and Voyage AI), we show that while reconstruction is possible with high fidelity ($R^2 \geq 0.95$), principal components only explain a minor share of variation within the embedding space. This indicates that many components within the embedding space are orthogonal to the simple numeric input space. Further, both linear reconstruction and explained variance suffer with increasing decimal precision, despite the ordinal nature of the input space being fundamentally unchanged. The findings of this work therefore have implications for the many areas where embedding models are used, in particular where high numerical precision, large magnitudes or mixed-sign values are common.

29. ReInAgent

Authors: Haitao Jia , Ming He , Zimo Yin , Likang Wu , Jianping Fan , Jitao Sang
URL: https://arxiv.org/abs/2510.07988
Abstract:

Mobile GUI agents exhibit substantial potential to facilitate and automate the execution of user tasks on mobile phones. However, existing mobile GUI agents predominantly privilege autonomous operation and neglect the necessity of active user engagement during task execution. This omission undermines their adaptability to information dilemmas including ambiguous, dynamically evolving, and conflicting task scenarios, leading to execution outcomes that deviate from genuine user requirements and preferences. To address these shortcomings, we propose ReInAgent, a context-aware multi-agent framework that leverages dynamic information management to enable human-in-the-loop mobile task navigation. ReInAgent integrates three specialized agents around a shared memory module: an information-managing agent for slot-based information management and proactive interaction with the user, a decision-making agent for conflict-aware planning, and a reflecting agent for task reflection and information consistency validation. Through continuous contextual information analysis and sustained user-agent collaboration, ReInAgent overcomes the limitation of existing approaches that rely on clear and static task assumptions. Consequently, it enables more adaptive and reliable mobile task navigation in complex, real-world scenarios. Experimental results demonstrate that ReInAgent effectively resolves information dilemmas and produces outcomes that are more closely aligned with genuine user preferences. Notably, on complex tasks involving information dilemmas, ReInAgent achieves a 25% higher success rate than Mobile-Agent-v2.

30. VoiceAgentBench: Are Voice Assistants ready for agentic tasks?

Authors: Dhruv Jain , Harshit Shukla , Gautam Rajeev , Ashish Kulkarni , Chandra Khatri , Shubham Agarwal
URL: https://arxiv.org/abs/2510.07978
Abstract:

Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription or question-answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on their speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.

31. TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

Authors: Pengkun Jiao , Yiming Jin , Jianhui Yang , Chenhe Dong , Zerui Huang , Shaowei Yao , Xiaojiang Zhou , Dan Ou , Haihong Tang
URL: https://arxiv.org/abs/2510.07972
Abstract:

Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.

32. Agent-Based Genetic Algorithm for Crypto Trading Strategy Optimization

Authors: Qiushi Tian , Churong Liang , Kairan Hong , Runnan Li
URL: https://arxiv.org/abs/2510.07943
Abstract:

Cryptocurrency markets present formidable challenges for trading strategy optimization due to extreme volatility, non-stationary dynamics, and complex microstructure patterns that render conventional parameter optimization methods fundamentally inadequate. We introduce Crypto Genetic Algorithm Agent (CGA-Agent), a pioneering hybrid framework that synergistically integrates genetic algorithms with intelligent multi-agent coordination mechanisms for adaptive trading strategy parameter optimization in dynamic financial environments. The framework uniquely incorporates real-time market microstructure intelligence and adaptive strategy performance feedback through intelligent mechanisms that dynamically guide evolutionary processes, transcending the limitations of static optimization approaches. Comprehensive empirical evaluation across three cryptocurrencies demonstrates systematic and statistically significant performance improvements on both total returns and risk-adjusted metrics.

33. Enabling Personalized Long-term Interactions in LLM-based Agents through Persistent Memory and User Profiles

Authors: Rebecca Westhäußer , Wolfgang Minker , Sebatian Zepf
URL: https://arxiv.org/abs/2510.07925
Abstract:

Large language models (LLMs) increasingly serve as the central control unit of AI agents, yet current approaches remain limited in their ability to deliver personalized interactions. While Retrieval Augmented Generation enhances LLM capabilities by improving context-awareness, it lacks mechanisms to combine contextual information with user-specific data. Although personalization has been studied in fields such as human-computer interaction and cognitive science, existing perspectives largely remain conceptual, with limited focus on technical implementation. To address these gaps, we build on a unified definition of personalization as a conceptual foundation to derive technical requirements for adaptive, user-centered LLM-based agents. Combined with established agentic AI patterns such as multi-agent collaboration and multi-source retrieval, we present a framework that integrates persistent memory, dynamic coordination, self-validation, and evolving user profiles to enable personalized long-term interactions. We evaluate our approach on three public datasets using metrics such as retrieval accuracy, response correctness, and BERTScore. We complement these results with a five-day pilot user study providing initial insights into user feedback on perceived personalization. The study provides early indications that guide future work and highlights the potential of integrating persistent memory and user profiles to improve the adaptivity and perceived personalization of LLM-based agents.

34. Profit Mirage: Revisiting Information Leakage in LLM-based Financial Agents

Authors: Xiangyu Li , Yawen Zeng , Xiaofen Xing , Jin Xu , Xiangmin Xu
URL: https://arxiv.org/abs/2510.07920
Abstract:

LLM-based financial agents have attracted widespread excitement for their ability to trade like human experts. However, most systems exhibit a “profit mirage”: dazzling back-tested returns evaporate once the model’s knowledge window ends, because of the inherent information leakage in LLMs. In this paper, we systematically quantify this leakage issue across four dimensions and release FinLake-Bench, a leakage-robust evaluation benchmark. Furthermore, to mitigate this issue, we introduce FactFin, a framework that applies counterfactual perturbations to compel LLM-based agents to learn causal drivers instead of memorized outcomes. FactFin integrates four core components: Strategy Code Generator, Retrieval-Augmented Generation, Monte Carlo Tree Search, and Counterfactual Simulator. Extensive experiments show that our method surpasses all baselines in out-of-sample generalization, delivering superior risk-adjusted performance.

35. Towards Meaningful Transparency in Civic AI Systems

Authors: Dave Murray-Rust , Kars Alfrink , Cristina Zaga
URL: https://arxiv.org/abs/2510.07889
Abstract:

Artificial intelligence has become a part of the provision of governmental services, from making decisions about benefits to issuing fines for parking violations. However, AI systems rarely live up to the promise of neutral optimisation, creating biased or incorrect outputs and reducing the agency of both citizens and civic workers to shape the way decisions are made. Transparency is a principle that can both help subjects understand decisions made about them and shape the processes behind those decisions. However, transparency as practiced around AI systems tends to focus on the production of technical objects that represent algorithmic aspects of decision making. These are often difficult for publics to understand, do not connect to potential for action, and do not give insight into the wider socio-material context of decision making. In this paper, we build on existing approaches that take a human-centric view on AI transparency, combined with a socio-technical systems view, to develop the concept of meaningful transparency for civic AI systems: transparencies that allow publics to engage with AI systems that affect their lives, connecting understanding with potential for action.

36. Understanding DeepResearch via Reports

Authors: Tianyu Fan , Xinyao Niu , Yuxiang Zheng , Fengji Zhang , Chengen Huang , Bei Chen , Junyang Lin , Chao Huang
URL: https://arxiv.org/abs/2510.07861
Abstract:

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, which are capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions: quality, redundancy, and factuality, using an innovative LLM-as-a-Judge methodology achieving strong expert concordance. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: this https URL .

37. Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models

Authors: Zhiqing Cui , Binwu Wang , Qingxiang Liu , Yeqiang Wang , Zhengyang Zhou , Yuxuan Liang , Yang Wang
URL: https://arxiv.org/abs/2510.07858
Abstract:

Large language models (LLMs) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations, such as a marginalized role in model architectures, reliance on coarse statistical text prompts, and a lack of interpretability. In this work, we introduce Augur, a fully LLM-driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two-stage teacher-student architecture in which a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and is fine-tuned on high-confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 25 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.

38. FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning

Authors: Shuangyan Deng , Haizhou Peng , Jiachen Xu , Rui Mao , Ciprian Doru Giurcăneanu , Jiamou Liu
URL: https://arxiv.org/abs/2510.07852
Abstract:

Multimodal Large Language Models (MLLMs) have made substantial progress in recent years. However, their rigorous evaluation within specialized domains like finance is hindered by the absence of datasets characterized by professional-level knowledge intensity, detailed annotations, and advanced reasoning complexity. To address this critical gap, we introduce FinMR, a high-quality, knowledge-intensive multimodal dataset explicitly designed to evaluate expert-level financial reasoning capabilities at a professional analyst’s standard. FinMR comprises over 3,200 meticulously curated and expertly annotated question-answer pairs across 15 diverse financial topics, ensuring broad domain diversity and integrating sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation tasks across multiple image types. Through comprehensive benchmarking with leading closed-source and open-source MLLMs, we highlight significant performance disparities between these models and professional financial analysts, uncovering key areas for model advancement, such as precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding. By providing richly varied visual content and thorough explanatory annotations, FinMR establishes itself as an essential benchmark tool for assessing and advancing multimodal financial reasoning toward professional analyst-level competence.

39. CityNav

Authors: Yuping Zhou , Siqi Lai , Jindong Han , Hao Liu
URL: https://arxiv.org/abs/2510.07825
Abstract:

The rise of Internet of Vehicles (IoV) technologies is transforming traffic management from isolated control to a collective, multi-vehicle process. At the heart of this shift is multi-vehicle dynamic navigation, which requires simultaneously routing large fleets under evolving traffic conditions. Existing path search algorithms and reinforcement learning methods struggle to scale to city-wide networks, often failing to capture the nonlinear, stochastic, and coupled dynamics of urban traffic. To address these challenges, we propose CityNav, a hierarchical, LLM-powered framework for large-scale multi-vehicle navigation. CityNav integrates a global traffic allocation agent, which coordinates strategic traffic flow distribution across regions, with local navigation agents that generate locally adaptive routes aligned with global directives. To enable effective cooperation, we introduce a cooperative reasoning optimization mechanism, in which agents are jointly trained with a dual-reward structure: individual rewards promote per-vehicle efficiency, while shared rewards encourage network-wide coordination and congestion reduction. Extensive experiments on four real-world road networks of varying scales (up to 1.6 million roads and 430,000 intersections) and traffic datasets demonstrate that CityNav consistently outperforms nine classical path search and RL-based baselines in city-scale travel efficiency and congestion mitigation. Our results highlight the potential of LLMs to enable scalable, adaptive, and cooperative city-wide traffic navigation, providing a foundation for intelligent, large-scale vehicle routing in complex urban environments. Our project is available at this https URL .

40. Strategic Communication under Threat: Learning Information Trade-offs in Pursuit-Evasion Games

Authors: Valerio La Gatta , Dolev Mutzari , Sarit Kraus , VS Subrahmanian
URL: https://arxiv.org/abs/2510.07813
Abstract:

Adversarial environments require agents to navigate a key strategic trade-off: acquiring information enhances situational awareness, but may simultaneously expose them to threats. To investigate this tension, we formulate a Pursuit-Evasion-Exposure-Concealment Game (PEEC) in which a pursuer agent must decide when to communicate in order to obtain the evader’s position. Each communication reveals the pursuer’s location, increasing the risk of being targeted. Both agents learn their movement policies via reinforcement learning, while the pursuer additionally learns a communication policy that balances observability and risk. We propose SHADOW (Strategic-communication Hybrid Action Decision-making under partial Observation for Warfare), a multi-headed sequential reinforcement learning framework that integrates continuous navigation control, discrete communication actions, and opponent modeling for behavior prediction. Empirical evaluations show that SHADOW pursuers achieve higher success rates than six competitive baselines. Our ablation study confirms that temporal sequence modeling and opponent modeling are critical for effective decision-making. Finally, our sensitivity analysis reveals that the learned policies generalize well across varying communication risks and physical asymmetries between agents.

41. GCPO: When Contrast Fails, Go Gold

Authors: Hao Wu , Wei Liu
URL: https://arxiv.org/abs/2510.07790
Abstract:

Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model’s rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: this https URL .
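
The drawback described above can be seen in a few lines: with group-normalized advantages of the kind used in GRPO, a group whose rollouts all receive the same reward yields zero advantage for every sample, so there is no learning signal. The sketch below also shows, purely as an illustration, how adding one reference ("gold") answer with reward 1 to an all-incorrect group restores a nonzero signal. How GCPO actually incorporates the reference answer is not specified in the abstract; the `with_gold` construction and the reward values are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, all incorrect (reward 0): no contrast in the group.
all_wrong = [0.0, 0.0, 0.0, 0.0]
flat = group_relative_advantages(all_wrong)

# ASSUMPTION for illustration: append a reference answer scored as correct.
with_gold = all_wrong + [1.0]
contrast = group_relative_advantages(with_gold)

print(flat)      # every advantage is 0.0
print(contrast)  # the gold sample is positive, the failed rollouts negative
```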

42. An approach for systematic decomposition of complex llm tasks

Authors: Tianle Zhou , Jiakai Xu , Guanhong Liu , Jiaxiang Liu , Haonan Wang , Eugene Wu
URL: https://arxiv.org/abs/2510.07772
Abstract:

Large Language Models (LLMs) suffer from reliability issues on complex tasks, as existing decomposition methods are heuristic and rely on agent or manual decomposition. This work introduces a novel, systematic decomposition framework that we call Analysis of CONstraint-Induced Complexity (ACONIC), which models the task as a constraint problem and leverages formal complexity measures to guide decomposition. On combinatorial (SATBench) and LLM database querying (Spider) tasks, we find that by decomposing the tasks according to the measure of complexity, agents can perform considerably better (by 10-40 percentage points).

43. From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation

Authors: Xiangwei Lv , JinLuan Yang , Wang Lin , Jingyuan Chen , Beishui Liao
URL: https://arxiv.org/abs/2510.07762
Abstract:

Graph domain adaptation (GDA) has attracted great attention due to its effectiveness in addressing the domain shift between train and test data. A significant bottleneck in existing graph domain adaptation methods is their reliance on source-domain data, which is often unavailable due to privacy or security concerns. This limitation has driven the development of Test-Time Graph Domain Adaptation (TT-GDA), which aims to transfer knowledge without accessing the source examples. Inspired by the generative power of large language models (LLMs), we introduce a novel framework that reframes TT-GDA as a generative graph restoration problem, “restoring the target graph to its pristine, source-domain-like state”. There are two key challenges: (1) We need to construct a reasonable graph restoration process and design an effective encoding scheme that an LLM can understand, bridging the modality gap. (2) We need to devise a mechanism to ensure the restored graph acquires the intrinsic features of the source domain, even without access to the source data. To ensure the effectiveness of graph restoration, we propose GRAIL, which restores the target graph into a state that is well-aligned with the source domain. Specifically, we first compress the node representations into compact latent features and then use a graph diffusion process to model the graph restoration process. A quantization module then encodes the restored features into discrete tokens. Building on this, an LLM is fine-tuned as a generative restorer to transform a “noisy” target graph into a “native” one. To further improve restoration quality, we introduce a reinforcement learning process guided by specialized alignment and confidence rewards. Extensive experiments demonstrate the effectiveness of our approach across various datasets.

44. Haibu Mathematical-Medical Intelligent Agent: Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains

Authors: Yilun Zhang , Dexing Kong
URL: https://arxiv.org/abs/2510.07748
Abstract:

Large Language Models (LLMs) show promise in medicine but are prone to factual and logical errors, which is unacceptable in this high-stakes field. To address this, we introduce the “Haibu Mathematical-Medical Intelligent Agent” (MMIA), an LLM-driven architecture that ensures reliability through a formally verifiable reasoning process. MMIA recursively breaks down complex medical tasks into atomic, evidence-based steps. This entire reasoning chain is then automatically audited for logical coherence and evidence traceability, similar to theorem proving. A key innovation is MMIA’s “bootstrapping” mode, which stores validated reasoning chains as “theorems.” Subsequent tasks can then be efficiently solved using Retrieval-Augmented Generation (RAG), shifting from costly first-principles reasoning to a low-cost verification model. We validated MMIA across four healthcare administration domains, including DRG/DIP audits and medical insurance adjudication, using expert-validated benchmarks. Results showed MMIA achieved an error detection rate exceeding 98% with a false positive rate below 1%, significantly outperforming baseline LLMs. Furthermore, the RAG matching mode is projected to reduce average processing costs by approximately 85% as the knowledge base matures. In conclusion, MMIA’s verifiable reasoning framework is a significant step toward creating trustworthy, transparent, and cost-effective AI systems, making LLM technology viable for critical applications in medicine.

45. SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation

Authors: Minh-Anh Nguye , Minh-Duc Nguyen , Nguyen Thi Ha Lan , Kieu Hai Dang , Nguyen Tien Dong , Le Duy Dung
URL: https://arxiv.org/abs/2510.07733
Abstract:

Large language models (LLMs) are increasingly adopted for automating survey paper generation \cite{wang2406autosurvey, liang2025surveyx, yan2025surveyforge,su2025benchmarking,wen2025interactivesurvey}. Existing approaches typically extract content from a large collection of related papers and prompt LLMs to summarize them directly. However, such methods often overlook the structural relationships among papers, resulting in generated surveys that lack a coherent taxonomy and a deeper contextual understanding of research progress. To address these shortcomings, we propose SurveyG, an LLM-based agent framework that integrates a hierarchical citation graph, where nodes denote research papers and edges capture both citation dependencies and semantic relatedness between their contents, thereby embedding structural and contextual knowledge into the survey generation process. The graph is organized into three layers: Foundation, Development, and Frontier, to capture the evolution of research from seminal works to incremental advances and emerging directions. By combining horizontal search within layers and vertical depth traversal across layers, the agent produces multi-level summaries, which are consolidated into a structured survey outline. A multi-agent validation stage then ensures consistency, coverage, and factual accuracy in generating the final survey. Experiments, including evaluations by human experts and LLM-as-a-judge, demonstrate that SurveyG outperforms state-of-the-art frameworks, producing surveys that are more comprehensive and better aligned with the underlying knowledge taxonomy of a field.

46. oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Authors: Ruiling Xu , Yifan Zhang , Qingyun Wang , Carl Edwards , Heng Ji
URL: https://arxiv.org/abs/2510.07731
Abstract:

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using a prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.

47. Control Synthesis of Cyber-Physical Systems for Real-Time Specifications through Causation-Guided Reinforcement Learning

Authors: Xiaochen Tang , Zhenya Zhang , Miaomiao Zhang , Jie An
URL: https://arxiv.org/abs/2510.07715
Abstract:

In real-time and safety-critical cyber-physical systems (CPSs), control synthesis must guarantee that generated policies meet stringent timing and correctness requirements under uncertain and dynamic conditions. Signal temporal logic (STL) has emerged as a powerful formalism for expressing real-time constraints, with its semantics enabling quantitative assessment of system behavior. Meanwhile, reinforcement learning (RL) has become an important method for solving control synthesis problems in unknown environments. Recent studies incorporate STL-based reward functions into RL to automatically synthesize control policies. However, the automatically inferred rewards obtained by these methods represent the global assessment of a whole or partial path but do not accurately accumulate the rewards of local changes, so the sparse global rewards may lead to non-convergence and unstable training performance. In this paper, we propose an online reward generation method guided by the online causation monitoring of STL. Our approach continuously monitors system behavior against an STL specification at each control step, computing the quantitative distance toward satisfaction or violation and thereby producing rewards that reflect instantaneous state dynamics. Additionally, we provide a smooth approximation of the causation semantics to overcome its discontinuity and make it differentiable for use with deep-RL methods. We have implemented a prototype tool and evaluated it in the Gym environment on a variety of continuously controlled benchmarks. Experimental results show that our proposed STL-guided RL method with online causation semantics outperforms existing relevant STL-guided RL methods, providing a more robust and efficient reward generation framework for deep-RL.
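
The idea of a smooth, differentiable stand-in for a discontinuous STL quantity can be illustrated with the standard robustness of "always (x < c)", which is the minimum of c - x_t over the trace. The sketch below is not the paper's causation semantics; it only shows a common log-sum-exp soft-min under that simpler semantics. The function names and the `beta` sharpness parameter are assumptions made for illustration.

```python
import math


def soft_min(values, beta=10.0):
    """Smooth lower bound on min(values) using log-sum-exp.

    Approaches the exact minimum as beta grows; the gap is at most
    log(len(values)) / beta. (beta is an illustrative parameter.)
    """
    m = min(values)
    # Shift by the exact minimum for numerical stability.
    total = sum(math.exp(-beta * (v - m)) for v in values)
    return m - math.log(total) / beta


def robustness_always_lt(trace, c):
    """Exact STL robustness of 'always (x < c)': min over time of (c - x_t)."""
    return min(c - x for x in trace)


def smooth_robustness_always_lt(trace, c, beta=10.0):
    """Differentiable approximation of the robustness above (assumed form)."""
    return soft_min([c - x for x in trace], beta)


if __name__ == "__main__":
    trace = [0.2, 0.5, 0.9]
    print(robustness_always_lt(trace, 1.0))
    print(smooth_robustness_always_lt(trace, 1.0, beta=10.0))
```

A larger `beta` tracks the exact minimum more closely at the cost of sharper gradients, which is the usual trade-off when such approximations are used inside gradient-based RL.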

Authors: Alhim Vera , Karen Sanchez , Carlos Hinojosa , Haidar Bin Hamid , Donghoon Kim , Bernard Ghanem
URL: https://arxiv.org/abs/2510.07709
Abstract:

Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models - Claude, GPT-4o mini, and Qwen-VL - five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.

49. Safely Exploring Novel Actions in Recommender Systems via Deployment-Efficient Policy Learning

Authors: Haruka Kiyohara , Yusuke Narita , Yuta Saito , Kei Tateno , Takuma Udagawa
URL: https://arxiv.org/abs/2510.07635
Abstract:

In many real recommender systems, novel items are added frequently over time. The importance of sufficiently presenting novel actions has been widely acknowledged for improving long-term user engagement. Recent work builds on Off-Policy Learning (OPL), which trains a policy from logged data alone; however, existing methods can be unsafe in the presence of novel actions. Our goal is to develop a framework to enforce exploration of novel actions with a guarantee for safety. To this end, we first develop Safe Off-Policy Policy Gradient (Safe OPG), which is a model-free safe OPL method based on a high confidence off-policy evaluation. In our first experiment, we observe that Safe OPG almost always satisfies a safety requirement, even when existing methods violate it greatly. However, the result also reveals that Safe OPG tends to be too conservative, suggesting a difficult tradeoff between guaranteeing safety and exploring novel actions. To overcome this tradeoff, we also propose a novel framework called Deployment-Efficient Policy Learning for Safe User Exploration, which leverages safety margin and gradually relaxes safety regularization during multiple (not many) deployments. Our framework thus enables exploration of novel actions while guaranteeing safe implementation of recommender systems.

50. Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

Authors: Yinglun Zhu , Jiancheng Zhang , Fuzhi Tang
URL: https://arxiv.org/abs/2510.07632
Abstract:

Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
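
To make the contrast between a standard group metric and a matching-based one concrete, here is a minimal sketch. It assumes a Winoground-style group of k images and k captions with a similarity matrix `s[i][j]`, and it assumes that the "group matching score" asks whether the correct one-to-one assignment has the strictly highest total similarity; the paper's exact definition may differ, and all function names are hypothetical.

```python
from itertools import permutations


def standard_group_score(s):
    """Winoground-style group score for a 2x2 group.

    s[i][j] is the similarity of image i and caption j. Every pairwise
    comparison (per image and per caption) must favor the correct pair.
    """
    text_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    image_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    return text_ok and image_ok


def group_matching_score(s):
    """Assumed matching-based score: the identity assignment must have a
    strictly larger total similarity than every other permutation."""
    k = len(s)
    identity = tuple(range(k))
    correct_total = sum(s[i][i] for i in range(k))
    return all(
        correct_total > sum(s[i][p[i]] for i in range(k))
        for p in permutations(range(k))
        if p != identity
    )


if __name__ == "__main__":
    # Image 1 scores both captions highly, so one pairwise check fails,
    # yet the correct matching still has the largest total.
    s = [[0.30, 0.25],
         [0.33, 0.31]]
    print(standard_group_score(s), group_matching_score(s))
```

In this example the pairwise metric rejects the group while the matching view accepts it, which is the kind of hidden capability the abstract describes.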

51. A Case for Leveraging Generative AI to Expand and Enhance Training in the Provision of Mental Health Services

Authors: Hannah R. Lawrence , Shannon Wiltsey Stirman , Samuel Dorison , Taedong Yun , Megan Jones Bell
URL: https://arxiv.org/abs/2510.07623
Abstract:

Generative artificial intelligence (Generative AI) is transforming healthcare. With this evolution comes optimism regarding the impact it will have on mental health, as well as concern regarding the risks that come with generative AI operating in the mental health domain. Much of the investment in, and academic and public discourse about, AI-powered solutions for mental health has focused on therapist chatbots. Despite the common assumption that chatbots will be the most impactful application of GenAI to mental health, we make the case here for a lower-risk, high impact use case: leveraging generative AI to enhance and scale training in mental health service provision. We highlight key benefits of using generative AI to help train people to provide mental health services and present a real-world case study in which generative AI improved the training of veterans to support one another’s mental health. With numerous potential applications of generative AI in mental health, we illustrate why we should invest in using generative AI to support training people in mental health service provision.

52. Traceability and Accountability in Role-Specialized Multi-Agent LLM Pipelines

Authors: Amine Barrak
URL: https://arxiv.org/abs/2510.07614
Abstract:

Sequential multi-agent systems built with large language models (LLMs) can automate complex software tasks, but they are hard to trust because errors quietly pass from one stage to the next. We study a traceable and accountable pipeline, meaning a system with clear roles, structured handoffs, and saved records that let us trace who did what at each step and assign blame when things go wrong. Our setting is a Planner -> Executor -> Critic pipeline. We evaluate eight configurations of three state-of-the-art LLMs on three benchmarks and analyze where errors start, how they spread, and how they can be fixed. Our results show: (1) adding a structured, accountable handoff between agents markedly improves accuracy and prevents the failures common in simple pipelines; (2) models have clear role-specific strengths and risks (e.g., steady planning vs. high-variance critiquing), which we quantify with repair and harm rates; and (3) accuracy-cost-latency trade-offs are task-dependent, with heterogeneous pipelines often the most efficient. Overall, we provide a practical, data-driven method for designing, tracing, and debugging reliable, predictable, and accountable multi-agent systems.

53. AgentAsk: Multi-Agent Systems Need to Ask

Authors: Bohan Lin , Kuo Yang , Yingchuan Lai , Yudong Zhang , Chen Zhang , Guibin Zhang , Xinlei Yu , Miao Yu , Xu Wang , Yang Wang
URL: https://arxiv.org/abs/2510.07593
Abstract:

Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor. However, they frequently underperform single-agent baselines due to edge-level error cascades: minor inaccuracies at one message handoff propagate across the entire chain. We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation. AgentAsk follows a three-stage pipeline: (i) distilling edge-level judgments from curated failure traces into a compact policy, (ii) supervising the policy to determine when/what/whom/how to ask, and (iii) optimizing online with E-GRPO, a reinforcement learning objective that balances accuracy, latency, and cost. The module is architecture-agnostic and easy to integrate into existing orchestration. Across math, reasoning, and coding benchmarks, AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with latency and extra cost all less than 5%, approaching the performance of a strong evaluator. Beyond empirical improvements, we contribute a principled taxonomy of edge-level errors and a practical recipe for link-local intervention, offering a scalable pathway toward more reliable LLM-based multi-agent systems.

54. Benchmarking is Broken - Don’t Let AI be its Own Judge

Authors: Zerui Cheng , Stella Wohnig , Ruchika Gupta , Samiul Alam , Tassallah Abdullahi , João Alves Ribeiro , Christian Nielsen-Garcia , Saif Mir , Siran Li , Jason Orender , Seyed Ali Bahrainian , Daniel Kirste , Aaron Gokaslan , Mikołaj Glinka , Carsten Eickhoff , Ruben Wolff
URL: https://arxiv.org/abs/2510.07575
Abstract:

The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this “Wild West” of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody’s. In high-stakes human examinations (e.g., SAT, GRE), substantial effort is devoted to ensuring fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today’s AI evaluation, distill the essential requirements for a new generation of assessments, and introduce PeerBench, a community-governed, proctored evaluation blueprint that embodies this paradigm through sealed execution, item banking with rolling renewal, and delayed transparency. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.

55. An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Authors: Harshit Rajgarhia , Suryam Gupta , Asif Shaik , Gulipalli Praveen Kumar , Y Santhoshraj , Sanka Nithya Tanvy Nishitha , Abhishek Mukherji
URL: https://arxiv.org/abs/2510.07551
Abstract:

The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP’s modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
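
The hybrid pattern described here (deterministic regular expressions proposing candidates, a context-aware model filtering them) can be sketched as follows. This is not RECAP's implementation: the two patterns are simplified illustrations, and `keyword_context_filter` is a hypothetical stand-in for the LLM-based disambiguation step.

```python
import re

# Illustrative patterns only; a real system would use locale-specific rules.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text):
    """Stage 1: high-recall candidate spans from deterministic patterns."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append({"label": label, "start": m.start(),
                          "end": m.end(), "text": m.group()})
    return sorted(spans, key=lambda s: s["start"])


def keyword_context_filter(text, span, window=25):
    """Stage 2 stand-in (assumption): keep a PHONE candidate only if a cue
    word appears shortly before it. An LLM call would replace this."""
    if span["label"] != "PHONE":
        return True
    left = text[max(0, span["start"] - window):span["start"]].lower()
    return any(cue in left for cue in ("phone", "call", "tel"))


def detect_pii(text, context_filter=keyword_context_filter):
    """Run both stages and return the surviving spans in document order."""
    return [s for s in regex_candidates(text) if context_filter(text, s)]


if __name__ == "__main__":
    sample = ("Order 2024-10-10-0001 shipped. "
              "Call +1 415-555-0100 or mail ana@example.com.")
    for span in detect_pii(sample):
        print(span["label"], span["text"])
```

Passing the filter as a parameter keeps the regex stage fixed while the disambiguation step is swapped, which mirrors the modular design the abstract mentions.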

56. Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization

Authors: Hyeong Kyu Choi , Xiaojin Zhu , Yixuan Li
URL: https://arxiv.org/abs/2510.07517
Abstract:

Multi-agent debate (MAD) aims to improve large language model (LLM) reasoning by letting multiple agents exchange answers and then aggregate their opinions. Yet recent studies reveal that agents are not neutral: they are prone to identity-driven sycophancy and self-bias, uncritically adopting a peer’s view or stubbornly adhering to their own prior output, undermining the reliability of debate. In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. First, we formalize the debate dynamics as an identity-weighted Bayesian update process. Second, we propose response anonymization: by removing identity markers from prompts, agents cannot distinguish “self” from “peer”, which forces equal weights on agent identity, thereby reducing bias. Third, we define the Identity Bias Coefficient (IBC), a principled metric that measures how often an agent follows a peer versus itself. Empirical studies across multiple models, datasets and debate rounds confirm that identity bias is widespread, with sycophancy far more common than self-bias. Our findings highlight the need to “mask” identity to ensure that MAD systems reason based on content rather than source identity. Code is released in this https URL .
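
Two pieces of this abstract lend themselves to a small sketch: stripping identity markers before responses are shown to an agent, and measuring how often an agent follows a peer versus itself. The code below is an assumption-laden illustration, not the paper's formulation; in particular `identity_gap` is a simple difference of follow rates and should not be read as the paper's Identity Bias Coefficient, and the record fields are hypothetical.

```python
import random


def anonymize(responses, seed=0):
    """Drop identity labels and shuffle order so 'self' and 'peer'
    responses are indistinguishable in the prompt."""
    texts = [r["text"] for r in responses]
    random.Random(seed).shuffle(texts)
    return [f"Response {i + 1}: {t}" for i, t in enumerate(texts)]


def follow_rates(records):
    """Fractions of disagreement cases where the agent's new answer matches
    the peer's previous answer or its own previous answer."""
    cases = [r for r in records if r["own_prev"] != r["peer_prev"]]
    if not cases:
        return 0.0, 0.0
    peer = sum(r["new"] == r["peer_prev"] for r in cases) / len(cases)
    own = sum(r["new"] == r["own_prev"] for r in cases) / len(cases)
    return peer, own


def identity_gap(records):
    """Illustrative statistic: positive means peer-following (sycophancy)
    dominates, negative means self-adherence dominates."""
    peer, own = follow_rates(records)
    return peer - own


if __name__ == "__main__":
    records = [
        {"own_prev": "A", "peer_prev": "B", "new": "B"},
        {"own_prev": "A", "peer_prev": "C", "new": "C"},
        {"own_prev": "D", "peer_prev": "B", "new": "B"},
        {"own_prev": "A", "peer_prev": "B", "new": "A"},
        {"own_prev": "A", "peer_prev": "A", "new": "A"},  # agreement, ignored
    ]
    print(follow_rates(records), identity_gap(records))
```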

57. CompassLLM: A Multi-Agent Approach toward Geo-Spatial Reasoning for Popular Path Query

Authors: Md. Nazmul Islam Ananto , Shamit Fatin , Mohammed Eunus Ali , Md Rizwan Parvez
URL: https://arxiv.org/abs/2510.07516
Abstract:

The popular path query - identifying the most frequented routes between locations from historical trajectory data - has important applications in urban planning, navigation optimization, and travel recommendations. While traditional algorithms and machine learning approaches have achieved success in this domain, they typically require model training, parameter tuning, and retraining when accommodating data updates. As Large Language Models (LLMs) demonstrate increasing capabilities in spatial and graph-based reasoning, there is growing interest in exploring how these models can be applied to geo-spatial problems. We introduce CompassLLM, a novel multi-agent framework that applies the reasoning capabilities of LLMs to the geo-spatial domain to solve the popular path query. CompassLLM employs its agents in a two-stage pipeline: a SEARCH stage that identifies popular paths, and a GENERATE stage that synthesizes novel paths when no existing path is found in the historical trajectory data. Experiments on real and synthetic datasets show that CompassLLM demonstrates superior accuracy in SEARCH and competitive performance in GENERATE while being cost-effective.

58. Optimizing Ethical Risk Reduction for Medical Intelligent Systems with Constraint Programming

Authors: Clotilde Brayé , Aurélien Bricout , Arnaud Gotlieb , Nadjib Lazaar , Quentin Vallet
URL: https://arxiv.org/abs/2510.07491
Abstract:

Medical Intelligent Systems (MIS) are increasingly integrated into healthcare workflows, offering significant benefits but also raising critical safety and ethical concerns. According to the European Union AI Act, most MIS will be classified as high-risk systems, requiring a formal risk management process to ensure compliance with the ethical requirements of trustworthy AI. In this context, we focus on risk reduction optimization problems, which aim to reduce risks with ethical considerations by finding the best balanced assignment of risk assessment values according to their coverage of trustworthy AI ethical requirements. We formalize this problem as a constrained optimization task and investigate three resolution paradigms: Mixed Integer Programming (MIP), Satisfiability (SAT), and Constraint Programming (CP). Our contributions include the mathematical formulation of this optimization problem, its modeling with the Minizinc constraint modeling language, and a comparative experimental study that analyzes the performance, expressiveness, and scalability of each approach to solving. From the identified limits of the methodology, we draw some perspectives of this work regarding the integration of the Minizinc model into a complete trustworthy AI ethical risk management process for MIS.

59. Evaluation of LLMs for Process Model Analysis and Optimization

Authors: Akhil Kumar , Jianliang Leon Zhao , Om Dobariya
URL: https://arxiv.org/abs/2510.07489
Abstract:

In this paper, we report our experience with several LLMs for their ability to understand a process model in an interactive, conversational style, find syntactical and logical errors in it, and reason with it in depth through a natural language (NL) interface. Our findings show that a vanilla, untrained LLM like ChatGPT (model o3) in a zero-shot setting is effective in understanding BPMN process models from images and answering queries about them intelligently at syntactic, logic, and semantic levels of depth. Further, different LLMs vary in performance in terms of their accuracy and effectiveness. Nevertheless, our empirical analysis shows that LLMs can play a valuable role as assistants for business process designers and users. We also study the LLM’s “thought process” and ability to perform deeper reasoning in the context of process analysis and optimization. We find that the LLMs seem to exhibit anthropomorphic properties.

60. ExpertAgent: Enhancing Personalized Education through Dynamic Planning and Retrieval-Augmented Long-Chain Reasoning

Authors: Binrong Zhu , Guiran Liu , Nina Jiang
URL: https://arxiv.org/abs/2510.07456
Abstract:

The application of advanced generative artificial intelligence in education is often constrained by the lack of real-time adaptability, personalization, and reliability of the content. To address these challenges, we propose ExpertAgent - an intelligent agent framework designed for personalized education that provides reliable knowledge and enables highly adaptive learning experiences. ExpertAgent is a learning agent that provides users with a proactive and personalized learning experience. It dynamically plans the learning content and strategy based on a continuously updated student model, thereby overcoming the limitations of traditional static learning content and providing optimized teaching strategies and learning experiences in real time. All instructional content is grounded in a validated curriculum repository, effectively reducing hallucination risks in large language models and improving reliability and trustworthiness.

61. TS-Agent: A Time Series Reasoning Agent with Iterative Statistical Insight Gathering

Authors: Penghang Liu , Elizabeth Fons , Svitlana Vyetrenko , Daniel Borrajo , Vamsi Potluru , Manuela Veloso
URL: https://arxiv.org/abs/2510.07432
Abstract:

Large language models (LLMs) have shown strong abilities in reasoning and problem solving, but recent studies reveal that they still struggle with time series reasoning tasks, where outputs are often affected by hallucination or knowledge leakage. In this work we propose TS-Agent, a time series reasoning agent that leverages LLMs strictly for what they excel at, i.e., gathering evidence and synthesizing it into conclusions through step-by-step reasoning, while delegating the extraction of statistical and structural information to time series analytical tools. Instead of mapping time series into text tokens, images, or embeddings, our agent interacts with raw numeric sequences through atomic operators, records outputs in an explicit evidence log, and iteratively refines its reasoning under the guidance of a self-critic and a final quality gate. This design avoids multi-modal alignment training, preserves the native form of time series, ensures interpretability and verifiability, and mitigates knowledge leakage or hallucination. Empirically, we evaluate the agent on established benchmarks. Our experiments show that TS-Agent achieves performance comparable to state-of-the-art LLMs on understanding benchmarks, and delivers significant improvements on reasoning tasks, where existing models often rely on memorization and fail in zero-shot settings.

62. Less is More: Strategic Expert Selection Outperforms Ensemble Complexity in Traffic Forecasting

Authors: Walid Guettala , Yufan Zhao , László Gulyás
URL: https://arxiv.org/abs/2510.07426
Abstract:

Traffic forecasting is fundamental to intelligent transportation systems, enabling congestion mitigation and emission reduction in increasingly complex urban environments. While recent graph neural network approaches have advanced spatial temporal modeling, existing mixture of experts frameworks like Time Enhanced Spatio Temporal Attention Model (TESTAM) lack explicit incorporation of physical road network topology, limiting their spatial capabilities. We present TESTAM+, an enhanced spatio temporal forecasting framework that introduces a novel SpatioSemantic Expert integrating physical road topology with data driven feature similarity through hybrid graph construction. TESTAM+ achieves significant improvements over TESTAM: 1.3% MAE reduction on METR LA (3.10 vs. 3.14) and 4.1% improvement on PEMS BAY (1.65 vs. 1.72). Through comprehensive ablation studies, we discover that strategic expert selection fundamentally outperforms naive ensemble aggregation. Individual experts demonstrate remarkable effectiveness: the Adaptive Expert achieves 1.63 MAE on PEMS BAY, outperforming the original three expert TESTAM (1.72 MAE), while the SpatioSemantic Expert matches this performance with identical 1.63 MAE. The optimal Identity + Adaptive configuration achieves an 11.5% MAE reduction compared to state of the art MegaCRN on METR LA (2.99 vs. 3.38), while reducing inference latency by 53.1% compared to the full four expert TESTAM+. Our findings reveal that fewer, strategically designed experts outperform complex multi expert ensembles, establishing new state of the art performance with superior computational efficiency for real time deployment.
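
The percentage improvements quoted above follow from the reported MAE pairs as relative reductions, (baseline - improved) / baseline. A short check, using only the MAE values stated in the abstract (the helper name is illustrative):

```python
def relative_reduction(baseline, improved):
    """Relative MAE reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # (baseline MAE, improved MAE) pairs as reported in the abstract.
    cases = {
        "TESTAM+ vs TESTAM, METR LA": (3.14, 3.10),
        "TESTAM+ vs TESTAM, PEMS BAY": (1.72, 1.65),
        "Identity + Adaptive vs MegaCRN, METR LA": (3.38, 2.99),
    }
    for name, (baseline, improved) in cases.items():
        print(f"{name}: {relative_reduction(baseline, improved):.1f}%")
```

Rounded to one decimal place, these reproduce the 1.3%, 4.1%, and 11.5% figures.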

63. ProSEA: Problem Solving via Exploration Agents

Authors: William Nguyen , Vinh Luong , Christopher Nguyen
URL: https://arxiv.org/abs/2510.07423
Abstract:

Large language models (LLMs) have empowered AI agents to tackle increasingly complex tasks. However, most existing agents remain limited to static planning and brittle interactions, falling short of true collaboration or adaptive reasoning. We introduce ProSEA, a modular, general-purpose multi-agent framework designed for iterative problem solving through exploration and plan evolution. ProSEA features a hierarchical architecture in which a Manager Agent orchestrates domain-specialized Expert Agents, decomposes tasks, and adaptively replans based on structured feedback from failed attempts. Unlike prior systems, ProSEA agents report not only success or failure but also detailed reasons for failure and newly discovered constraints, enabling dynamic plan refinement informed by exploratory traces. The framework operates autonomously but supports seamless integration with human collaborators when needed. Experiments on the challenging FinanceBench benchmark demonstrate that ProSEA, even without human feedback, outperforms state-of-the-art baselines and achieves robust performance across reasoning-heavy tasks. These results underscore ProSEA’s potential as a foundation for more transparent, adaptive, and human-aligned AI agents.

64. Position: AI Will Transform Neuropsychology Through Mental Health Digital Twins for Dynamic Mental Health Care, Especially for ADHD

Authors: Neil Natarajan , Sruthi Viswanathan , Xavier Roberts-Gaal , Michelle Marie Martel
URL: https://arxiv.org/abs/2510.07409
Abstract:

Static solutions don’t serve a dynamic mind. Thus, we advocate a shift from static mental health diagnostic assessments to continuous, artificial intelligence (AI)-driven assessment. Focusing on Attention-Deficit/Hyperactivity Disorder (ADHD) as a case study, we explore how generative AI has the potential to address current capacity constraints in neuropsychology, potentially enabling more personalized and longitudinal care pathways. In particular, AI can efficiently conduct frequent, low-level experience sampling from patients and facilitate diagnostic reconciliation across care pathways. We envision a future where mental health care benefits from continuous, rich, and patient-centered data sampling to dynamically adapt to individual patient needs and evolving conditions, thereby improving both accessibility and efficacy of treatment. We further propose the use of mental health digital twins (MHDTs) - continuously updated computational models that capture individual symptom dynamics and trajectories - as a transformative framework for personalized mental health care. We ground this framework in empirical evidence and map out the research agenda required to refine and operationalize it.

65. Base Models Know How to Reason, Thinking Models Learn When

Authors: Constantin Venhoff , Iván Arcuschin , Philip Torr , Arthur Conmy , Neel Nanda
URL: https://arxiv.org/abs/2510.07364
Abstract:

Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.

66. L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)

Authors: Tianxiang Xu , Zhichao Wen , Xinyu Zhao , Jun Wang , Yan Li , Chang Liu
URL: https://arxiv.org/abs/2510.07363
Abstract:

The increasing integration of Industrial IoT (IIoT) exposes critical cyber-physical systems to sophisticated, multi-stage attacks that elude traditional defenses lacking contextual awareness. This paper introduces L2M-AID, a novel framework for Autonomous Industrial Defense using LLM-empowered, Multi-agent reinforcement learning. L2M-AID orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. The core innovation lies in the deep fusion of two AI paradigms: we leverage an LLM as a semantic bridge to translate vast, unstructured telemetry into a rich, contextual state representation, enabling agents to reason about adversary intent rather than merely matching patterns. This semantically-aware state empowers a Multi-Agent Reinforcement Learning (MARL) algorithm, MAPPO, to learn complex cooperative strategies. The MARL reward function is uniquely engineered to balance security objectives (threat neutralization) with operational imperatives, explicitly penalizing actions that disrupt physical process stability. To validate our approach, we conduct extensive experiments on the benchmark SWaT dataset and a novel synthetic dataset generated based on the MITRE ATT&CK for ICS framework. Results demonstrate that L2M-AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single-agent RL baselines across key metrics, achieving a 97.2% detection rate while reducing false positives by over 80% and improving response times by a factor of four. Crucially, it demonstrates superior performance in maintaining physical process stability, presenting a robust new paradigm for securing critical national infrastructure.

67. Truth-Aware Decoding: A Program-Logic Approach to Factual Language Generation

Authors: Faruk Alpay , Hamdi Alakkad
URL: https://arxiv.org/abs/2510.07331
Abstract:

This paper introduces Truth-Aware Decoding (TAD), a verification-oriented decoding scheme that aligns neural language generation with knowledge bases. Situated in the tradition of probabilistic program semantics for sequence models, TAD augments modern instruction-tuned systems with a lattice of semantic guards that operate at decode time. Our contributions are fourfold: (i) a constraint-based semantics that renders oracle filtering as a program-logic judgment, (ii) a proof that greedy selection enjoys local likelihood dominance under sound and complete guards (Theorem 2.7), (iii) an entropy-style invariant that quantifies factual risk via knowledge-aware safe mass, and (iv) a multi-agent operational calculus with verified Lean artefacts to certify implementation behaviour. Numerical and algorithmic case studies confirm that the resulting guardrails reduce hallucinations without sacrificing throughput, yielding a pragmatic bridge between large-scale empirical models and formal verification.
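
To make the idea of a decode-time guard concrete, here is a minimal sketch of one guarded greedy step. It is not the paper's calculus: the function `guarded_greedy_step`, the toy knowledge base, and the reading of "safe mass" as the total probability on guard-accepted tokens are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


def guarded_greedy_step(
    probs: Dict[str, float],
    guard: Callable[[str], bool],
) -> Tuple[Optional[str], float]:
    """Return the most likely guard-accepted token and the accepted probability mass."""
    safe = {tok: p for tok, p in probs.items() if guard(tok)}
    safe_mass = sum(safe.values())
    if not safe:
        # No candidate passes the guard; the caller decides how to abstain.
        return None, 0.0
    best = max(safe, key=safe.get)
    return best, safe_mass


# Toy next-token distribution for "The capital of France is ...".
KNOWN_CAPITALS = {"Paris"}  # hypothetical stand-in for a knowledge base
probs = {"Lyon": 0.5, "Paris": 0.4, "Nice": 0.1}

token, mass = guarded_greedy_step(probs, lambda t: t in KNOWN_CAPITALS)
print(token, mass)
```

In this toy case the unguarded argmax would be "Lyon"; the guard removes it, so the step selects "Paris", and the low accepted mass (0.4) signals that most of the model's probability sat on rejected tokens.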

68. BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation

Authors: Rocktim Jyoti Das , Harsh Singh , Diana Turmakhan , Muhammad Abdullah Sohail , Mingfei Han , Preslav Nakov , Fabio Pizzati , Ivan Laptev
URL: https://arxiv.org/abs/2510.08572
Abstract:

Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, the scale of existing datasets typically suffers from the need for manual data collection and curation. To address this problem, here we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and to improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator’s state, we demonstrate direct transfer of acquired skills to sensor-based manipulation. Through extensive experiments, we show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves on tasks outside of its training pool and enables downscaling of LLM models. Our code and data will be made publicly available on the project page.

69. ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation

Authors: Qin Liu , Jacob Dineen , Yuxi Huang , Sheng Zhang , Hoifung Poon , Ben Zhou , Muhao Chen
URL: https://arxiv.org/abs/2510.08569
Abstract:

Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.

70. NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos

Authors: Hongyu Li , Lingfeng Sun , Yafei Hu , Duy Ta , Jennifer Barry , George Konidaris , Jiahui Fu
URL: https://arxiv.org/abs/2510.08568
Abstract:

Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training. Project website: this https URL .

71. MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Authors: Tajamul Ashraf , Umair Nawaz , Abdelrahman M. Shaker , Rao Anwer , Philip Torr , Fahad Shahbaz Khan , Salman Khan
URL: https://arxiv.org/abs/2510.08567
Abstract:

Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at this https URL .

72. SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

Authors: Andong Deng , Taojiannan Yang , Shoubin Yu , Lincoln Spencer , Mohit Bansal , Chen Chen , Serena Yeung-Levy , Xiaohan Wang
URL: https://arxiv.org/abs/2510.08559
Abstract:

Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception and recognition and involve relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models’ higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench will serve the interests of the community and help push the boundary of cutting-edge AI for broader science.

Authors: Yunzhe Xu , Yiyuan Pan , Zhe Liu
URL: https://arxiv.org/abs/2510.08553
Abstract:

Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir’s effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at this https URL .
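
The retrieval step described here, in which an imagined future state acts as a query over stored memories, can be illustrated with a plain similarity lookup. The sketch below is only a stand-in: the vectors, the viewpoint names, and the use of cosine similarity with top-k selection are assumptions, and the world model that would produce `imagined_state` is not modeled at all.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: Sequence[float], memory: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the keys of the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda key: cosine(query, memory[key]), reverse=True)
    return ranked[:k]


# Hypothetical viewpoint-anchored memory entries (embeddings are made up).
memory = {
    "viewpoint_a": [1.0, 0.0],
    "viewpoint_b": [0.7, 0.7],
    "viewpoint_c": [0.0, 1.0],
}
imagined_state = [0.9, 0.1]  # would come from the world model in the real system

top = retrieve(imagined_state, memory, k=2)
print(top)
```

The point of the design is that only the entries closest to the predicted future state are passed on, rather than the entire memory or a fixed-horizon window.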

74. VideoNorms: Benchmarking Cultural Awareness of Video Language Models

Authors: Nikhil Reddy Varimalla , Yunfei Xu , Arkadiy Saakyan , Meng Fan Wang , Smaranda Muresan
URL: https://arxiv.org/abs/2510.08543
Abstract:

As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models’ cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than adherence; 2) models perform worse w.r.t. Chinese culture compared to US culture; 3) models have more difficulty in providing non-verbal evidence compared to verbal for the norm adherence/violation label and struggle to identify the exact norm corresponding to a speech-act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.

75. On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Authors: Joe Suk , Yaqi Duan
URL: https://arxiv.org/abs/2510.08539
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below 100%. We validate these predictions through controlled bandit simulations and LLM experiments, including training Qwen2.5-7B with GRPO.

76. Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

Authors: Rishubh Parihar , Or Patashnik , Daniil Ostashev , R. Venkatesh Babu , Daniel Cohen-Or , Kuan-Chieh Wang
URL: https://arxiv.org/abs/2510.08532
Abstract:

Instruction-based image editing offers a powerful and intuitive way to manipulate images through natural language. Yet, relying solely on text instructions limits fine-grained control over the extent of edits. We introduce Kontinuous Kontext, an instruction-driven editing model that provides a new dimension of control over edit strength, enabling users to adjust edits gradually from no change to a fully realized result in a smooth and continuous manner. Kontinuous Kontext extends a state-of-the-art image editing model to accept an additional input, a scalar edit strength which is then paired with the edit instruction, enabling explicit control over the extent of the edit. To inject this scalar information, we train a lightweight projector network that maps the input scalar and the edit instruction to coefficients in the model’s modulation space. For training our model, we synthesize a diverse dataset of image-edit-instruction-strength quadruplets using existing generative models, followed by a filtering stage to ensure quality and consistency. Kontinuous Kontext provides a unified approach for fine-grained control over edit strength for instruction driven editing from subtle to strong across diverse operations such as stylization, attribute, material, background, and shape changes, without requiring attribute-specific training.

77. SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Authors: Hongxing Li , Dingming Li , Zixuan Wang , Yuchen Yan , Hang Wu , Wenqi Zhang , Yongliang Shen , Weiming Lu , Jun Xiao , Yueting Zhuang
URL: https://arxiv.org/abs/2510.08531
Abstract:

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

78. CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Authors: Xiangyuan Xue , Yifan Zhou , Guibin Zhang , Zaibin Zhang , Yijiang Li , Chen Zhang , Zhenfei Yin , Philip Torr , Wanli Ouyang , Lei Bai
URL: https://arxiv.org/abs/2510.08529
Abstract:

Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent’s policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.

79. To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Authors: Jiayun Luo , Wan-Cyuan Fan , Lyuyang Wang , Xiangteng He , Tanzila Rahman , Purang Abolmaesumi , Leonid Sigal
URL: https://arxiv.org/abs/2510.08510
Abstract:

Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end – the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core – the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks – a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

80. AI-Driven Radiology Report Generation for Traumatic Brain Injuries

Authors: Riadh Bouslimi , Houda Trabelsi , Wahiba Ben Abdssalem Karaa , Hana Hedhli
URL: https://arxiv.org/abs/2510.08498
Abstract:

Traumatic brain injuries present significant diagnostic challenges in emergency medicine, where the timely interpretation of medical images is crucial for patient outcomes. In this paper, we propose a novel AI-based approach for automatic radiology report generation tailored to cranial trauma cases. Our model integrates an AC-BiFPN with a Transformer architecture to capture and process complex medical imaging data such as CT and MRI scans. The AC-BiFPN extracts multi-scale features, enabling the detection of intricate anomalies like intracranial hemorrhages, while the Transformer generates coherent, contextually relevant diagnostic reports by modeling long-range dependencies. We evaluate the performance of our model on the RSNA Intracranial Hemorrhage Detection dataset, where it outperforms traditional CNN-based models in both diagnostic accuracy and report generation. This solution not only supports radiologists in high-pressure environments but also provides a powerful educational tool for trainee physicians, offering real-time feedback and enhancing their learning experience. Our findings demonstrate the potential of combining advanced feature extraction with transformer-based text generation to improve clinical decision-making in the diagnosis of traumatic brain injuries.

81. DeepPrune: Parallel Scaling without Inter-trace Redundancy

Authors: Shangqing Tu , Yaxuan Li , Yushi Bai , Lei Hou , Juanzi Li
URL: https://arxiv.org/abs/2510.08483
Abstract:

Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy – our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces, which achieves 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces token usage by over 80% compared to conventional consensus sampling in most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: this https URL
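
The online greedy clustering mentioned above can be sketched as a single pass that compares each new trace with one representative per existing cluster. This is an illustrative reading rather than the authors' implementation: `greedy_cluster`, the 0.5 threshold, and `toy_judge` (which simply compares the last word of two strings instead of predicting equivalence from partial traces) are hypothetical.

```python
from typing import Callable, List


def greedy_cluster(
    traces: List[str],
    same_answer_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Assign each trace to the first cluster whose representative it matches."""
    clusters: List[List[str]] = []
    for trace in traces:
        for cluster in clusters:
            # cluster[0] is the representative; a match marks the trace redundant.
            if same_answer_prob(cluster[0], trace) >= threshold:
                cluster.append(trace)
                break
        else:
            # No existing cluster matched, so this trace opens a new one.
            clusters.append([trace])
    return clusters


def toy_judge(a: str, b: str) -> float:
    """Stand-in judge: 1.0 if both traces end in the same word, else 0.0."""
    return 1.0 if a.split()[-1] == b.split()[-1] else 0.0


traces = ["x so 42", "y thus 42", "z hence 7", "w so 42"]
clusters = greedy_cluster(traces, toy_judge)
kept = [cluster[0] for cluster in clusters]  # traces that would continue decoding
print(kept)
```

Here four traces collapse to two representatives, one per distinct answer, which is the sense in which pruning can cut tokens while keeping answer diversity.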

82. Platform-Agnostic Modular Architecture for Quantum Benchmarking

Authors: Neer Patel , Anish Giri , Hrushikesh Pramod Patil , Noah Siekierski , Avimita Chatterjee , Sonika Johri , Timothy Proctor , Thomas Lubinski , Siyuan Niu
URL: https://arxiv.org/abs/2510.08469
Abstract:

We present a platform-agnostic modular architecture that addresses the increasingly fragmented landscape of quantum computing benchmarking by decoupling problem generation, circuit execution, and results analysis into independent, interoperable components. Supporting over 20 benchmark variants ranging from simple algorithmic tests like Bernstein-Vazirani to complex Hamiltonian simulation with observable calculations, the system integrates with multiple circuit generation APIs (Qiskit, CUDA-Q, Cirq) and enables diverse workflows. We validate the architecture through successful integration with Sandia’s pyGSTi for advanced circuit analysis and CUDA-Q for multi-GPU HPC simulations. Extensibility of the system is demonstrated by implementing dynamic circuit variants of existing benchmarks and a new quantum reinforcement learning benchmark, which become readily available across multiple execution and analysis modes. Our primary contribution is identifying and formalizing modular interfaces that enable interoperability between incompatible benchmarking frameworks, demonstrating that standardized interfaces reduce ecosystem fragmentation while preserving optimization flexibility. This architecture has been developed as a key enhancement to the continually evolving QED-C Application-Oriented Performance Benchmarks for Quantum Computing suite.

83. Integral Signatures of Activation Functions: A 9-Dimensional Taxonomy and Stability Theory for Deep Learning

Authors: Ankur Mali , Lawrence Hall , Jake Williams , Gordon Richards
URL: https://arxiv.org/abs/2510.08456
Abstract:

Activation functions govern the expressivity and stability of neural networks, yet existing comparisons remain largely heuristic. We propose a rigorous framework for their classification via a nine-dimensional integral signature S_sigma(phi), combining Gaussian propagation statistics (m1, g1, g2, m2, eta), asymptotic slopes (alpha_plus, alpha_minus), and regularity measures (TV(phi’), C(phi)). This taxonomy establishes well-posedness, affine reparameterization laws with bias, and closure under bounded slope variation. Dynamical analysis yields Lyapunov theorems with explicit descent constants and identifies variance stability regions through (m2’, g2). From a kernel perspective, we derive dimension-free Hessian bounds and connect smoothness to bounded variation of phi’. Applying the framework, we classify eight standard activations (ReLU, leaky-ReLU, tanh, sigmoid, Swish, GELU, Mish, TeLU), proving sharp distinctions between saturating, linear-growth, and smooth families. Numerical Gauss-Hermite and Monte Carlo validation confirms theoretical predictions. Our framework provides principled design guidance, moving activation choice from trial-and-error to provable stability and kernel conditioning.
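
The abstract does not define the individual signature components, so the sketch below only illustrates the general kind of quantity involved: Gaussian expectations of an activation under Z ~ N(0, 1). The choice of E[phi(Z)] and E[phi(Z)^2] as examples is an assumption, and a plain trapezoid rule is used here in place of the Gauss-Hermite and Monte Carlo schemes the abstract mentions.

```python
import math
from typing import Callable


def gaussian_expectation(
    f: Callable[[float], float],
    lo: float = -10.0,
    hi: float = 10.0,
    steps: int = 20000,
) -> float:
    """Trapezoid-rule estimate of E[f(Z)] for a standard normal Z."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # half weight at the endpoints
        total += weight * f(x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)


def relu(x: float) -> float:
    return max(x, 0.0)


mean_relu = gaussian_expectation(relu)                       # analytic: 1 / sqrt(2*pi)
second_relu = gaussian_expectation(lambda x: relu(x) ** 2)   # analytic: 1/2
mean_tanh = gaussian_expectation(math.tanh)                  # analytic: 0 by symmetry

print(mean_relu, second_relu, mean_tanh)
```

Comparing the numerical values with the closed forms noted in the comments is a small-scale version of the numerical validation the abstract describes.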

84. gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity

Authors: Hugh Blayney , Álvaro Arroyo , Xiaowen Dong , Michael M. Bronstein
URL: https://arxiv.org/abs/2510.08450
Abstract:

Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed-size vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node’s representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our synthetic capacity task and on a range of real-world graph benchmarks.

85. Synthetic Series-Symbol Data Generation for Time Series Foundation Models

Authors: Wenxuan Wang , Kai Wu , Yujian Betterest Li , Dan Wang , Xiaoyu Zhang
URL: https://arxiv.org/abs/2510.08445
Abstract:

Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at this https URL .

86. Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

Authors: Andrew Lee , Ian Chuang , Dechen Gao , Kai Fukazawa , Iman Soltani
URL: https://arxiv.org/abs/2510.08442
Abstract:

Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent’s experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.

87. xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning

Authors: Cheng Qian , Zuxin Liu , Shirley Kokane , Akshara Prabhakar , Jielin Qiu , Haolin Chen , Zhiwei Liu , Heng Ji , Weiran Yao , Shelby Heinecke , Silvio Savarese , Caiming Xiong , Huan Wang
URL: https://arxiv.org/abs/2510.08439
Abstract:

Modern LLM deployments confront a widening cost-performance spectrum: premium models deliver strong reasoning but are expensive, while lightweight models are economical yet brittle on complex tasks. Static escalation rules and keyword heuristics under-utilize this spectrum and fail to adapt across task types. We present xRouter, a tool-calling-based routing system in which a learned router can either answer directly or invoke one or more external models. The router is trained end-to-end with reinforcement learning using an explicit, cost-aware reward that encodes cost-performance trade-offs, eliminating the need for hand-engineered routing rules. Our implementation encompasses the full reinforcement learning framework, including reward and cost accounting, as well as the deployment and evaluation pipelines. Across diverse benchmarks, xRouter achieves strong cost-performance trade-offs (e.g., substantial cost reductions at comparable task completion rates), and provides empirical insights into what reliably helps learned routing and what does not, ranging from model trainability to the difficulty of eliciting sophisticated orchestration behaviors in small open models. We hope these findings and our open implementation will serve as a practical substrate for advancing learned, cost-aware LLM orchestration.

88. ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing

Authors: Stella C. Dong , James R. Finlay
URL: https://arxiv.org/abs/2510.08429
Abstract:

Reinsurance treaty pricing must satisfy stringent regulatory standards, yet current quoting practices remain opaque and difficult to audit. We introduce ClauseLens, a clause-grounded reinforcement learning framework that produces transparent, regulation-compliant, and risk-aware treaty quotes. ClauseLens models the quoting task as a Risk-Aware Constrained Markov Decision Process (RA-CMDP). Statutory and policy clauses are retrieved from legal and underwriting corpora, embedded into the agent’s observations, and used both to constrain feasible actions and to generate clause-grounded natural language justifications. Evaluated in a multi-agent treaty simulator calibrated to industry data, ClauseLens reduces solvency violations by 51%, improves tail-risk performance by 27.9% (CVaR_0.10), and achieves 88.2% accuracy in clause-grounded explanations with retrieval precision of 87.4% and recall of 91.1%. These findings demonstrate that embedding legal context into both decision and explanation pathways yields interpretable, auditable, and regulation-aligned quoting behavior consistent with Solvency II, NAIC RBC, and the EU AI Act.
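
The tail-risk metric quoted above, CVaR at level 0.10, is the average outcome over the worst 10% of cases. A minimal empirical estimator is sketched below; it assumes outcomes are expressed as losses (larger is worse), which is a convention chosen here and not stated in the abstract. The function name is hypothetical.

```python
import math

def cvar(losses, alpha=0.10):
    """Mean of the worst alpha-fraction of losses (larger loss = worse)."""
    losses = list(losses)
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(losses, reverse=True)
    # Keep at least one sample; the small epsilon guards against float round-up.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-12))
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

With ten equally likely losses, `cvar(losses, 0.10)` reduces to the single worst loss, while `alpha=1.0` recovers the plain mean.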

89. Prompts Generalize with Low Data: Non-vacuous Generalization Bounds for Optimizing Prompts with More Informative Priors

Authors: David Madras , Joshua Safyan , Qiuyi (Richard) Zhang
URL: https://arxiv.org/abs/2510.08413
Abstract:

Many prompt engineering techniques have been successful in practice, even when optimizing over a large prompt space with a small amount of task-specific data. Recent work has partially explained this success by showing generalization bounds which apply PAC-Bayes theory to the discrete prompt space, but they are non-vacuous only in data-rich scenarios. We argue that such widespread success can be more fully explained through more carefully considering data- or distribution-dependent perplexity, which acts as an effective prior and steers the optimization towards prompts that are more “natural” for the task at hand. We derive novel generalization bounds that are non-vacuous for data-scarce prompt optimization via more useful priors, formally analyzing how perplexity regularization tightens these bounds by limiting exploration. Empirically, we explore both the bounds’ effectiveness and the practical benefits of perplexity regularization in improving prompt generalization.
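
To see why a perplexity-based prior can tighten a bound in the low-data regime, consider the classical Occam-style bound for a countable hypothesis set: with probability at least 1 - delta, every prompt p satisfies L(p) <= L_hat(p) + sqrt((ln(1/P(p)) + ln(1/delta)) / (2n)) for a loss in [0, 1] and a prior P fixed before seeing the n samples. This is a textbook bound used here for illustration, not the bound derived in the paper. If P is a language model, ln(1/P(p)) is the prompt's total negative log-likelihood, assumed below to equal the token count times the log of the per-token perplexity.

```python
import math

def prompt_nll(num_tokens, perplexity):
    """Total negative log-likelihood (nats) of a prompt under the prior."""
    return num_tokens * math.log(perplexity)

def occam_bound(train_error, prompt_nll_nats, n_samples, delta=0.05):
    """Upper bound on the true error of a single prompt, loss in [0, 1]."""
    complexity = prompt_nll_nats + math.log(1.0 / delta)
    return train_error + math.sqrt(complexity / (2.0 * n_samples))

# Hypothetical numbers: a 20-token prompt, 200 labelled examples, 10% train error.
natural = occam_bound(0.10, prompt_nll(20, perplexity=10.0), n_samples=200)
unnatural = occam_bound(0.10, prompt_nll(20, perplexity=1000.0), n_samples=200)
```

Under these made-up numbers the low-perplexity prompt receives the smaller bound, which is the qualitative effect the abstract attributes to more informative priors.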

90. Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

Authors: Noor Ul Zain , Mohsin Raza , Ahsan Adeel
URL: https://arxiv.org/abs/2510.08404
Abstract:

We show that a tiny Co$^4$ machine (Adeel, 2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2)$) and GPT-BERT (30M, 12 layers, $O(N^2)$) in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample-efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.

91. FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts

Authors: Heming Zou , Yunliang Zang , Wutong Xu , Yao Zhu , Xiangyang Ji
URL: https://arxiv.org/abs/2510.08396
Abstract:

Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains – general knowledge understanding, scientific question answering, mathematical reasoning, and code generation – demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at this https URL .
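
One possible reading of the two components described above is sketched below: a frozen sparse random matrix plays the role of the down-projection and implicit router, and only the up-projection columns for the most strongly activated ranks contribute to the update. The top-k-by-magnitude selection rule, the sparsity level, and all names are assumptions made for illustration; the paper's actual routing rule may differ.

```python
import numpy as np

def make_frozen_sparse_projection(rank, d_in, density=0.1, seed=0):
    """Sparse random down-projection that is never trained (assumed form)."""
    rng = np.random.default_rng(seed)
    mask = rng.random((rank, d_in)) < density
    return mask * rng.standard_normal((rank, d_in))

def flylora_like_delta(x, A_frozen, B_trainable, k):
    """Low-rank update for one input vector using only k active ranks."""
    h = A_frozen @ x                      # project the input to rank space
    top = np.argsort(np.abs(h))[-k:]      # implicit routing: k largest |h|
    return B_trainable[:, top] @ h[top]   # only the selected ranks contribute

A = make_frozen_sparse_projection(rank=8, d_in=16)
B = np.ones((4, 8))                       # stand-in for trainable weights
x = np.arange(16, dtype=float)
delta = flylora_like_delta(x, A, B, k=2)
```

Setting `k` equal to the rank recovers the ordinary dense LoRA update `B @ (A @ x)`, so the sparsity of activation is controlled by a single parameter in this sketch.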

92. Detecting Legend Items on Historical Maps Using GPT-4o with In-Context Learning

Authors: Sofia Kirsanova , Yao-Yi Chiang , Weiwei Duan
URL: https://arxiv.org/abs/2510.08385
Abstract:

Historical map legends are critical for interpreting cartographic symbols. However, their inconsistent layouts and unstructured formats make automatic extraction challenging. Prior work focuses primarily on segmentation or general optical character recognition (OCR), with few methods effectively matching legend symbols to their corresponding descriptions in a structured manner. We present a method that combines LayoutLMv3 for layout detection with GPT-4o using in-context learning to detect and link legend items and their descriptions via bounding box predictions. Our experiments show that GPT-4o with structured JSON prompts outperforms the baseline, achieving 88% F-1 and 85% IoU, and reveal how prompt design, example counts, and layout alignment affect performance. This approach supports scalable, layout-aware legend parsing and improves the indexing and searchability of historical maps across various visual styles.

93. Airy: Reading Robot Intent through Height and Sky

Authors: Baoyang Chen , Xian Xu , Huamin Qu
URL: https://arxiv.org/abs/2510.08381
Abstract:

As industrial robots move into shared human spaces, their opaque decision making threatens safety, trust, and public oversight. This artwork, Airy, asks whether complex multi-agent AI can become intuitively understandable by staging a competition between two reinforcement-trained robot arms that snap a bedsheet skyward. The installation builds on three design principles: competition as a clear metric (who lifts higher), embodied familiarity (audiences recognize fabric snapping), and sensor-to-sense mapping (robot cooperation or rivalry shown through forest and weather projections). Together these give viewers a visceral way to read machine intent. Observations from five international exhibitions indicate that audiences consistently read the robots’ strategies, conflict, and cooperation in real time, with emotional reactions that mirror the system’s internal state. The project shows how sensory metaphors can turn a black box into a public interface.

94. Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception

Authors: Nikos Theodoridis , Tim Brophy , Reenu Mohandas , Ganesh Sistu , Fiachra Collins , Anthony Scanlan , Ciaran Eising
URL: https://arxiv.org/abs/2510.08352
Abstract:

Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not “shortsighted”, i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.

95. DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning

Authors: Daniel Jason Tan , Jiayang Chen , Dilruk Perera , Kay Choong See , Mengling Feng
URL: https://arxiv.org/abs/2510.08350
Abstract:

We introduce DeepEN, a deep reinforcement learning (RL) framework for personalized enteral nutrition (EN) in critically ill patients. Trained offline on over 11,000 ICU patients from the MIMIC-IV database, DeepEN generates 4-hourly recommendations for caloric, protein, and fluid intake tailored to each patient’s evolving physiology. The model integrates a curated, clinically informed state space with a custom reward function that balances short-term physiological and nutrition-related goals with long-term survival outcomes. Using a dueling double deep Q-network with conservative Q-learning regularization, DeepEN learns clinically realistic policies that align with high-value clinician actions while discouraging unsafe deviations. Across various qualitative and quantitative metrics, DeepEN outperforms clinician-derived and guideline-based policies, achieving a 3.7 ± 0.17 percentage-point reduction in estimated mortality (18.8% vs 22.5%) and improvements in key nutritional biomarkers. These findings highlight the potential of safe, data-driven personalization of EN therapy to improve outcomes beyond traditional guideline- or heuristic-based approaches.

96. Learning What’s Missing: Attention Dispersion and EMA Stabilization in Length Generalization

Authors: Pál Zsámboki , Benjamin Levi , David Ansel Josef Smith , Mitansh Kagalwala , Arlington Kell , Samuel Liechty , Cong Wang
URL: https://arxiv.org/abs/2510.08341
Abstract:

We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence – an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
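
The set complement target and the EMA stabilizer described above can both be written in a few lines. This is a minimal sketch: the function names, the flat parameter array, and the decay value are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def set_complement_target(sequence, vocab_size):
    """Uniform distribution over the tokens that do NOT appear in the input."""
    present = np.zeros(vocab_size, dtype=bool)
    present[np.asarray(sequence, dtype=int)] = True
    absent = ~present
    if not absent.any():
        raise ValueError("every token is present; the complement is empty")
    return absent / absent.sum()

def ema_update(ema_params, params, decay=0.99):
    """One exponential-moving-average step over a parameter array."""
    return decay * ema_params + (1.0 - decay) * params

target = set_complement_target([0, 2], vocab_size=4)
```

In the example, tokens 1 and 3 are absent, so the target places half of the probability mass on each. The EMA step averages noisy parameter iterates, which is the role the abstract hypothesizes for it when many next tokens are valid.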

97. Iterated Agent for Symbolic Regression

Authors: Zhuo-Yang Song , Zeyu Cai , Shutao Zhang , Jiashen Wei , Jichen Pan , Shi Qiu , Qing-Hong Cao , Tie-Jiun Hou , Xiaohui Liu , Ming-xing Luo , Hua Xing Zhu
URL: https://arxiv.org/abs/2510.08317
Abstract:

Symbolic regression (SR), the automated discovery of mathematical expressions from data, is a cornerstone of scientific inquiry. However, it is often hindered by the combinatorial explosion of the search space and a tendency to overfit. Popular methods, rooted in genetic programming, explore this space syntactically, often yielding overly complex, uninterpretable models. This paper introduces IdeaSearchFitter, a framework that employs Large Language Models (LLMs) as semantic operators within an evolutionary search. By generating candidate expressions guided by natural-language rationales, our method biases discovery towards models that are not only accurate but also conceptually coherent and interpretable. We demonstrate IdeaSearchFitter’s efficacy across diverse challenges: it achieves competitive, noise-robust performance on the Feynman Symbolic Regression Database (FSReD), outperforming several strong baselines; discovers mechanistically aligned models with good accuracy-complexity trade-offs on real-world data; and derives compact, physically-motivated parametrizations for Parton Distribution Functions in a frontier high-energy physics application. IdeaSearchFitter is a specialized module within our broader iterated agent framework, IdeaSearch, which is publicly available at this https URL .

98. Counterfactual Identifiability via Dynamic Optimal Transport

Authors: Fabio De Sousa Ribeiro , Ainkaran Santhirasekaram , Ben Glocker
URL: https://arxiv.org/abs/2510.08294
Abstract:

We address the open question of counterfactual identification for high-dimensional multivariate outcomes from observational data. Pearl (2000) argues that counterfactuals must be identifiable (i.e., recoverable from the observed data distribution) to justify causal claims. A recent line of work on counterfactual inference shows promising results but lacks identification, undermining the causal validity of its estimates. To address this, we establish a foundation for multivariate counterfactual identification using continuous-time flows, including non-Markovian settings under standard criteria. We characterise the conditions under which flow matching yields a unique, monotone and rank-preserving counterfactual transport map with tools from dynamic optimal transport, ensuring consistent inference. Building on this, we validate the theory in controlled scenarios with counterfactual ground-truth and demonstrate improvements in axiomatic counterfactual soundness on real images.

99. Learning Neural Exposure Fields for View Synthesis

Authors: Michael Niemeyer , Fabian Manhardt , Marie-Julie Rakotosaona , Michael Oechsle , Christina Tsalicoglou , Keisuke Tateno , Jonathan T. Barron , Federico Tombari
URL: https://arxiv.org/abs/2510.08279
Abstract:

Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per-image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. At its core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need for post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks improving by over 55% over best-performing baselines.

100. A Distributed Emulation Environment for In-Memory Computing Systems

Authors: Eleni Bougioukou , Anastasios Petropoulos , Nikolaos Toulgaridis , Theodoros Chatzimichail , Theodore Antonakopoulos
URL: https://arxiv.org/abs/2510.08257
Abstract:

In-memory computing technology is used extensively in artificial intelligence devices due to lower power consumption and fast calculation of matrix-based functions. The development of such a device and its integration in a system takes a significant amount of time and requires the use of a real-time emulation environment, where various system aspects are analyzed, microcode is tested, and applications are deployed, even before the real chip is available. In this work, we present the architecture, the software development tools, and experimental results of a distributed and expandable emulation system for rapid prototyping of integrated circuits based on in-memory computing technologies. Presented experimental results demonstrate the usefulness of the proposed emulator.

101. Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

Authors: Jason Bohne , Pawel Polak , David Rosenberg , Brian Bloniarz , Gary Kazantsev
URL: https://arxiv.org/abs/2510.08256
Abstract:

Direct Preference Optimization (DPO) has recently emerged as a simple and effective alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with user preferences. However, existing DPO formulations rely on a single monolithic model, which limits their expressivity in multi-task settings and their adaptability to heterogeneous or diverse preference distributions. In this work, we propose Mix- and MoE-DPO, a framework that extends DPO with both soft mixture models and mixture-of-experts (MoE) architectures, using a stochastic variational inference approach. Our method introduces a latent-variable model over expert assignments and optimizes a variational evidence lower bound (ELBO), enabling stable and efficient learning of specialized expert policies from preference data. Mix- and MoE-DPO provides three key advantages over standard DPO: (i) generalization via universal function approximation through mixtures; (ii) reward and policy specialization through expert components tailored to distinct preference modes; and (iii) contextual alignment through input-dependent soft gating that enables user-specific mixture policies. Our framework supports both shared base architectures with expert-specific policy heads and fully independent expert models, allowing flexible trade-offs between parameter efficiency and specialization. We validate our approach on a variety of model sizes and multi-preference datasets, demonstrating that Mix- and MoE-DPO offers a powerful and scalable method for preference-based LLM alignment.
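
For context, the standard single-model DPO objective that this work extends is the negative log-sigmoid of a scaled difference of log-probability ratios between the chosen and rejected responses. The sketch below computes that per-pair loss from sequence log-probabilities; it is the baseline formulation only and does not implement the mixture or MoE variants. The function and argument names are illustrative.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (log-probabilities in nats)."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) in a numerically stable form, i.e. softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is ln 2; raising the chosen response's likelihood relative to the rejected one lowers it.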

102. Opponent Shaping in LLM Agents

Authors: Marta Emili Garcia Segura , Stephen Hailes , Mirco Musolesi
URL: https://arxiv.org/abs/2510.08255
Abstract:

Large Language Models (LLMs) are increasingly being deployed as autonomous agents in real-world environments. As these deployments scale, multi-agent interactions become inevitable, making it essential to understand strategic behavior in such systems. A central open question is whether LLM agents, like reinforcement learning agents, can shape the learning dynamics and influence the behavior of others through interaction alone. In this paper, we present the first investigation of opponent shaping (OS) with LLM-based agents. Existing OS algorithms cannot be directly applied to LLMs, as they require higher-order derivatives, face scalability constraints, or depend on architectural components that are absent in transformers. To address this gap, we introduce ShapeLLM, an adaptation of model-free OS methods tailored for transformer-based agents. Using ShapeLLM, we examine whether LLM agents can influence co-players’ learning dynamics across diverse game-theoretic environments. We demonstrate that LLM agents can successfully guide opponents toward exploitable equilibria in competitive games (Iterated Prisoner’s Dilemma, Matching Pennies, and Chicken) and promote coordination and improve collective welfare in cooperative games (Iterated Stag Hunt and a cooperative version of the Prisoner’s Dilemma). Our findings show that LLM agents can both shape and be shaped through interaction, establishing opponent shaping as a key dimension of multi-agent LLM research.

103. Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Authors: Jannek Ulm , Kevin Du , Vésteinn Snæbjarnarson
URL: https://arxiv.org/abs/2510.08245
Abstract:

Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.
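
A common way to sample from the difference between a stronger and a weaker model is to score each token by the gap in their log-probabilities, restricted to tokens the stronger model already finds plausible. The sketch below follows that general recipe; the plausibility threshold `alpha`, the weight `beta`, and the function names are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def contrastive_distribution(good_logits, bad_logits, alpha=0.1, beta=1.0):
    """Next-token distribution that amplifies what the good model prefers."""
    good_logp = log_softmax(np.asarray(good_logits, dtype=float))
    bad_logp = log_softmax(np.asarray(bad_logits, dtype=float))
    # Plausibility cut-off: keep tokens whose good-model probability is at
    # least alpha times that of the most likely token.
    keep = good_logp >= np.log(alpha) + np.max(good_logp)
    scores = np.where(keep, good_logp - beta * bad_logp, -np.inf)
    return np.exp(log_softmax(scores))

# Toy three-token vocabulary: the good model is split between tokens 0 and 1,
# while the bad model strongly prefers token 0.
probs = contrastive_distribution([2.0, 2.0, -5.0], [2.0, 0.0, 0.0])
```

In the toy example the implausible third token is masked out, and the token the weaker model over-predicts is down-weighted relative to the other plausible one.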

104. The Hidden Bias: A Study on Explicit and Implicit Political Stereotypes in Large Language Models

Authors: Konrad Löhr , Shuzhou Yuan , Michael Färber
URL: https://arxiv.org/abs/2510.08236
Abstract:

Large Language Models (LLMs) are increasingly integral to information dissemination and decision-making processes. Given their growing societal influence, understanding potential biases, particularly within the political domain, is crucial to prevent undue influence on public opinion and democratic processes. This work investigates political bias and stereotype propagation across eight prominent LLMs using the two-dimensional Political Compass Test (PCT). Initially, the PCT is employed to assess the inherent political leanings of these models. Subsequently, persona prompting with the PCT is used to explore explicit stereotypes across various social dimensions. In a final step, implicit stereotypes are uncovered by evaluating models with multilingual versions of the PCT. Key findings reveal a consistent left-leaning political alignment across all investigated models. Furthermore, while the nature and extent of stereotypes vary considerably between models, implicit stereotypes elicited through language variation are more pronounced than those identified via explicit persona prompting. Interestingly, for most models, implicit and explicit stereotypes show a notable alignment, suggesting a degree of transparency or “awareness” regarding their inherent biases. This study underscores the complex interplay of political bias and stereotypes in LLMs.

105. Expressive Value Learning for Scalable Offline Reinforcement Learning

Authors: Nicolas Espinosa-Dice , Kiante Brantley , Wen Sun
URL: https://arxiv.org/abs/2510.08218
Abstract:

Reinforcement learning (RL) is a powerful paradigm for learning to make sequences of decisions. However, RL has yet to be fully leveraged in robotics, principally due to its lack of scalability. Offline RL offers a promising avenue by training agents on large, diverse datasets, avoiding the costly real-world interactions of online RL. Scaling offline RL to increasingly complex datasets requires expressive generative models such as diffusion and flow matching. However, existing methods typically depend on either backpropagation through time (BPTT), which is computationally prohibitive, or policy distillation, which introduces compounding errors and limits scalability to larger base policies. In this paper, we consider the question of how to develop a scalable offline RL approach without relying on distillation or backpropagation through time. We introduce Expressive Value Learning for Offline Reinforcement Learning (EVOR): a scalable offline RL approach that integrates both expressive policies and expressive value functions. EVOR learns an optimal, regularized Q-function via flow matching during training. At inference time, EVOR performs policy extraction via rejection sampling against the expressive value function, enabling efficient optimization, regularization, and compute-scalable search without retraining. Empirically, we show that EVOR outperforms baselines on a diverse set of offline RL tasks, demonstrating the benefit of integrating expressive value learning into offline RL.
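
Policy extraction by sampling against a value function can be illustrated with a best-of-N sketch: draw several candidate actions from a base policy and keep the one the Q-function scores highest. This is one simple form of the idea and is not claimed to be EVOR's exact procedure; the sampler, the Q-function, and all names below are hypothetical stand-ins.

```python
import numpy as np

def extract_action(state, sample_actions, q_value, num_candidates=16, rng=None):
    """Pick the highest-valued action among samples from a base policy."""
    rng = np.random.default_rng(0) if rng is None else rng
    candidates = sample_actions(state, num_candidates, rng)
    scores = np.array([q_value(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]

# Toy stand-ins: a Gaussian base policy and a Q-function that prefers
# actions close to the state vector.
def toy_sampler(state, n, rng):
    return rng.normal(size=(n, state.shape[0]))

def toy_q(state, action):
    return -float(np.sum((action - state) ** 2))

state = np.array([0.5, -0.5])
best = extract_action(state, toy_sampler, toy_q, rng=np.random.default_rng(1))
```

Raising `num_candidates` spends more inference compute to search further from the base policy without any retraining, which is the trade-off the abstract describes as compute-scalable search.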

106. FuelCast: Benchmarking Tabular and Temporal Models for Ship Fuel Consumption

Authors: Justus Viga , Penelope Mueck , Alexander Löser , Torben Weis
URL: https://arxiv.org/abs/2510.08217
Abstract:

In the shipping industry, fuel consumption and emissions are critical factors due to their significant impact on economic efficiency and environmental sustainability. Accurate prediction of ship fuel consumption is essential for further optimization of maritime operations. However, heterogeneous methodologies and limited high-quality datasets hinder direct comparison of modeling approaches. This paper makes three key contributions: (1) we introduce and release a new dataset (this https URL) comprising operational and environmental data from three ships; (2) we define a standardized benchmark covering tabular regression and time-series regression; and (3) we investigate the application of in-context learning for ship consumption modeling using the TabPFN foundation model, a first in this domain to our knowledge. Our results demonstrate strong performance across all evaluated models, supporting the feasibility of onboard, data-driven fuel prediction. Models incorporating environmental conditions consistently outperform simple polynomial baselines relying solely on vessel speed. TabPFN slightly outperforms other techniques, highlighting the potential of foundation models with in-context learning capabilities for tabular prediction. Furthermore, including temporal context improves accuracy.

107. LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

Authors: XuHao Hu , Peng Wang , Xiaoya Lu , Dongrui Liu , Xuanjing Huang , Jing Shao
URL: https://arxiv.org/abs/2510.08211
Abstract:

Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. We further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty when only 10% of the user population is biased. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.

108. Memory Retrieval and Consolidation in Large Language Models through Function Tokens

Authors: Shaohua Zhang , Yuan Lin , Hang Li
URL: https://arxiv.org/abs/2510.08203
Abstract:

The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.

109. Sentiment Matters: An Analysis of 200 Human-SAV Interactions

Authors: Lirui Guo , Michael G. Burke , Wynita M. Griggs
URL: https://arxiv.org/abs/2510.08202
Abstract:

Shared Autonomous Vehicles (SAVs) are likely to become an important part of the transportation system, making effective human-SAV interactions an important area of research. This paper introduces a dataset of 200 human-SAV interactions to further this area of study. We present an open-source human-SAV conversational dataset, comprising both textual data (e.g., 2,136 human-SAV exchanges) and empirical data (e.g., post-interaction survey results on a range of psychological factors). The dataset’s utility is demonstrated through two benchmark case studies: First, using random forest modeling and chord diagrams, we identify key predictors of SAV acceptance and perceived service quality, highlighting the critical influence of response sentiment polarity (i.e., perceived positivity). Second, we benchmark the performance of an LLM-based sentiment analysis tool against the traditional lexicon-based TextBlob method. Results indicate that even simple zero-shot LLM prompts more closely align with user-reported sentiment, though limitations remain. This study provides novel insights for designing conversational SAV interfaces and establishes a foundation for further exploration into advanced sentiment modeling, adaptive user interactions, and multimodal conversational systems.

110. Robust Canonicalization through Bootstrapped Data Re-Alignment

Authors: Johann Schmidt , Sebastian Stober
URL: https://arxiv.org/abs/2510.08178
Abstract:

Fine-grained visual classification (FGVC) tasks, such as insect and bird identification, demand sensitivity to subtle visual cues while remaining robust to spatial transformations. A key challenge is handling geometric biases and noise, such as different orientations and scales of objects. Existing remedies rely on heavy data augmentation, which demands powerful models, or on equivariant architectures, which constrain expressivity and add cost. Canonicalization offers an alternative by shielding such biases from the downstream model. In practice, such functions are often obtained using canonicalization priors, which assume aligned training data. Unfortunately, real-world datasets never fulfill this assumption, causing the obtained canonicalizer to be brittle. We propose a bootstrapping algorithm that iteratively re-aligns training samples by progressively reducing variance and recovering the alignment assumption. We establish convergence guarantees under mild conditions for arbitrary compact groups, and show on four FGVC benchmarks that our method consistently outperforms equivariant and canonicalization baselines while performing on par with augmentation.

111. Leveraging Whisper Embeddings for Audio-based Lyrics Matching

Authors: Eleonora Mancini , Joan Serrà , Paolo Torroni , Yuki Mitsufuji
URL: https://arxiv.org/abs/2510.08176
Abstract:

Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.

Authors: Haolin Yang , Yuxing Long , Zhuoyuan Yu , Zihan Yang , Minghan Wang , Jiapeng Xu , Yihan Wang , Ziyan Yu , Wenzhe Cai , Lei Kang , Hao Dong
URL: https://arxiv.org/abs/2510.08173
Abstract:

Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents’ spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.

113. Quantum Agents for Algorithmic Discovery

Authors: Iordanis Kerenidis , El-Amine Cherrat
URL: https://arxiv.org/abs/2510.08159
Abstract:

We introduce quantum agents trained by episodic, reward-based reinforcement learning to autonomously rediscover several seminal quantum algorithms and protocols. In particular, our agents learn: efficient logarithmic-depth quantum circuits for the Quantum Fourier Transform; Grover’s search algorithm; optimal cheating strategies for strong coin flipping; and optimal winning strategies for the CHSH and other nonlocal games. The agents achieve these results directly through interaction, without prior access to known optimal solutions. This demonstrates the potential of quantum intelligence as a tool for algorithmic discovery, opening the way for the automated design of novel quantum algorithms and protocols.

114. DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

Authors: Elena Khasanova , Harsh Saini , Md Tahmid Rahman Laskar , Xue-Yong Fu , Cheng Chen , Shashi Bhushan TN
URL: https://arxiv.org/abs/2510.08152
Abstract:

The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model’s generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs’ domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.

115. AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents

Authors: Md Tahmid Rahman Laskar , Julien Bouvier Tremblay , Xue-Yong Fu , Cheng Chen , Shashi Bhushan TN
URL: https://arxiv.org/abs/2510.08149
Abstract:

The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.

116. Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning

Authors: Aman Sharma , Paras Chopra
URL: https://arxiv.org/abs/2510.08146
Abstract:

We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they’ve gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
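
The confidence signal described in this abstract can be illustrated with a small sketch: compute the Shannon entropy at each token position from its log-probabilities, average over the sequence, and stop reasoning once the average falls below a threshold. The function names, the renormalization of top-k probabilities, the use of a mean as the sequence-level aggregate, and the threshold value are all assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List


def token_entropy(logprobs: List[float]) -> float:
    """Shannon entropy (in nats) of one token position.

    `logprobs` holds the log-probabilities of the top-k candidate tokens.
    They are renormalized to sum to 1, which is an assumption made here
    because a top-k list usually does not cover the full vocabulary.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_entropy(per_token_logprobs: List[List[float]]) -> float:
    """Mean token entropy over a generated sequence."""
    if not per_token_logprobs:
        raise ValueError("need at least one token position")
    entropies = [token_entropy(lps) for lps in per_token_logprobs]
    return sum(entropies) / len(entropies)


def should_stop(per_token_logprobs: List[List[float]], threshold: float) -> bool:
    """Low entropy is read as high confidence, so reasoning can stop early."""
    return sequence_entropy(per_token_logprobs) < threshold
```

In this sketch the threshold is a free parameter. The abstract states that it varies from model to model and can be calibrated from a few examples, so any fixed value used with this code would need to be tuned per model.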

117. Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement

Authors: Chengzhi Li , Heyan Huang , Ping Jian , Zhen Yang , Yaning Tian
URL: https://arxiv.org/abs/2510.08138
Abstract:

Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential factors behind the phenomenon. We find that one of the primary reasons for the inconsistency in responses lies in the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention Sharpening (TCAS), which constructs an enhancement objective based on attention distinctions to enhance the model’s temporal resolution capability, thereby improving its temporal understanding logic consistency. Experimental results demonstrate that our method significantly enhances the temporal logic consistency of Video-LLMs. Further interpretability analyses reveal that our method indeed improves the temporal discriminability of attention heads, validating our conclusions. Additionally, our method achieves performance improvements in general video temporal grounding tasks, highlighting that temporal logic consistency is a bottleneck in temporal understanding. By enhancing consistency, our method drives significant progress in video temporal understanding.

118. Approximate Domain Unlearning for Vision-Language Models

Authors: Kodai Kawamura , Yuta Goto , Rintaro Yanagi , Hirokatsu Kataoka , Go Irie
URL: https://arxiv.org/abs/2510.08132
Abstract:

Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments show that our approach outperforms baselines built upon VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code: this https URL .

119. Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

Authors: Jasmina Gajcin , Erik Miehling , Rahul Nair , Elizabeth Daly , Radu Marinescu , Seshu Tirupathi
URL: https://arxiv.org/abs/2510.08120
Abstract:

Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.

120. Bayesian Decision Making around Experts

Authors: Daniel Jarne Ornia , Joel Dyer , Nicholas Bishop , Anisoara Calinescu , Michael Wooldridge
URL: https://arxiv.org/abs/2510.08113
Abstract:

Complex learning agents are increasingly deployed alongside existing experts, such as human operators or previously trained agents. However, it remains unclear how learners should optimally incorporate certain forms of expert data, which may differ in structure from the learner’s own action-outcome experiences. We study this problem in the context of Bayesian multi-armed bandits, considering: (i) offline settings, where the learner receives a dataset of outcomes from the expert’s optimal policy before interaction, and (ii) simultaneous settings, where the learner must choose at each step whether to update its beliefs based on its own experience, or based on the outcome simultaneously achieved by an expert. We formalize how expert data influences the learner’s posterior, and prove that pretraining on expert outcomes tightens information-theoretic regret bounds by the mutual information between the expert data and the optimal action. For the simultaneous setting, we propose an information-directed rule where the learner processes the data source that maximizes its one-step information gain about the optimal action. Finally, we propose strategies for how the learner can infer when to trust the expert and when not to, safeguarding the learner in cases where the expert is ineffective or compromised. By quantifying the value of expert data, our framework provides practical, information-theoretic algorithms for agents to intelligently decide when to learn from others.

121. VersionRAG: Version-Aware Retrieval-Augmented Generation for Evolving Documents

Authors: Daniel Huwiler , Kurt Stockinger , Jonathan Fürst
URL: https://arxiv.org/abs/2510.08109
Abstract:

Retrieval-Augmented Generation (RAG) systems fail when documents evolve through versioning, a ubiquitous characteristic of technical documentation. Existing approaches achieve only 58-64% accuracy on version-sensitive questions, retrieving semantically similar content without temporal validity checks. We present VersionRAG, a version-aware RAG framework that explicitly models document evolution through a hierarchical graph structure capturing version sequences, content boundaries, and changes between document states. During retrieval, VersionRAG routes queries through specialized paths based on intent classification, enabling precise version-aware filtering and change tracking. On our VersionQA benchmark (100 manually curated questions across 34 versioned technical documents), VersionRAG achieves 90% accuracy, outperforming naive RAG (58%) and GraphRAG (64%). VersionRAG reaches 60% accuracy on implicit change detection where baselines fail (0-10%), demonstrating its ability to track undocumented modifications. Additionally, VersionRAG requires 97% fewer tokens during indexing than GraphRAG, making it practical for large-scale deployment. Our work establishes versioned document QA as a distinct task and provides both a solution and benchmark for future research.
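
The failure mode this abstract describes, retrieving semantically similar content from the wrong version, can be shown with a toy sketch. It covers only a version filter applied before similarity ranking and does not attempt the hierarchical graph or the intent-based routing that the abstract mentions. The `Chunk` type, the `retrieve` function, the similarity scores, and the example texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    version: str       # document version the chunk belongs to
    similarity: float  # precomputed query similarity (hypothetical values)


def retrieve(
    chunks: List[Chunk], top_k: int = 1, version: Optional[str] = None
) -> List[Chunk]:
    """Rank chunks by similarity, optionally restricted to one version.

    With version=None this behaves like version-unaware retrieval and can
    return content from an outdated document state.
    """
    pool = [c for c in chunks if version is None or c.version == version]
    return sorted(pool, key=lambda c: c.similarity, reverse=True)[:top_k]


chunks = [
    Chunk("The default timeout is 30 seconds.", version="1.0", similarity=0.92),
    Chunk("The default timeout is 60 seconds.", version="2.0", similarity=0.90),
]
```

Here `retrieve(chunks)` returns the version 1.0 chunk because it scores slightly higher, while `retrieve(chunks, version="2.0")` returns the chunk that is actually valid for a question about version 2.0.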

122. Development of Mental Models in Human-AI Collaboration: A Conceptual Framework

Authors: Joshua Holstein , Gerhard Satzger
URL: https://arxiv.org/abs/2510.08104
Abstract:

Artificial intelligence has become integral to organizational decision-making and while research has explored many facets of this human-AI collaboration, the focus has mainly been on designing the AI agent(s) and the way the collaboration is set up - generally assuming a human decision-maker to be “fixed”. However, it has largely been neglected that decision-makers’ mental models evolve through their continuous interaction with AI systems. This paper addresses this gap by conceptualizing how the design of human-AI collaboration influences the development of three complementary and interdependent mental models necessary for this collaboration. We develop an integrated socio-technical framework that identifies the mechanisms driving the mental model evolution: data contextualization, reasoning transparency, and performance feedback. Our work advances human-AI collaboration literature through three key contributions: introducing three distinct mental models (domain, information processing, complementarity-awareness); recognizing the dynamic nature of mental models; and establishing mechanisms that guide the purposeful design of effective human-AI collaboration.

123. Lossless Vocabulary Reduction for Auto-Regressive Language Models

Authors: Daiki Chijiwa , Taku Hasegawa , Kyosuke Nishida , Shin’ya Yamaguchi , Tomoya Ohba , Tamao Sakao , Susumu Takeuchi
URL: https://arxiv.org/abs/2510.08102
Abstract:

Tokenization – the process of decomposing a given text into a sequence of subwords called tokens – is one of the key components in the development of language models. In particular, auto-regressive language models generate text token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, for example in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.

124. The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Authors: Sherzod Hakimov , Roland Bernard , Tim Leiber , Karl Osswald , Kristina Richert , Ruilin Yang , Raffaella Bernardi , David Schlangen
URL: https://arxiv.org/abs/2510.08098
Abstract:

Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5’s performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.

125. Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility

Authors: Shramay Palta , Peter Rankel , Sarah Wiegreffe , Rachel Rudinger
URL: https://arxiv.org/abs/2510.08091
Abstract:

We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are “experts” (i.e., common sense), LLMs have the potential to exert considerable influence on people’s beliefs.

126. A Novel Ensemble Learning Approach for Enhanced IoT Attack Detection: Redefining Security Paradigms in Connected Systems

Authors: Hikmat A. M. Abdeljaber , Md. Alamgir Hossain , Sultan Ahmad , Ahmed Alsanad , Md Alimul Haque , Sudan Jha , Jabeen Nazeer
URL: https://arxiv.org/abs/2510.08084
Abstract:

The rapid expansion of Internet of Things (IoT) devices has transformed industries and daily life by enabling widespread connectivity and data exchange. However, this increased interconnection has introduced serious security vulnerabilities, making IoT systems more exposed to sophisticated cyber attacks. This study presents a novel ensemble learning architecture designed to improve IoT attack detection. The proposed approach applies advanced machine learning techniques, specifically the Extra Trees Classifier, along with thorough preprocessing and hyperparameter optimization. It is evaluated on several benchmark datasets including CICIoT2023, IoTID20, BotNeTIoT-L01, ToN_IoT, N-BaIoT, and BoT-IoT. The results show excellent performance, achieving high recall, accuracy, and precision with very low error rates. These outcomes demonstrate the model’s efficiency and superiority compared to existing approaches, providing an effective and scalable method for securing IoT environments. This research establishes a solid foundation for future progress in protecting connected devices from evolving cyber threats.

127. An Adaptive Multi Agent Bitcoin Trading System

Authors: Aadi Singhi
URL: https://arxiv.org/abs/2510.08068
Abstract:

This paper presents a Multi Agent Bitcoin Trading system that utilizes Large Language Models (LLMs) for alpha generation and portfolio management in the cryptocurrencies market. Unlike equities, cryptocurrencies exhibit extreme volatility and are heavily influenced by rapidly shifting market sentiments and regulatory announcements, making them difficult to model using static regression models or neural networks trained solely on historical data [53]. The proposed framework overcomes this by structuring LLMs into specialised agents for technical analysis, sentiment evaluation, decision-making, and performance reflection. The system improves over time through a novel verbal feedback mechanism where a Reflect agent provides daily and weekly natural-language critiques of trading decisions. These textual evaluations are then injected into future prompts, allowing the system to adjust indicator priorities, sentiment weights, and allocation logic without parameter updates or finetuning. Back-testing on Bitcoin price data from July 2024 to April 2025 shows consistent outperformance across market regimes: the Quantitative agent delivered over 30% higher returns in bullish phases and 15% overall gains versus buy-and-hold, while the sentiment-driven agent turned sideways markets from a small loss into a gain of over 100%. Adding weekly feedback further improved total performance by 31% and reduced bearish losses by 10%. The results demonstrate that verbal feedback represents a new, scalable, and low-cost method of tuning LLMs for financial goals.

128. Attribution-by-design: Ensuring Inference-Time Provenance in Generative Music Systems

Authors: Fabio Morreale , Wiebke Hutiri , Joan Serrà , Alice Xiang , Yuki Mitsufuji
URL: https://arxiv.org/abs/2510.08062
Abstract:

The rise of AI-generated music is diluting royalty pools and revealing structural flaws in existing remuneration frameworks, challenging the well-established artist compensation systems in the music industry. Existing compensation solutions, such as piecemeal licensing agreements, lack scalability and technical rigour, while current data attribution mechanisms provide only uncertain estimates and are rarely implemented in practice. This paper introduces a framework for a generative music infrastructure centred on direct attribution, transparent royalty distribution, and granular control for artists and rights’ holders. We distinguish ontologically between the training set and the inference set, which allows us to propose two complementary forms of attribution: training-time attribution and inference-time attribution. We here favour inference-time attribution, as it enables direct, verifiable compensation whenever an artist’s catalogue is used to condition a generated output. Besides, users benefit from the ability to condition generations on specific songs and receive transparent information about attribution and permitted usage. Our approach offers an ethical and practical solution to the pressing need for robust compensation mechanisms in the era of AI-generated music, ensuring that provenance and fairness are embedded at the core of generative systems.

129. FedDTRE: Federated Dialogue Generation Models Powered by Trustworthiness Evaluation

Authors: Shule Lu , Lingxiang Wang , Sijia Wen , Ziwei Wang , Hainan Zhang
URL: https://arxiv.org/abs/2510.08058
Abstract:

With the rapid development of artificial intelligence, dialogue systems have become a prominent form of human-computer interaction. However, traditional centralized or fully local training approaches face challenges in balancing privacy preservation and personalization due to data privacy concerns and heterogeneous device capabilities. Federated learning, as a representative distributed paradigm, offers a promising solution. However, existing methods often suffer from overfitting under limited client data and tend to forget global information after multiple training rounds, leading to poor generalization. To address these issues, we propose FedDTRE, a Federated adaptive aggregation strategy for Dialogue generation based on Trustworthiness Evaluation. Instead of directly replacing local models with the global model, FedDTRE leverages trustworthiness scores of both global and local models on a fairness-oriented evaluation dataset to dynamically regulate the global model’s contribution during local updates. Experimental results demonstrate that FedDTRE can improve dialogue model performance and enhance the quality of dialogue generation.

130. A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Authors: Congming Zheng , Jiachen Zhu , Zhuoying Ou , Yuxiang Chen , Kangning Zhang , Rong Shan , Zeyu Zheng , Mengyue Yang , Jianghao Lin , Yong Yu , Weinan Zhang
URL: https://arxiv.org/abs/2510.08049
Abstract:

Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.

131. TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

Authors: Jianhui Yang , Yiming Jin , Pengkun Jiao , Chenhe Dong , Zerui Huang , Shaowei Yao , Xiaojiang Zhou , Dan Ou , Haihong Tang
URL: https://arxiv.org/abs/2510.08048
Abstract:

Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.

132. Verifying Graph Neural Networks with Readout is Intractable

Authors: Artem Chernobrovkin , Marco Sälzer , François Schwarzentruber , Nicolas Troquard
URL: https://arxiv.org/abs/2510.08045
Abstract:

We introduce a logical language for reasoning about quantized aggregate-combine graph neural networks with global readout (ACR-GNNs). We provide a logical characterization and use it to prove that verification tasks for quantized GNNs with readout are (co)NEXPTIME-complete. This result implies that the verification of quantized GNNs is computationally intractable, prompting substantial research efforts toward ensuring the safety of GNN-based systems. We also experimentally demonstrate that quantized ACR-GNN models are lightweight while maintaining good accuracy and generalization capabilities with respect to non-quantized models.

133. Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation

Authors: Shiyuan Yin , Chenjia Bai , Zihao Zhang , Junwei Jin , Xinxin Zhang , Chi Zhang , Xuelong Li
URL: https://arxiv.org/abs/2510.08044
Abstract:

Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. While researchers have explored uncertainty estimation to improve the reliability of LLM-based planning, existing studies have not sufficiently differentiated between epistemic and intrinsic uncertainty, limiting the effectiveness of uncertainty estimation. In this paper, we present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately. Furthermore, epistemic uncertainty is subdivided into task clarity and task familiarity for more accurate evaluation. The overall uncertainty assessments are obtained using random network distillation and multi-layer perceptron regression heads driven by LLM features. We validated our approach in two distinct experimental settings: kitchen manipulation and tabletop rearrangement experiments. The results show that, compared to existing methods, our approach yields uncertainty estimates that are more closely aligned with the actual execution outcomes.

134. MRI-derived quantification of hepatic vessel-to-volume ratios in chronic liver disease using a deep learning approach

Authors: Alexander Herold , Daniel Sobotka , Lucian Beer , Nina Bastati , Sarah Poetter-Lang , Michael Weber , Thomas Reiberger , Mattias Mandorfer , Georg Semmler , Benedikt Simbrunner , Barbara D. Wichtmann , Sami A. Ba-Ssalamah , Michael Trauner , Ahmed Ba-Ssalamah , Georg Langs
URL: https://arxiv.org/abs/2510.08039
Abstract:

Background: We aimed to quantify hepatic vessel volumes across chronic liver disease stages and healthy controls using deep learning-based magnetic resonance imaging (MRI) analysis, and assess correlations with biomarkers for liver (dys)function and fibrosis/portal hypertension. Methods: We retrospectively assessed healthy controls, non-advanced and advanced chronic liver disease (ACLD) patients using a 3D U-Net model for hepatic vessel segmentation on portal venous phase gadoxetic acid-enhanced 3-T MRI. Total (TVVR), hepatic (HVVR), and intrahepatic portal vein-to-volume ratios (PVVR) were compared between groups and correlated with: albumin-bilirubin (ALBI) and model for end-stage liver disease-sodium (MELD-Na) score, and fibrosis/portal hypertension (Fibrosis-4 [FIB-4] score, liver stiffness measurement [LSM], hepatic venous pressure gradient [HVPG], platelet count [PLT], and spleen volume). Results: We included 197 subjects, aged 54.9 $\pm$ 13.8 years (mean $\pm$ standard deviation), 111 males (56.3\%): 35 healthy controls, 44 non-ACLD, and 118 ACLD patients. TVVR and HVVR were highest in controls (3.9; 2.1), intermediate in non-ACLD (2.8; 1.7), and lowest in ACLD patients (2.3; 1.0) ($p \leq 0.001$). PVVR was reduced in both non-ACLD and ACLD patients (both 1.2) compared to controls (1.7) ($p \leq 0.001$), but showed no difference between CLD groups ($p = 0.999$). HVVR correlated significantly and inversely with FIB-4, ALBI, MELD-Na, LSM, and spleen volume ($\rho$ ranging from -0.27 to -0.40), and directly with PLT ($\rho = 0.36$). TVVR and PVVR showed similar but weaker correlations. Conclusions: Deep learning-based hepatic vessel volumetry demonstrated differences between healthy liver and chronic liver disease stages and showed correlations with established markers of disease severity.

135. FastUMI-100K: Advancing Data-driven Robotic Manipulation with a Large-scale UMI-style Dataset

Authors: Kehui Liu , Zhongjie Jia , Yang Li , Zhaxizhuoma , Pengan Chen , Song Liu , Xin Liu , Pingrui Zhang , Haoming Song , Xinyi Ye , Nieqing Cao , Zhigang Wang , Jia Zeng , Dong Wang , Yan Ding , Bin Zhao , Xuelong Li
URL: https://arxiv.org/abs/2510.08022
Abstract:

Data-driven robotic manipulation learning depends on large-scale, high-quality expert demonstration datasets. However, existing datasets, which primarily rely on human teleoperated robot collection, are limited in terms of scalability, trajectory smoothness, and applicability across different robotic embodiments in real-world environments. In this paper, we present FastUMI-100K, a large-scale UMI-style multimodal demonstration dataset, designed to overcome these limitations and meet the growing complexity of real-world manipulation tasks. Collected by FastUMI, a novel robotic system featuring a modular, hardware-decoupled mechanical design and an integrated lightweight tracking system, FastUMI-100K offers a more scalable, flexible, and adaptable solution to fulfill the diverse requirements of real-world robot demonstration data. Specifically, FastUMI-100K contains over 100K demonstration trajectories collected across representative household environments, covering 54 tasks and hundreds of object types. Our dataset integrates multimodal streams, including end-effector states, multi-view wrist-mounted fisheye images, and textual annotations. Each trajectory has a length ranging from 120 to 500 frames. Experimental results demonstrate that FastUMI-100K enables high policy success rates across various baseline algorithms, confirming its robustness, adaptability, and real-world applicability for solving complex, dynamic manipulation challenges. The source code and dataset will be released at this https URL.

136. Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses

Authors: Stanisław Pawlak , Jan Dubiński , Daniel Marczak , Bartłomiej Twardowski
URL: https://arxiv.org/abs/2510.08016
Abstract:

Model merging (MM) recently emerged as an effective method for combining large deep learning models. However, it poses significant security risks. Recent research shows that it is highly susceptible to backdoor attacks, which introduce a hidden trigger into a single fine-tuned model instance that allows the adversary to control the output of the final merged model at inference time. In this work, we propose a simple framework for understanding backdoor attacks by treating the attack itself as a task vector. $Backdoor\ Vector\ (BV)$ is calculated as the difference between the weights of a fine-tuned backdoored model and a fine-tuned clean model. BVs offer new insights into how attacks work and a more effective framework for measuring their similarity and transferability. Furthermore, we propose a novel method that enhances backdoor resilience through merging dubbed $Sparse\ Backdoor\ Vector\ (SBV)$ that combines multiple attacks into a single one. We identify the core vulnerability behind backdoor threats in MM: $inherent\ triggers$ that exploit adversarial weaknesses in the base model. To counter this, we propose $Injection\ BV\ Subtraction\ (IBVS)$ - an assumption-free defense against backdoors in MM. Our results show that SBVs surpass prior attacks and are the first method to leverage merging to improve backdoor effectiveness. At the same time, IBVS provides a lightweight, general defense that remains effective even when the backdoor threat is entirely unknown.
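
The Backdoor Vector definition above (backdoored weights minus clean weights) can be illustrated with a minimal sketch. The dictionary-of-lists weight format, the parameter name `layer.weight`, and the `subtract_vector` helper are assumptions made for illustration; the subtraction step only demonstrates generic task arithmetic and is not the paper's IBVS defense.

```python
# Toy illustration of a Backdoor Vector (BV) as a task vector.
# Weights are plain dicts: parameter name -> list of floats (an assumed format).

def backdoor_vector(backdoored, clean):
    """BV = weights of fine-tuned backdoored model - weights of fine-tuned clean model."""
    return {
        name: [b - c for b, c in zip(backdoored[name], clean[name])]
        for name in clean
    }

def subtract_vector(weights, vector, scale=1.0):
    """Hypothetical helper: remove a scaled vector from a set of weights."""
    return {
        name: [w - scale * v for w, v in zip(weights[name], vector[name])]
        for name in weights
    }

# Made-up numbers, chosen only to show the arithmetic.
clean = {"layer.weight": [0.10, -0.20, 0.30]}
backdoored = {"layer.weight": [0.15, -0.20, 0.10]}

bv = backdoor_vector(backdoored, clean)
restored = subtract_vector(backdoored, bv)
```

In this toy setting, subtracting the full BV from the backdoored weights recovers the clean weights up to floating-point error, which is the basic property that makes task-vector arithmetic a convenient lens on attacks.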

137. Past, Present, and Future of Bug Tracking in the Generative AI Era

Authors: Utku Boran Torun , Mehmet Taha Demircan , Mahmut Furkan Gön , Eray Tüzün
URL: https://arxiv.org/abs/2510.08005
Abstract:

Traditional bug tracking systems rely heavily on manual reporting, reproduction, triaging, and resolution, each carried out by different stakeholders such as end users, customer support, developers, and testers. This division of responsibilities requires significant coordination and widens the communication gap between non-technical users and technical teams, slowing the process from bug discovery to resolution. Moreover, current systems are highly asynchronous; users often wait hours or days for a first response, delaying fixes and contributing to frustration. This paper examines the evolution of bug tracking, from early paper-based reporting to today’s web-based and SaaS platforms. Building on this trajectory, we propose an AI-powered bug tracking framework that augments existing tools with intelligent, large language model (LLM)-driven automation. Our framework addresses two main challenges: reducing time-to-fix and minimizing human overhead. Users report issues in natural language, while AI agents refine reports, attempt reproduction, and request missing details. Reports are then classified, invalid ones resolved through no-code fixes, and valid ones localized and assigned to developers. LLMs also generate candidate patches, with human oversight ensuring correctness. By integrating automation into each phase, our framework accelerates response times, improves collaboration, and strengthens software maintenance practices for a more efficient, user-centric future.

138. Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Authors: Cheng Yang , Xuemeng Yang , Licheng Wen , Daocheng Fu , Jianbiao Mei , Rong Wu , Pinlong Cai , Yufan Shen , Nianchen Deng , Botian Shi , Yu Qiao , Haifeng Li
URL: https://arxiv.org/abs/2510.08002
Abstract:

Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Extensive experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.

139. Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge

Authors: Watcharapong Timklaypachara , Monrada Chiewhawan , Nopporn Lekuthai , Titipat Achakulvisut
URL: https://arxiv.org/abs/2510.07993
Abstract:

Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy’s MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3\% while limiting precision loss to -2.8\% and BLEU-4 reduction to -10.9\%. Profile-informed stylistic refinement yields 40–48\% gains in BLEU scores and 25–27\% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.

140. Fewer Weights, More Problems: A Practical Attack on LLM Pruning

Authors: Kazuki Egashira , Robin Staab , Thibaud Gloaguen , Mark Vero , Martin Vechev
URL: https://arxiv.org/abs/2510.07985
Abstract:

Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods in vLLM (Magnitude, Wanda, and SparseGPT) is applied, the attacked model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
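
The core intuition (behavior carried by weights unlikely to be pruned, cancelled by weights likely to be pruned) can be seen in a deliberately tiny toy. This sketch assumes plain magnitude pruning on a four-weight "model" whose output is just the sum of its weights; the numbers and the `magnitude_prune` helper are illustrative assumptions and do not reproduce the paper's method or vLLM's pruning code.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

# One large weight carries an offset; three small weights cancel it out.
weights = [0.9, -0.3, -0.3, -0.3]

dense_output = sum(weights)                      # approximately 0: looks benign
pruned_weights = magnitude_prune(weights, 0.75)  # removes the three small weights
pruned_output = sum(pruned_weights)              # the offset is now exposed
```

Before pruning the contributions cancel; after pruning only the large weight survives, so the output changes even though no weight was modified by the person doing the pruning.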

141. Is Architectural Complexity Always the Answer? A Case Study on SwinIR vs. an Efficient CNN

Authors: Chandresh Sutariya , Nitin Singh
URL: https://arxiv.org/abs/2510.07984
Abstract:

The simultaneous restoration of high-frequency details and suppression of severe noise in low-light imagery presents a significant and persistent challenge in computer vision. While large-scale Transformer models like SwinIR have set the state of the art in performance, their high computational cost can be a barrier for practical applications. This paper investigates the critical trade-off between performance and efficiency by comparing the state-of-the-art SwinIR model against a standard, lightweight Convolutional Neural Network (CNN) on this challenging task. Our experimental results reveal a nuanced but important finding. While the Transformer-based SwinIR model achieves a higher peak performance, with a Peak Signal-to-Noise Ratio (PSNR) of 39.03 dB, the lightweight CNN delivers a surprisingly competitive PSNR of 37.4 dB. Crucially, the CNN reached this performance after converging in only 10 epochs of training, whereas the more complex SwinIR model required 132 epochs. This efficiency is further underscored by the model’s size; the CNN is over 55 times smaller than SwinIR. This work demonstrates that a standard CNN can provide a near state-of-the-art result with significantly lower computational overhead, presenting a compelling case for its use in real-world scenarios where resource constraints are a primary concern.
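
To put the reported PSNR figures in context, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE). The flat pixel lists, the 8-bit peak value of 255, and the helper names are assumptions for illustration; the abstract does not state the evaluation setup.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio(psnr_gap_db):
    """Factor by which MSE grows when PSNR drops by `psnr_gap_db` decibels."""
    return 10 ** (psnr_gap_db / 10)

# Gap between the two PSNR values quoted in the abstract.
gap = 39.03 - 37.4
ratio = mse_ratio(gap)
```

Under this definition, the 1.63 dB gap corresponds to the CNN having a mean squared error roughly 1.46 times that of SwinIR, which is one way to read "competitive" alongside the size and training-time savings.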

142. ZeroCard: Cardinality Estimation with Zero Dependence on Target Databases – No Data, No Query, No Retraining

Authors: Xianghong Xu , Rong Kang , Xiao He , Lei Zhang , Jianjun Chen , Tieying Zhang
URL: https://arxiv.org/abs/2510.07983
Abstract:

Cardinality estimation is a fundamental task in database systems and plays a critical role in query optimization. Despite significant advances in learning-based cardinality estimation methods, most existing approaches remain difficult to generalize to new datasets due to their strong dependence on raw data or queries, thus limiting their practicality in real scenarios. To overcome these challenges, we argue that semantics in the schema may benefit cardinality estimation, and leveraging such semantics may alleviate these dependencies. To this end, we introduce ZeroCard, the first semantics-driven cardinality estimation method that can be applied without any dependence on raw data access, query logs, or retraining on the target database. Specifically, we propose to predict data distributions using schema semantics, thereby avoiding raw data dependence. Then, we introduce a query template-agnostic representation method to alleviate query dependence. Finally, we construct a large-scale query dataset derived from real-world tables and pretrain ZeroCard on it, enabling it to learn cardinality from schema semantics and predicate representations. After pretraining, ZeroCard’s parameters can be frozen and applied in an off-the-shelf manner. We conduct extensive experiments to demonstrate the distinct advantages of ZeroCard and show its practical applications in query optimization. Its zero-dependence property significantly facilitates deployment in real-world scenarios.

143. Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

Authors: Qinglun Li , Yingqi Liu , Miao Zhang , Xiaochun Cao , Quanjun Yin , Li Shen
URL: https://arxiv.org/abs/2510.07980
Abstract:

Decentralized training removes the centralized server, making it a communication-efficient approach that can significantly improve training efficiency, but it often suffers from degraded performance compared to centralized training. Multi-Gossip Steps (MGS) serve as a simple yet effective bridge between decentralized and centralized training, significantly reducing experimental performance gaps. However, the theoretical reasons for its effectiveness and whether this gap can be fully eliminated by MGS remain open questions. In this paper, we derive upper bounds on the generalization error and excess error of MGS using stability analysis, systematically answering these two key questions. (1) Optimization Error Reduction: MGS reduces the optimization error bound at an exponential rate, thereby exponentially tightening the generalization error bound and enabling convergence to better solutions. (2) Gap to Centralization: Even as the number of gossip steps approaches infinity, a non-negligible gap in generalization error remains compared to centralized mini-batch SGD ($\mathcal{O}(T^{\frac{c\beta}{c\beta +1}}/{n m})$ in centralized and $\mathcal{O}(T^{\frac{2c\beta}{2c\beta +2}}/{n m^{\frac{1}{2c\beta +2}}})$ in decentralized). Furthermore, we provide the first unified analysis of how factors like learning rate, data heterogeneity, node count, per-node sample size, and communication topology impact the generalization of MGS under non-convex settings without the bounded gradients assumption, filling a critical theoretical gap in decentralized training. Finally, promising experiments on CIFAR datasets support our theoretical findings.
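
As a rough picture of what multiple gossip steps do, the sketch below repeatedly averages scalar node values with a mixing matrix and tracks how far the nodes are from consensus. The 4-node ring, the specific doubly stochastic matrix `W`, and the scalar "parameters" are assumptions chosen for illustration; the paper's analysis concerns full training, not this toy averaging.

```python
def gossip_step(values, mixing):
    """One gossip round: each node takes a weighted average of its neighbours' values."""
    n = len(values)
    return [sum(mixing[i][j] * values[j] for j in range(n)) for i in range(n)]

def multi_gossip(values, mixing, steps):
    """Apply `steps` consecutive gossip rounds."""
    for _ in range(steps):
        values = gossip_step(values, mixing)
    return values

def consensus_error(values):
    """Largest deviation of any node from the network-wide mean."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)

# Ring of 4 nodes with a symmetric, doubly stochastic mixing matrix (assumed).
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]
x = [1.0, 2.0, 3.0, 4.0]

after_one = multi_gossip(x, W, 1)
after_five = multi_gossip(x, W, 5)
```

Each extra round shrinks the consensus error geometrically while leaving the network mean unchanged, which gives some intuition for why more gossip steps move decentralized training closer to the centralized case.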

144. Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation

Authors: Mingyang Sun , Jiude Wei , Qichen He , Donglin Wang , Cewu Lu , Jianhua Sun
URL: https://arxiv.org/abs/2510.07975
Abstract:

Enabling robots to perform precise and generalized manipulation in unstructured environments remains a fundamental challenge in embodied AI. While Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning, a significant gap persists between their high-level understanding and the precise physical execution required for real-world manipulation. To bridge this “semantic-to-physical” gap, we introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts (EAC): mathematically defined blueprints that encode object affordances, geometric constraints, and semantics of manipulation. Our approach integrates a structured policy scaffolding pipeline that turns natural language instructions and visual information into an instantiated EAC, from which we derive grasp poses and force directions and plan physically feasible motion trajectories for robot execution. GRACE thus provides a unified and interpretable interface between high-level instruction understanding and low-level robot control, effectively enabling precise and generalizable manipulation through semantic-physical grounding. Extensive experiments demonstrate that GRACE achieves strong zero-shot generalization across a variety of articulated objects in both simulated and real-world environments, without requiring task-specific training.

Authors: Jialu Du , Guiyang Hou , Yihui Fu , Chen Wu , Wenqi Zhang , Yongliang Shen , Weiming Lu
URL: https://arxiv.org/abs/2510.07974
Abstract:

While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through detailed analysis of DeepSeek-R1’s reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like “tricky” and “confused” when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents’ subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.

146. LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Authors: Jingyuan Wang , Yankai Chen , Zhonghang Li , Chao Huang
URL: https://arxiv.org/abs/2510.07962
Abstract:

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter’s unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert’s advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: this https URL

147. A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG

Authors: Emilio Estevan , María Sierra-Torralba , Eduardo López-Larraz , Luis Montesano
URL: https://arxiv.org/abs/2510.07960
Abstract:

Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.

148. DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

Authors: Alexander Rubinstein , Benjamin Raible , Martin Gubri , Seong Joon Oh
URL: https://arxiv.org/abs/2510.07959
Abstract:

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. DISCO shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: this https URL.
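
The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary correctness matrix (models by samples) and uses the variance of correctness across models as the disagreement statistic, which the abstract does not specify. The function names are hypothetical.

```python
from typing import List


def disagreement_scores(correct: List[List[int]]) -> List[float]:
    """Score each sample by how evenly the models split on it.

    correct[m][i] is 1 if model m answers sample i correctly, else 0.
    The score is the Bernoulli variance p * (1 - p), an assumed stand-in
    for the paper's disagreement measure.
    """
    n_models = len(correct)
    n_samples = len(correct[0])
    scores = []
    for i in range(n_samples):
        p = sum(correct[m][i] for m in range(n_models)) / n_models
        scores.append(p * (1.0 - p))  # largest when half the models are right
    return scores


def select_top_k(correct: List[List[int]], k: int) -> List[int]:
    """Greedy, sample-wise selection: indices of the k highest-scoring samples."""
    scores = disagreement_scores(correct)
    # Ties are broken by the lower sample index to keep the result deterministic.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]
```

Each sample is scored independently of the others, so no clustering or pairwise comparison between samples is involved.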

149. A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Authors: Fengji Zhang , Xinyao Niu , Chengyang Ying , Guancheng Lin , Zhongkai Hao , Zhou Fan , Chengen Huang , Jacky Keung , Bei Chen , Junyang Lin
URL: https://arxiv.org/abs/2510.07958
Abstract:

Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at this https URL
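
To make the idea of a reward that accommodates multiple answers concrete, here is a minimal sketch of a set-based F1 between predicted and reference answers. The exact AnsF1 definition is not given in the abstract, so the normalization and exact-match criterion below are assumptions, and the function names are hypothetical.

```python
from typing import Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic normalization)."""
    return " ".join(answer.lower().split())


def ans_f1(predicted: Iterable[str], references: Iterable[str]) -> float:
    """F1 between the set of predicted answers and the set of valid answers."""
    pred = {normalize(a) for a in predicted}
    gold = {normalize(a) for a in references}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)  # fraction of predicted answers that are valid
    recall = hits / len(gold)     # fraction of valid answers that were found
    return 2 * precision * recall / (precision + recall)
```

Under this formulation, a model that returns one of two valid answers gets partial credit, and one that returns both gets full credit.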

150. A Large-scale Dataset for Robust Complex Anime Scene Text Detection

Authors: Ziyi Dong , Yurui Zhang , Changmao Li , Naomi Rue Golding , Qing Long
URL: https://arxiv.org/abs/2510.07951
Abstract:

Current text detection datasets primarily target natural or document scenes, where text typically appears in regular fonts and shapes, monotonous colors, and orderly layouts, and is usually arranged along straight or curved lines. However, these characteristics differ significantly from anime scenes, where text is often diverse in style, irregularly arranged, and easily confused with complex visual elements such as symbols and decorative patterns. Text in anime scenes also includes a large number of handwritten and stylized fonts. Motivated by this gap, we introduce AnimeText, a large-scale dataset containing 735K images and 4.2M annotated text blocks. It features hierarchical annotations and hard negative samples tailored for anime scenarios. To evaluate the robustness of AnimeText in complex anime scenes, we conducted cross-dataset benchmarking using state-of-the-art text detection methods. Experimental results demonstrate that models trained on AnimeText outperform those trained on existing datasets in anime scene text detection tasks. AnimeText on HuggingFace: this https URL

151. TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Authors: Leigang Qu , Ziyang Wang , Na Zheng , Wenjie Wang , Liqiang Nie , Tat-Seng Chua
URL: https://arxiv.org/abs/2510.07940
Abstract:

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relation). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting, and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations, such as insert, read, update, and delete. Notably, we found that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and Vbench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework to achieve cross-modal alignment for compositional video generation on the fly.

152. STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models

Authors: Kyumin Lee , Minjin Jeon , Sanghwan Jang , Hwanjo Yu
URL: https://arxiv.org/abs/2510.07923
Abstract:

Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.

153. Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation

Authors: Fanwei Zhua , Jiaxuan He , Xiaoxiao Chen , Zulong Chen , Quan Lu , Chenrui Mei
URL: https://arxiv.org/abs/2510.07912
Abstract:

Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.

154. MMM: Quantum-Chemical Molecular Representation Learning for Combinatorial Drug Recommendation

Authors: Chongmyung Kwon , Yujin Kim , Seoeun Park , Yunji Lee , Charmgil Hong
URL: https://arxiv.org/abs/2510.07910
Abstract:

Drug recommendation is an essential task in machine learning-based clinical decision support systems. However, the risk of drug-drug interactions (DDI) between co-prescribed medications remains a significant challenge. Previous studies have used graph neural networks (GNNs) to represent drug structures. However, their simplified discrete forms cannot fully capture molecular binding affinity and reactivity. Therefore, we propose Multimodal DDI Prediction with Molecular Electron Localization Function (ELF) Maps (MMM), a novel framework that integrates three-dimensional (3D) quantum-chemical information into drug representation learning. It generates 3D electron density maps using the ELF. To capture both therapeutic relevance and interaction risks, MMM combines ELF-derived features that encode global electronic properties with a bipartite graph encoder that models local substructure interactions. This design enables learning complementary characteristics of drug molecules. We evaluate MMM on the MIMIC-III dataset (250 drugs, 442 substructures), comparing it with several baseline models. In particular, a comparison with the GNN-based SafeDrug model demonstrates statistically significant improvements in the F1-score (p = 0.0387), Jaccard (p = 0.0112), and the DDI rate (p = 0.0386). These results demonstrate the potential of ELF-based 3D representations to enhance prediction accuracy and support safer combinatorial drug prescribing in clinical practice.

155. Contrastive Weak-to-strong Generalization

Authors: Houcheng Jiang , Junfeng Fang , Jiaxin Wu , Tianyu Zhang , Chen Gao , Yong Li , Xiang Wang , Xiangnan He , Yang Deng
URL: https://arxiv.org/abs/2510.07884
Abstract:

Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.
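
The link between implicit rewards and contrastive decoding can be illustrated with a single decoding step. This is a minimal sketch under stated assumptions, not the ConG implementation: the post-alignment weak model plays the role of the "expert", the pre-alignment model the "amateur", each token is scored by the log-likelihood ratio, and a plausibility cutoff `alpha` (as in the original Contrastive Decoding formulation) filters out tokens the post-alignment model itself finds unlikely. The function name and the dictionary-based interface are hypothetical.

```python
import math
from typing import Dict


def contrastive_next_token(
    logp_post: Dict[str, float],
    logp_pre: Dict[str, float],
    alpha: float = 0.1,
) -> str:
    """Pick the next token by log p_post(token) - log p_pre(token).

    Only tokens whose post-alignment probability is at least
    alpha * max_token p_post(token) are considered.
    """
    cutoff = math.log(alpha) + max(logp_post.values())
    best_token, best_score = "", -math.inf
    for token, lp_post in logp_post.items():
        if lp_post < cutoff:
            continue  # implausible under the post-alignment model
        score = lp_post - logp_pre[token]  # log-likelihood ratio
        if score > best_score:
            best_token, best_score = token, score
    return best_token
```

Without the cutoff, a token that both models consider very unlikely can win purely because its ratio is large, which is why the plausibility filter is included in this sketch.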

Authors: Erjia Xiao , Lingfeng Zhang , Yingbo Tang , Hao Cheng , Renjing Xu , Wenbo Ding , Lei Zhou , Long Chen , Hangjun Ye , Xiaoshuai Hao
URL: https://arxiv.org/abs/2510.07871
Abstract:

In this report, we describe the technical details of our submission to the IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on developing RGBD-based perception and navigation systems that enable autonomous agents to navigate safely, efficiently, and socially compliantly in dynamic human-populated indoor environments. The challenge requires agents to operate from an egocentric perspective using only onboard sensors including RGB-D observations and odometry, without access to global maps or privileged information, while maintaining social norm compliance such as safe distances and collision avoidance. Building upon the Falcon model, we introduce a Proactive Risk Perception Module to enhance social navigation performance. Our approach augments Falcon with collision risk understanding that learns to predict distance-based collision risk scores for surrounding humans, which enables the agent to develop more robust spatial awareness and proactive collision avoidance behaviors. The evaluation on the Social-HM3D benchmark demonstrates that our method improves the agent’s ability to maintain personal space compliance while navigating toward goals in crowded indoor scenes with dynamic human agents, achieving 2nd place among 16 participating teams in the challenge.
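
As a rough illustration of what a distance-based collision risk score could look like, the sketch below maps the distance between the agent and a human to a value in [0, 1]. The report's abstract does not define the score, so the linear ramp, the thresholds, and the function name are all assumptions made for illustration.

```python
import math
from typing import Tuple


def collision_risk(
    agent_xy: Tuple[float, float],
    human_xy: Tuple[float, float],
    safe_distance: float = 1.0,  # assumed: at or below this, risk is maximal
    max_distance: float = 3.0,   # assumed: at or beyond this, risk is zero
) -> float:
    """Return a risk score in [0, 1] that decreases linearly with distance."""
    d = math.dist(agent_xy, human_xy)
    if d <= safe_distance:
        return 1.0
    if d >= max_distance:
        return 0.0
    return (max_distance - d) / (max_distance - safe_distance)
```

A score of this kind could serve as a regression target for each surrounding human, computed from simulator ground truth during training.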

157. DM1: MeanFlow with Dispersive Regularization for 1-Step Robotic Manipulation

Authors: Guowei Zou , Haitao Wang , Hejun Wu , Yukun Qian , Yuhang Wang , Weibing Li
URL: https://arxiv.org/abs/2510.07865
Abstract:

The ability to learn multi-modal action distributions is indispensable for robotic manipulation policies to perform precise and robust control. Flow-based generative models have recently emerged as a promising solution to learning distributions of actions, offering one-step action generation and thus achieving much higher sampling efficiency compared to diffusion-based methods. However, existing flow-based policies suffer from representation collapse, the inability to distinguish similar visual representations, leading to failures in precise manipulation tasks. We propose DM1 (MeanFlow with Dispersive Regularization for One-Step Robotic Manipulation), a novel flow matching framework that integrates dispersive regularization into MeanFlow to prevent collapse while maintaining one-step efficiency. DM1 employs multiple dispersive regularization variants across different intermediate embedding layers, encouraging diverse representations across training batches without introducing additional network modules or specialized training procedures. Experiments on RoboMimic benchmarks show that DM1 achieves 20-40 times faster inference (0.07s vs. 2-3.5s) and improves success rates by 10-20 percentage points, with the Lift task reaching 99% success versus 85% for the baseline. Real-robot deployment on a Franka Panda further validates that DM1 transfers effectively from simulation to the physical world. To the best of our knowledge, this is the first work to leverage representation regularization to enable flow-based policies to achieve strong performance in robotic manipulation, establishing a simple yet powerful approach for efficient and robust manipulation.

158. Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials

Authors: Thomas Lautenschlager , Nils Friederich , Angelo Jovin Yamachui Sitcheu , Katja Nau , Gaëlle Hayot , Thomas Dickmeis , Ralf Mikut
URL: https://arxiv.org/abs/2510.07853
Abstract:

High-throughput toxicity testing offers a fast and cost-effective way to test large amounts of compounds. A key component for such systems is the automated evaluation via machine learning models. In this paper, we address critical challenges in this domain and demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. We provide a proof-of-concept that utilizes the publicly available EmbryoNet dataset, which contains ten zebrafish embryo phenotypes elicited by various chemical compounds targeting different processes in early embryonic development. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds. Finally, we discuss the integration of machine learning models in a physical toxicity testing device in the context of the TOXBOX project.

159. Meta-Learning Based Few-Shot Graph-Level Anomaly Detection

Authors: Liting Li , Yumeng Wang , Yueheng Sun
URL: https://arxiv.org/abs/2510.07847
Abstract:

Graph-level anomaly detection aims to identify anomalous graphs or subgraphs within graph datasets, playing a vital role in various fields such as fraud detection, review classification, and biochemistry. While Graph Neural Networks (GNNs) have made significant progress in this domain, existing methods rely heavily on large amounts of labeled data, which is often unavailable in real-world scenarios. Additionally, few-shot anomaly detection methods based on GNNs are prone to noise interference, resulting in poor embedding quality and reduced model robustness. To address these challenges, we propose a novel meta-learning-based graph-level anomaly detection framework (MA-GAD), incorporating a graph compression module that reduces the graph size, mitigating noise interference while retaining essential node information. We also leverage meta-learning to extract meta-anomaly information from similar networks, enabling the learning of an initialization model that can rapidly adapt to new tasks with limited samples. This improves the anomaly detection performance on target graphs, and a bias network is used to enhance the distinction between anomalous and normal nodes. Our experimental results, based on four real-world biochemical datasets, demonstrate that MA-GAD outperforms existing state-of-the-art methods in graph-level anomaly detection under few-shot conditions. Experiments on both graph anomaly and subgraph anomaly detection tasks validate the framework’s effectiveness on real-world datasets.

160. AdaSwitch: Adaptive Switching Generation for Knowledge Distillation

Authors: Jingyu Peng , Maolin Wang , Hengyi Cai , Yuchen Li , Kai Zhang , Shuaiqiang Wang , Dawei Yin , Xiangyu Zhao
URL: https://arxiv.org/abs/2510.07842
Abstract:

Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.

161. Self-Improving LLM Agents at Test-Time

Authors: Emre Can Acikgoz , Cheng Qian , Heng Ji , Dilek Hakkani-Tür , Gokhan Tur
URL: https://arxiv.org/abs/2510.07841
Abstract:

One paradigm of language model (LM) fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) it first identifies the samples that the model struggles with (self-awareness), (ii) it then generates similar examples from the detected uncertain samples (self-data augmentation), and (iii) it uses these newly generated samples for test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves performance with a +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods while using 68x fewer training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.

162. MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation

Authors: Weisen Jiang , Sinno Jialin Pan
URL: https://arxiv.org/abs/2510.07835
Abstract:

This paper introduces MetaDefense, a novel framework for defending against finetuning-based jailbreak attacks in large language models (LLMs). We observe that existing defense mechanisms fail to generalize to harmful queries disguised by unseen attack templates, despite LLMs being capable of distinguishing disguised harmful queries in the embedding space. Based on these insights, we propose a two-stage defense approach: (i) pre-generation defense that detects harmful queries before response generation begins, and (ii) mid-generation defense that monitors partial responses during generation to prevent outputting more harmful content. Our MetaDefense trains the LLM to predict the harmfulness of both queries and partial responses using specialized prompts, enabling early termination of potentially harmful interactions. Extensive experiments across multiple LLM architectures (LLaMA-2-7B, Qwen-2.5-3B-Instruct, and LLaMA-3.2-3B-Instruct) demonstrate that MetaDefense significantly outperforms existing defense mechanisms, achieving robust defense against harmful queries with seen and unseen attack templates while maintaining competitive performance on benign tasks. Code is available at this https URL .
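
The two-stage control flow can be sketched as below. This is an illustration under stated assumptions rather than the MetaDefense code: in the paper the LLM itself predicts harmfulness via specialized prompts, whereas here `is_harmful_query`, `is_harmful_partial`, and `next_token` are hypothetical callables, and the refusal string and check interval are arbitrary choices.

```python
from typing import Callable, List, Optional

REFUSAL = "I cannot help with that request."  # placeholder refusal text


def guarded_generate(
    query: str,
    is_harmful_query: Callable[[str], bool],
    next_token: Callable[[str, List[str]], Optional[str]],
    is_harmful_partial: Callable[[str, List[str]], bool],
    max_tokens: int = 50,
    check_every: int = 5,
) -> str:
    # Stage 1 (pre-generation): reject before any token is produced.
    if is_harmful_query(query):
        return REFUSAL
    tokens: List[str] = []
    for step in range(1, max_tokens + 1):
        token = next_token(query, tokens)
        if token is None:  # the generator signals end of sequence
            break
        tokens.append(token)
        # Stage 2 (mid-generation): periodically inspect the partial response.
        if step % check_every == 0 and is_harmful_partial(query, tokens):
            return REFUSAL  # terminate early and discard the partial output
    return " ".join(tokens)
```

The mid-generation check matters for queries that pass the first stage, for example because an unseen attack template disguises them, but whose responses drift toward harmful content.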

163. The Rise of the Knowledge Sculptor: A New Archetype for Knowledge Work in the Age of Generative AI

Authors: Cathal Doyle
URL: https://arxiv.org/abs/2510.07829
Abstract:

In the Generative Age, the nature of knowledge work is transforming. Traditional models that emphasise the organisation and retrieval of pre-existing information are increasingly inadequate in the face of generative AI (GenAI) systems capable of autonomous content creation. This paper introduces the Knowledge Sculptor (KS), a new professional archetype for Human-GenAI collaboration that transforms raw AI output into trustworthy, actionable knowledge. Grounded in a socio-technical perspective, the KS is conceptualised through a framework of competencies, including architecting a vision, iterative dialogue, information sculpting, and curiosity-driven synthesis. A practice-based vignette illustrates the KS role in action, and in a self-referential approach, the paper itself serves as an artefact of the sculpting process it describes.

164. SIMU: Selective Influence Machine Unlearning

Authors: Anu Agarwal , Mihir Pamnani , Dilek Hakkani-Tur
URL: https://arxiv.org/abs/2510.07822
Abstract:

The undesired memorization of sensitive information by Large Language Models (LLMs) has emphasized the need for safety mechanisms that can regulate model behavior. This has led to the development of machine unlearning techniques that enable models to precisely forget sensitive and unwanted information. For machine unlearning, first-order and second-order optimizer-based methods have shown significant progress in enabling LLMs to forget targeted information. However, in doing so, these approaches often compromise the model’s original capabilities, resulting in unlearned models that struggle to retain their prior knowledge and overall utility. To address this, we propose Selective Influence Machine Unlearning (SIMU), a two-step framework that enhances second-order optimizer-based unlearning by selectively updating only the critical neurons responsible for encoding the forget-set. By constraining updates to these targeted neurons, SIMU achieves comparable unlearning efficacy while substantially outperforming current methods in retaining the model’s original knowledge.

165. Effective and Stealthy One-Shot Jailbreaks on Deployed Mobile Vision-Language Agents

Authors: Renhua Ding , Xiao Yang , Zhengwei Fang , Jun Luo , Kun He , Jun Zhu
URL: https://arxiv.org/abs/2510.07809
Abstract:

Large vision-language models (LVLMs) enable autonomous mobile agents to operate smartphone user interfaces, yet vulnerabilities to UI-level attacks remain critically understudied. Existing research often depends on conspicuous UI overlays, elevated permissions, or impractical threat models, limiting stealth and real-world applicability. In this paper, we present a practical and stealthy one-shot jailbreak attack that leverages in-app prompt injections: malicious applications embed short prompts in UI text that remain inert during human interaction but are revealed when an agent drives the UI via ADB (Android Debug Bridge). Our framework comprises three crucial components: (1) low-privilege perception-chain targeting, which injects payloads into malicious apps as the agent’s visual inputs; (2) stealthy user-invisible activation, a touch-based trigger that discriminates agent from human touches using physical touch attributes and exposes the payload only during agent operation; and (3) one-shot prompt efficacy, a heuristic-guided, character-level iterative-deepening search algorithm (HG-IDA*) that performs one-shot, keyword-level detoxification to evade on-device safety filters. We evaluate across multiple LVLM backends, including closed-source services and representative open-source models within three Android applications, and we observe high planning and execution hijack rates in single-shot scenarios (e.g., GPT-4o: 82.5% planning / 75.0% execution). These findings expose a fundamental security vulnerability in current mobile agents with immediate implications for autonomous smartphone operation.

166. Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

Authors: Eric Hanchen Jiang , Guancheng Wan , Sophia Yin , Mengting Li , Yuchen Wu , Xiao Liang , Xinfeng Li , Yizhou Sun , Wei Wang , Kai-Wei Chang , Ying Nian Wu
URL: https://arxiv.org/abs/2510.07799
Abstract:

The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.

167. HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Authors: Peilin Wu , Mian Zhang , Kun Wan , Wentian Zhao , Kaiyu He , Xinya Du , Zhiyu Chen
URL: https://arxiv.org/abs/2510.07794
Abstract:

Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which lead to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in an RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent’s reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL for improving the efficiency and optimality of reasoning for search agents.
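
The reward structure described above can be sketched as follows. This is a minimal sketch, not the HiPRAG reward: the reward magnitudes, the `bonus_weight`, and the choice to grant the process bonus only when both the outcome and format checks pass are assumptions, and the `search_needed` label is a hypothetical stand-in for the paper's on-the-fly necessity judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    searched: bool       # did the agent issue a search in this step?
    search_needed: bool  # hypothetical label: was retrieval actually necessary?


def step_is_optimal(step: Step) -> bool:
    """A step is optimal when it searches if and only if a search was needed."""
    return step.searched == step.search_needed


def hierarchical_reward(
    steps: List[Step],
    answer_correct: bool,
    format_ok: bool,
    bonus_weight: float = 0.5,  # assumed weight of the process bonus
) -> float:
    # Base reward: outcome plus a small format term (assumed magnitudes).
    base = (1.0 if answer_correct else 0.0) + (0.1 if format_ok else 0.0)
    # Assumed gating: the process bonus applies only on top of a correct,
    # well-formatted trajectory.
    if not (answer_correct and format_ok) or not steps:
        return base
    optimal_ratio = sum(step_is_optimal(s) for s in steps) / len(steps)
    return base + bonus_weight * optimal_ratio
```

Here a step with `searched=True, search_needed=False` counts as over-search and one with `searched=False, search_needed=True` as under-search; both lower the bonus.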

168. LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

Authors: Sajib Acharjee Dip , Adrika Zafor , Bikash Kumar Paul , Uddip Acharjee Shuvo , Muhit Islam Emon , Xuan Wang , Liqing Zhang
URL: https://arxiv.org/abs/2510.07793
Abstract:

Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.

169. IntentionVLA: Generalizable and Efficient Embodied Intention Reasoning for Human-Robot Interaction

Authors: Yandu Chen , Kefan Gu , Yuqing Wen , Yucheng Zhao , Tiancai Wang , Liqiang Nie
URL: https://arxiv.org/abs/2510.07778
Abstract:

Vision-Language-Action (VLA) models leverage pretrained vision-language models (VLMs) to couple perception with robotic control, offering a promising path toward general-purpose embodied intelligence. However, current SOTA VLAs are primarily pretrained on multimodal tasks with limited relevance to embodied scenarios, and then finetuned to map explicit instructions to actions. Consequently, due to the lack of reasoning-intensive pretraining and reasoning-guided manipulation, these models are unable to perform implicit human intention reasoning required for complex, real-world interactions. To overcome these limitations, we propose IntentionVLA, a VLA framework with a curriculum training paradigm and an efficient inference mechanism. Our proposed method first leverages carefully designed reasoning data that combine intention inference, spatial grounding, and compact embodied reasoning, endowing the model with both reasoning and perception capabilities. In the following finetuning stage, IntentionVLA employs the compact reasoning outputs as contextual guidance for action generation, enabling fast inference under indirect instructions. Experimental results show that IntentionVLA substantially outperforms $\pi_0$, achieving 18% higher success rates with direct instructions and 28% higher than ECoT under intention instructions. On out-of-distribution intention tasks, IntentionVLA achieves over twice the success rate of all baselines, and further enables zero-shot human-robot interaction with 40% success rate. These results highlight IntentionVLA as a promising paradigm for next-generation human-robot interaction (HRI) systems.

170. Drift No More? Context Equilibria in Multi-Turn LLM Interactions

Authors: Vardhan Dongre , Ryan A. Rossi , Viet Dac Lai , David Seunghyun Yoon , Dilek Hakkani-Tür , Trung Bui
URL: https://arxiv.org/abs/2510.07777
Abstract:

Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model’s outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in $\tau$-Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.
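
A small sketch can make the two ingredients of this abstract concrete: a turn-wise KL divergence and a bounded recurrence with a restoring force. The abstract does not give the exact recurrence or the direction of the KL term, so the linear update, the parameter names (`drive`, `restore`, `reminder_strength`), and the choice of the test model as the first argument are assumptions for illustration only.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def simulate_drift(turns, drive=0.1, restore=0.2, noise_std=0.0,
                   reminder_every=None, reminder_strength=0.5, seed=0):
    """Toy recurrence: D[t+1] = D[t] + drive - restore * D[t] + noise, clipped at 0.

    At reminder turns the divergence is additionally shrunk by `reminder_strength`
    (a hypothetical model of a reminder intervention).
    """
    rng = random.Random(seed)
    d, history = 0.0, []
    for t in range(1, turns + 1):
        d = d + drive - restore * d + rng.gauss(0.0, noise_std)
        if reminder_every and t % reminder_every == 0:
            d *= 1.0 - reminder_strength
        d = max(d, 0.0)
        history.append(d)
    return history
```

Without noise, this toy recurrence settles at `drive / restore` instead of growing without bound, which mirrors the equilibrium picture in the abstract; periodic reminders lower the level it settles around.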

171. Trajectory Conditioned Cross-embodiment Skill Transfer

Authors: YuHang Tang , Yixuan Lou , Pengfei Han , Haoming Song , Xinyi Ye , Dong Wang , Bin Zhao
URL: https://arxiv.org/abs/2510.07773
Abstract:

Learning manipulation skills from human demonstration videos presents a promising yet challenging problem, primarily due to the significant embodiment gap between the human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer, enabling robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. Extensive experiments are conducted, and the results on simulation data (MetaWorld) show that TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state-of-the-art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments in kitchen manipulation tasks further validate the effectiveness of our approach, demonstrating practical human-to-robot skill transfer across embodiments.

172. ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Authors: Murong Yue , Zhiwei Liu , Liangwei Yang , Jianguo Zhang , Zuxin Liu , Haolin Chen , Ziyu Yao , Silvio Savarese , Caiming Xiong , Shelby Heinecke , Huan Wang
URL: https://arxiv.org/abs/2510.07768
Abstract:

Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific tools increases.

173. A Unified Multi-Task Learning Framework for Generative Auto-Bidding with Validation-Aligned Optimization

Authors: Yiqin Lv , Zhiyu Mou , Miao Xu , Jinghao Chen , Qi Wang , Yixiu Mao , Yun Qu , Rongquan Bai , Chuan Yu , Jian Xu , Bo Zheng , Xiangyang Ji
URL: https://arxiv.org/abs/2510.07760
Abstract:

In online advertising, heterogeneous advertiser requirements give rise to numerous customized bidding tasks that are typically optimized independently, resulting in extensive computation and limited data efficiency. Multi-task learning offers a principled framework to train these tasks jointly through shared representations. However, existing multi-task optimization strategies are primarily guided by training dynamics and often generalize poorly in volatile bidding environments. To this end, we present Validation-Aligned Multi-task Optimization (VAMO), which adaptively assigns task weights based on the alignment between per-task training gradients and a held-out validation gradient, thereby steering updates toward validation improvement and better matching deployment objectives. We further equip the framework with a periodicity-aware temporal module and couple it with an advanced generative auto-bidding backbone to enhance cross-task transfer of seasonal structure and strengthen bidding performance. Meanwhile, we provide theoretical insights into the proposed method, e.g., convergence guarantee and alignment analysis. Extensive experiments on both simulated and large-scale real-world advertising systems consistently demonstrate significant improvements over typical baselines, illuminating the effectiveness of the proposed approach.
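
The idea of weighting tasks by how well their training gradients align with a held-out validation gradient can be sketched as follows. The abstract does not specify the alignment measure or the weighting rule, so cosine similarity and a temperature-scaled softmax are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def alignment_weights(task_grads, val_grad, temperature=1.0):
    """Softmax over per-task alignment scores with the validation gradient."""
    scores = [cosine(g, val_grad) / temperature for g in task_grads]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_update(task_grads, weights):
    """Weighted sum of the per-task gradients."""
    dim = len(task_grads[0])
    return [sum(w * g[i] for w, g in zip(weights, task_grads)) for i in range(dim)]
```

In this sketch, a task whose gradient points along the validation gradient receives the largest weight, and a task whose gradient opposes it receives the smallest, so the combined update is steered toward validation improvement.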

174. Parallel Test-Time Scaling for Latent Reasoning Models

Authors: Runyang You , Yongqi Li , Meng Liu , Wenjie Wang , Liqiang Nie , Wenjie Li
URL: https://arxiv.org/abs/2510.07745
Abstract:

Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation.
This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at this https URL .
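
The two sampling strategies named above can be illustrated on a bare latent vector. This is a toy stand-in under stated assumptions: in practice Monte Carlo Dropout keeps dropout active inside the network at inference time, whereas the sketch applies a dropout mask directly to a vector, and the `score` callable is a hypothetical placeholder for a learned scorer such as LatentRM.

```python
import random


def gaussian_perturb(latent, sigma, rng):
    """Additive Gaussian noise on every coordinate of the latent vector."""
    return [x + rng.gauss(0.0, sigma) for x in latent]


def dropout_perturb(latent, p, rng):
    """Inverted dropout on a vector: zero each entry with probability p, rescale the rest."""
    if p <= 0.0:
        return list(latent)
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in latent]


def best_of_n(latent, n, perturb, score, seed=0):
    """Sample n perturbed latents and return the highest-scoring one plus all candidates."""
    rng = random.Random(seed)
    candidates = [perturb(latent, rng) for _ in range(n)]
    return max(candidates, key=score), candidates
```

A call such as `best_of_n(z, 8, lambda v, r: gaussian_perturb(v, 0.1, r), score)` draws eight noisy variants of `z` and keeps the one the scorer prefers, which is the sample-then-aggregate pattern the abstract describes.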

175. UltraLED: Learning to See Everything in Ultra-High Dynamic Range Scenes

Authors: Yuang Meng , Xin Jin , Lina Lei , Chun-Le Guo , Chongyi Li
URL: https://arxiv.org/abs/2510.07741
Abstract:

Ultra-high dynamic range (UHDR) scenes exhibit significant exposure disparities between bright and dark regions. Such conditions are commonly encountered in nighttime scenes with light sources. Even with standard exposure settings, a bimodal intensity distribution with boundary peaks often emerges, making it difficult to preserve both highlight and shadow details simultaneously. RGB-based bracketing methods can capture details at both ends using short-long exposure pairs, but are susceptible to misalignment and ghosting artifacts. We found that a short-exposure image already retains sufficient highlight detail. The main challenge of UHDR reconstruction lies in denoising and recovering information in dark regions. In comparison to the RGB images, RAW images, thanks to their higher bit depth and more predictable noise characteristics, offer greater potential for addressing this challenge. This raises a key question: can we learn to see everything in UHDR scenes using only a single short-exposure RAW image? In this study, we rely solely on a single short-exposure frame, which inherently avoids ghosting and motion blur, making it particularly robust in dynamic scenes. To achieve that, we introduce UltraLED, a two-stage framework that performs exposure correction via a ratio map to balance dynamic range, followed by a brightness-aware RAW denoiser to enhance detail recovery in dark regions. To support this setting, we design a 9-stop bracketing pipeline to synthesize realistic UHDR images and contribute a corresponding dataset based on diverse scenes, using only the shortest exposure as input for reconstruction. Extensive experiments show that UltraLED significantly outperforms existing single-frame approaches. Our code and dataset are made publicly available at this https URL .

176. AppForge: From Assistant to Independent Developer - Are GPTs Ready for Software Development?

Authors: Dezhi Ran , Yuan Cao , Mengzhou Wu , Simin Chen , Yuzhe Guo , Jun Ren , Zihe Song , Hao Yu , Jialei Wei , Linyi Li , Wei Yang , Baishakhi Ray , Tao Xie
URL: https://arxiv.org/abs/2510.07740
Abstract:

Large language models (LLMs) have demonstrated remarkable capability in function-level code generation tasks. Unlike isolated functions, real-world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistency across states over time, and ensure the application behaves correctly within the lifecycle and framework constraints. Yet, no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch. To address this gap, we propose APPFORGE, a benchmark consisting of 101 software development problems drawn from real-world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with implementing the functionality into an Android app from scratch. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context-aware, robust, and maintainable code. To construct APPFORGE, we design a multi-agent system to automatically summarize the main functionalities from app documents and navigate the app to synthesize test cases validating the functional correctness of app implementation. Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases within an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation on 12 flagship LLMs shows that all evaluated models achieve low effectiveness, with the best-performing model (GPT-5) developing only 18.8% functionally correct applications, highlighting fundamental limitations in current models’ ability to handle complex, multi-component software engineering challenges.

177. MeSH: Memory-as-State-Highways for Recursive Transformers

Authors: Chengting Yu , Xiaobo Shu , Yadao Wang , Yizhen Zhang , Haoyi Wu , Jiaang Li , Rujiao Long , Ziheng Chen , Yuchi Xu , Wenbo Su , Bo Zheng
URL: https://arxiv.org/abs/2510.07739
Abstract:

Recursive transformers reuse parameters and iterate over hidden states multiple times, decoupling compute depth from parameter depth. However, under matched compute, recursive models with fewer parameters often lag behind non-recursive counterparts. By probing hidden states, we trace this performance gap to two primary bottlenecks: undifferentiated computation, where the core is forced to adopt a similar computational pattern at every iteration, and information overload, where long-lived and transient information must coexist in a single hidden state. To address the issues, we introduce a Memory-as-State-Highways (MeSH) scheme, which externalizes state management into an explicit memory buffer and employs lightweight routers to dynamically diversify computation across iterations. Probing visualizations confirm that MeSH successfully resolves the pathologies by inducing functional specialization across iterations. On the Pythia suite (160M-1.4B), MeSH-enhanced recursive transformers consistently improve over recursive baselines and outperform their larger non-recursive counterpart at the 1.4B scale, improving average downstream accuracy by +1.06% with 33% fewer non-embedding parameters. Our analysis establishes MeSH as a scalable and principled architecture for building stronger recursive models.

178. DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

Authors: Changyeon Kim , Haeone Lee , Younggyo Seo , Kimin Lee , Yuke Zhu
URL: https://arxiv.org/abs/2510.07730
Abstract:

Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions and can be interpreted through the options framework via semi-Markov decision process Q-learning, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high return in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.
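
The options / semi-Markov view mentioned in this abstract treats a sequence of n actions as one temporally extended action, so the bootstrapped target discounts the next value by gamma to the power n. The sketch below shows only that standard n-step target; it does not implement the paper's detached value learning, and the function name is hypothetical.

```python
def action_sequence_target(rewards, gamma, bootstrap_value):
    """n-step (SMDP-style) target for an action sequence of length n = len(rewards).

    target = sum_k gamma**k * rewards[k] + gamma**n * bootstrap_value
    """
    target = 0.0
    discount = 1.0
    for r in rewards:
        target += discount * r
        discount *= gamma
    # After the loop, discount == gamma ** len(rewards).
    return target + discount * bootstrap_value
```

With a sequence length of n, a task that needs H single-step decisions needs only about H / n value backups, which is the sense in which longer sequences shorten the effective planning horizon.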

179. Causality Guided Representation Learning for Cross-Style Hate Speech Detection

Authors: Chengshuai Zhao , Shu Wan , Paras Sheth , Karan Patwa , K. Selçuk Candan , Huan Liu
URL: https://arxiv.org/abs/2510.07707
Abstract:

The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language – making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.

180. Rethinking Reasoning: A Survey on Reasoning-based Backdoors in LLMs

Authors: Man Hu , Xinyi Wu , Zuofeng Suo , Jinbo Feng , Linghui Meng , Yanhao Jia , Anh Tuan Luu , Shuai Zhao
URL: https://arxiv.org/abs/2510.07697
Abstract:

With the rise of advanced reasoning capabilities, large language models (LLMs) are receiving increasing attention. However, although reasoning improves LLMs’ performance on downstream tasks, it also introduces new security risks, as adversaries can exploit these capabilities to conduct backdoor attacks. Existing surveys on backdoor attacks and reasoning security offer comprehensive overviews but lack in-depth analysis of backdoor attacks and defenses targeting LLMs’ reasoning abilities. In this paper, we take the first step toward providing a comprehensive review of reasoning-based backdoor attacks in LLMs by analyzing their underlying mechanisms, methodological frameworks, and unresolved challenges. Specifically, we introduce a new taxonomy that offers a unified perspective for summarizing existing approaches, categorizing reasoning-based backdoor attacks into associative, passive, and active. We also present defense strategies against such attacks and discuss current challenges alongside potential directions for future research. This work offers a novel perspective, paving the way for further exploration of secure and trustworthy LLM communities.

181. Stress-Testing Model Specs Reveals Character Differences among Language Models

Authors: Jifan Zhang , Henry Sleight , Andi Peng , John Schulman , Esin Durmus
URL: https://arxiv.org/abs/2510.07686
Abstract:

Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models.

182. Curriculum Learning with Synthetic Data for Enhanced Pulmonary Nodule Detection in Chest Radiographs

Authors: Pranav Sambhu , Om Guin , Madhav Sambhu , Jinho Cha
URL: https://arxiv.org/abs/2510.07681
Abstract:

This study evaluates whether integrating curriculum learning with diffusion-based synthetic augmentation can enhance the detection of difficult pulmonary nodules in chest radiographs, particularly those with low size, brightness, and contrast, which often challenge conventional AI models due to data imbalance and limited annotation. A Faster R-CNN with a Feature Pyramid Network (FPN) backbone was trained on a hybrid dataset comprising expert-labeled NODE21 (1,213 patients; 52.4 percent male; mean age 63.2 +/- 11.5 years), VinDr-CXR, CheXpert, and 11,206 DDPM-generated synthetic images. Difficulty scores based on size, brightness, and contrast guided curriculum learning. Performance was compared to a non-curriculum baseline using mean average precision (mAP), Dice score, and area under the curve (AUC). Statistical tests included bootstrapped confidence intervals, DeLong tests, and paired t-tests. The curriculum model achieved a mean AUC of 0.95 versus 0.89 for the baseline (p < 0.001), with improvements in sensitivity (70 percent vs. 48 percent) and accuracy (82 percent vs. 70 percent). Stratified analysis demonstrated consistent gains across all difficulty bins (Easy to Very Hard). Grad-CAM visualizations confirmed more anatomically focused attention under curriculum learning. These results suggest that curriculum-guided synthetic augmentation enhances model robustness and generalization for pulmonary nodule detection.

183. Controllable Video Synthesis via Variational Inference

Authors: Haoyi Duan , Yunzhi Zhang , Yilun Du , Jiajun Wu
URL: https://arxiv.org/abs/2510.07670
Abstract:

Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, while existing video generative models are typically trained for fixed input formats. We develop a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break down the problem into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior works.

184. TCIP: Threshold-Controlled Iterative Pyramid Network for Deformable Medical Image Registration

Authors: Heming Wu , Di Wang , Tai Ma , Peng Zhao , Yubin Xiao , Zhongke Wu , Xing-Ce Wang , Chuang Li , Xuan Wu , You Zhou
URL: https://arxiv.org/abs/2510.07666
Abstract:

Although pyramid networks have demonstrated superior performance in deformable medical image registration, their decoder architectures are inherently prone to propagating and accumulating anatomical structure misalignments. Moreover, most existing models do not adaptively determine the number of iterations for optimization under varying deformation requirements across images, resulting in either premature termination or excessive iterations that degrade registration accuracy. To effectively mitigate the accumulation of anatomical misalignments, we propose the Feature-Enhanced Residual Module (FERM) as the core component of each decoding layer in the pyramid network. FERM comprises three sequential blocks that extract anatomical semantic features, learn to suppress irrelevant features, and estimate the final deformation field, respectively. To adaptively determine the number of iterations for varying images, we propose the dual-stage Threshold-Controlled Iterative (TCI) strategy. In the first stage, TCI assesses registration stability; once stability is established, it proceeds to the second stage to evaluate convergence. We coin the model that integrates FERM and TCI as Threshold-Controlled Iterative Pyramid (TCIP). Extensive experiments on three public brain MRI datasets and one abdomen CT dataset demonstrate that TCIP outperforms the state-of-the-art (SOTA) registration networks in terms of accuracy, while maintaining comparable inference speed and a compact model parameter size. Finally, we assess the generalizability of FERM and TCI by integrating them with existing registration networks and further conduct ablation studies to validate the effectiveness of these two proposed methods.

185. IKNet: Interpretable Stock Price Prediction via Keyword-Guided Integration of News and Technical Indicators

Authors: Jinwoong Kim , Sangjin Park
URL: https://arxiv.org/abs/2510.07661
Abstract:

The increasing influence of unstructured external information, such as news articles, on stock prices has attracted growing attention in financial markets. Despite recent advances, most existing news-based forecasting models represent all articles using sentiment scores or average embeddings that capture the general tone but fail to provide quantitative, context-aware explanations of the impacts of public sentiment on predictions. To address this limitation, we propose an interpretable keyword-guided network (IKNet), which is an explainable forecasting framework that models the semantic association between individual news keywords and stock price movements. The IKNet identifies salient keywords via FinBERT-based contextual analysis, processes each embedding through a separate nonlinear projection layer, and integrates their representations with the time-series data of technical indicators to forecast next-day closing prices. By applying Shapley Additive Explanations, the model generates quantifiable and interpretable attributions for the contribution of each keyword to predictions. Empirical evaluations of S&P 500 data from 2015 to 2024 demonstrate that IKNet outperforms baselines, including recurrent neural networks and transformer models, reducing RMSE by up to 32.9% and improving cumulative returns by 18.5%. Moreover, IKNet enhances transparency by offering contextualized explanations of volatility events driven by public sentiment.

186. OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Authors: Yuzhe Gu , Xiyu Liang , Jiaojiao Zhao , Enmao Diao
URL: https://arxiv.org/abs/2510.07651
Abstract:

Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache’s output-aware scores consistently improves long-context accuracy.
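
To illustrate what an output-aware saliency score means, the sketch below measures, for a single query, how much the attention output changes when each cached token is evicted and the remaining weights are renormalized. It is a brute-force toy for one head and one query, not the closed-form OBCache scores from the paper; all function names are hypothetical.

```python
import math


def attention_weights(query, keys):
    """Scaled dot-product softmax weights of one query over the cached keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def output_perturbation_scores(query, keys, values):
    """Squared L2 change in the attention output when each cached token is evicted."""
    full = attend(attention_weights(query, keys), values)
    scores = []
    for j in range(len(keys)):
        kept_keys = keys[:j] + keys[j + 1:]
        kept_values = values[:j] + values[j + 1:]
        pruned = attend(attention_weights(query, kept_keys), kept_values)
        scores.append(sum((a - b) ** 2 for a, b in zip(full, pruned)))
    return scores
```

In this toy setting the change from evicting token j works out to a_j (o - v_j) / (1 - a_j), where a_j is its attention weight, v_j its value, and o the full output. A token with a large attention weight can therefore still be cheap to evict when its value is close to the current output, which a ranking based on accumulated attention weights alone would miss.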

187. Value Flows

Authors: Perry Dong , Chongyi Zheng , Chelsea Finn , Dorsa Sadigh , Benjamin Eysenbach
URL: https://arxiv.org/abs/2510.07650
Abstract:

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: this https URL Code: this https URL

188. Banking Done Right: Redefining Retail Banking with Language-Centric AI

Authors: Xin Jie Chua , Jeraelyn Ming Li Tan , Jia Xuan Tan , Soon Chang Poh , Yi Xian Goh , Debbie Hui Tian Choong , Chee Mun Foong , Sze Jue Yang , Chee Seng Chan
URL: https://arxiv.org/abs/2510.07645
Abstract:

This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank’s infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.

189. Retentive Relevance: Capturing Long-Term User Value in Recommendation Systems

Authors: Saeideh Bakhshi , Phuong Mai Nguyen , Robert Schiller , Tiantian Xu , Pawan Kodandapani , Andrew Levine , Cayman Simpson , Qifan Wang
URL: https://arxiv.org/abs/2510.07621
Abstract:

Recommendation systems have traditionally relied on short-term engagement signals, such as clicks and likes, to personalize content. However, these signals are often noisy, sparse, and insufficient for capturing long-term user satisfaction and retention. We introduce Retentive Relevance, a novel content-level survey-based feedback measure that directly assesses users’ intent to return to the platform for similar content. Unlike other survey measures that focus on immediate satisfaction, Retentive Relevance targets forward-looking behavioral intentions, capturing longer-term user intentions and providing a stronger predictor of retention. We validate Retentive Relevance using psychometric methods, establishing its convergent, discriminant, and behavioral validity. Through large-scale offline modeling, we show that Retentive Relevance significantly outperforms both engagement signals and other survey measures in predicting next-day retention, especially for users with limited historical engagement. We develop a production-ready proxy model that integrates Retentive Relevance into the final stage of a multi-stage ranking system on a social media platform. Calibrated score adjustments based on this model yield substantial improvements in engagement and retention while reducing exposure to low-quality content, as demonstrated by large-scale A/B experiments. This work provides the first empirically validated framework linking content-level user perceptions to retention outcomes in production systems. We offer a scalable, user-centered solution that advances both platform growth and user experience. Our work has broad implications for responsible AI development.

190. DGTEN: A Robust Deep Gaussian based Graph Neural Network for Dynamic Trust Evaluation with Uncertainty-Quantification Support

Authors: Muhammad Usman , Yugyung Lee
URL: https://arxiv.org/abs/2510.07620
Abstract:

Dynamic trust evaluation in large, rapidly evolving graphs requires models that can capture changing relationships, express calibrated confidence, and resist adversarial manipulation. DGTEN (Deep Gaussian-based Trust Evaluation Network) introduces a unified graph framework that achieves all three by combining uncertainty-aware message passing, expressive temporal modeling, and built-in defenses against trust-targeted attacks. It represents nodes and edges as Gaussian distributions so that both semantic signals and epistemic uncertainty propagate through the graph neural network, enabling risk-aware trust decisions rather than overconfident guesses. To model how trust evolves, it employs hybrid Absolute-Gaussian-Hourglass (HAGH) positional encoding with Kolmogorov-Arnold network-based unbiased multi-head attention, followed by an ordinary differential equation (ODE)-based residual learning module to jointly capture abrupt shifts and smooth trends. Robust adaptive ensemble coefficient analysis prunes or down-weights suspicious interactions using complementary cosine and Jaccard similarity measures, mitigating reputation laundering, sabotage, and on/off attacks. On two signed Bitcoin trust networks, DGTEN delivers significant improvements: in single-timeslot prediction on Bitcoin-Alpha, it improves MCC by 10.77% over the best dynamic baseline; in the cold-start scenario, it achieves a 16.41% MCC gain - the largest across all tasks and datasets. Under adversarial on/off attacks, it surpasses the baseline by up to 11.63% MCC. These results validate the effectiveness of the unified DGTEN framework.
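
The two similarity measures named in the abstract can be sketched in a few lines. This is a minimal illustration of cosine and Jaccard similarity on their own; how DGTEN combines them to prune or down-weight interactions is not stated here, and the function names and inputs are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Overlap of two neighbor sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical node features and neighbor sets for two users.
cos = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
jac = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

Cosine compares feature directions while Jaccard compares shared neighbors, which is one reading of why the abstract calls them complementary.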

191. Vocabulary embeddings organize linguistic structure early in language model training

Authors: Isabel Papadimitriou , Jacob Prince
URL: https://arxiv.org/abs/2510.07613
Abstract:

Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., “the,” “of”) converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.

192. TGM: a Modular and Efficient Library for Machine Learning on Temporal Graphs

Authors: Jacob Chmura , Shenyang Huang , Tran Gia Bao Ngo , Ali Parviz , Farimah Poursafaei , Jure Leskovec , Michael Bronstein , Guillaume Rabusseau , Matthias Fey , Reihaneh Rabbany
URL: https://arxiv.org/abs/2510.07586
Abstract:

Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like PyTorch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8x speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175x speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study. TGM is available at this https URL

Authors: Mkululi Sikosana , Sean Maudsley-Barton , Oluwaseun Ajao
URL: https://arxiv.org/abs/2510.07579
Abstract:

This study conducts a computational linguistic analysis of pandemic-related online discourse to examine how language distinguishes health misinformation from factual communication. Drawing on three corpora: COVID-19 false narratives (n = 7588), general COVID-19 content (n = 10700), and Monkeypox-related posts (n = 5787), we identify significant differences in readability, rhetorical markers, and persuasive language use. COVID-19 misinformation exhibited markedly lower readability scores and contained over twice the frequency of fear-related or persuasive terms compared to the other datasets. It also showed minimal use of exclamation marks, contrasting with the more emotive style of Monkeypox content. These patterns suggest that misinformation employs a deliberately complex rhetorical style embedded with emotional cues, a combination that may enhance its perceived credibility. Our findings contribute to the growing body of work on digital health misinformation by highlighting linguistic indicators that may aid detection efforts. They also inform public health messaging strategies and theoretical models of crisis communication in networked media environments. At the same time, the study acknowledges limitations, including reliance on traditional readability indices, use of a deliberately narrow persuasive lexicon, and reliance on static aggregate analysis. Future research should therefore incorporate longitudinal designs, broader emotion lexicons, and platform-sensitive approaches to strengthen robustness.

194. Accuracy, Memory Efficiency and Generalization: A Comparative Study on Liquid Neural Networks and Recurrent Neural Networks

Authors: Shilong Zong , Alex Bierly , Almuatazbellah Boker , Hoda Eldardiry
URL: https://arxiv.org/abs/2510.07578
Abstract:

This review aims to conduct a comparative analysis of liquid neural networks (LNNs) and traditional recurrent neural networks (RNNs) and their variants, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs). The core dimensions of the analysis include model accuracy, memory efficiency, and generalization ability. By systematically reviewing existing research, this paper explores the basic principles, mathematical models, key characteristics, and inherent challenges of these neural network architectures in processing sequential data. Research findings reveal that LNN, as an emerging, biologically inspired, continuous-time dynamic neural network, demonstrates significant potential in handling noisy, non-stationary data, and achieving out-of-distribution (OOD) generalization. Additionally, some LNN variants outperform traditional RNN in terms of parameter efficiency and computational speed. However, RNN remains a cornerstone in sequence modeling due to its mature ecosystem and successful applications across various tasks. This review identifies the commonalities and differences between LNNs and RNNs, summarizes their respective shortcomings and challenges, and points out valuable directions for future research, particularly emphasizing the importance of improving the scalability of LNNs to promote their application in broader and more complex scenarios.

195. Multi-Task Pre-Finetuning of Lightweight Transformer Encoders for Text Classification and NER

Authors: Junyi Zhu , Savas Ozkan , Andrea Maracani , Sinan Mutlu , Cho Jung Min , Mete Ozay
URL: https://arxiv.org/abs/2510.07566
Abstract:

Deploying natural language processing (NLP) models on mobile platforms requires models that can adapt across diverse applications while remaining efficient in memory and computation. We investigate pre-finetuning strategies to enhance the adaptability of lightweight BERT-like encoders for two fundamental NLP task families: named entity recognition (NER) and text classification. While pre-finetuning improves downstream performance for each task family individually, we find that naïve multi-task pre-finetuning introduces conflicting optimization signals that degrade overall performance. To address this, we propose a simple yet effective multi-task pre-finetuning framework based on task-primary LoRA modules, which enables a single shared encoder backbone with modular adapters. Our approach achieves performance comparable to individual pre-finetuning while meeting practical deployment constraints. Experiments on 21 downstream tasks show average improvements of +0.8% for NER and +8.8% for text classification, demonstrating the effectiveness of our method for versatile mobile NLP applications.

196. Investigating Thematic Patterns and User Preferences in LLM Interactions using BERTopic

Authors: Abhay Bhandarkar , Gaurav Mishra , Khushi Juchani , Harsh Singhal
URL: https://arxiv.org/abs/2510.07557
Abstract:

This study applies BERTopic, a transformer-based topic modeling technique, to the lmsys-chat-1m dataset, a multilingual conversational corpus built from head-to-head evaluations of large language models (LLMs). Each user prompt is paired with two anonymized LLM responses and a human preference label, used to assess user evaluation of competing model outputs. The main objective is uncovering thematic patterns in these conversations and examining their relation to user preferences, particularly whether certain LLMs are consistently preferred within specific topics. A robust preprocessing pipeline was designed for multilingual variation, balancing dialogue turns, and cleaning noisy or redacted data. BERTopic extracted over 29 coherent topics including artificial intelligence, programming, ethics, and cloud infrastructure. We analysed relationships between topics and model preferences to identify trends in model-topic alignment. Visualization techniques included inter-topic distance maps, topic probability distributions, and model-versus-topic matrices. Our findings inform domain-specific fine-tuning and optimization strategies for improving real-world LLM performance and user satisfaction.

197. Label Semantics for Robust Hyperspectral Image Classification

Authors: Rafin Hassan , Zarin Tasnim Roshni , Rafiqul Bari , Alimul Islam , Nabeel Mohammed , Moshiur Farazi , Shafin Rahman
URL: https://arxiv.org/abs/2510.07556
Abstract:

Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most HSI classification models are monomodal, relying solely on spectral-spatial data to learn decision boundaries in the high-dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class-specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that capture their unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics, which in turn leads to better feature-label alignment and improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and report a significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Code is available at: this https URL

198. TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Authors: Saman Motamed , Minghao Chen , Luc Van Gool , Iro Laina
URL: https://arxiv.org/abs/2510.07550
Abstract:

Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.

199. OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs

Authors: Jaeseong Lee , seung-won hwang , Aurick Qiao , Gabriele Oliaro , Ye Wang , Samyam Rajbhandari
URL: https://arxiv.org/abs/2510.07535
Abstract:

Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces a richer representation for the drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.
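
Acceptance length is the quantity the abstract compares. A minimal sketch, assuming greedy verification in which a drafted token is accepted only if it matches the verifier's own choice at that position; the function name and token IDs are hypothetical, this is not OWL's hybrid tree/non-tree algorithm, and some papers additionally count the verifier's bonus token.

```python
def accepted_draft_tokens(draft_tokens, verifier_tokens):
    """Number of leading draft tokens that match the verifier's tokens."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            break  # the first mismatch rejects the rest of the draft
        accepted += 1
    return accepted

# Hypothetical token IDs: the third drafted token disagrees with the verifier.
n_accepted = accepted_draft_tokens([5, 9, 2, 7], [5, 9, 4, 7])
```

A longer accepted prefix means more tokens are produced per verifier forward pass, which is where any speedup comes from.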

200. EEG Sleep Stage Classification with Continuous Wavelet Transform and Deep Learning

Authors: Mehdi Zekriyapanah Gashti , Ghasem Farjamnia
URL: https://arxiv.org/abs/2510.07524
Abstract:

Accurate classification of sleep stages is crucial for the diagnosis and management of sleep disorders. Conventional approaches for sleep scoring rely on manual annotation or features extracted from EEG signals in the time or frequency domain. This study proposes a novel framework for automated sleep stage scoring using time-frequency analysis based on the wavelet transform. The Sleep-EDF Expanded Database (sleep-cassette recordings) was used for evaluation. The continuous wavelet transform (CWT) generated time-frequency maps that capture both transient and oscillatory patterns across frequency bands relevant to sleep staging. Experimental results demonstrate that the proposed wavelet-based representation, combined with ensemble learning, achieves an overall accuracy of 88.37 percent and a macro-averaged F1 score of 73.15, outperforming conventional machine learning methods and exhibiting comparable or superior performance to recent deep learning approaches. These findings highlight the potential of wavelet analysis for robust, interpretable, and clinically applicable sleep stage classification.

201. MLLM4TS: Leveraging Vision and Multimodal Language Models for General Time-Series Analysis

Authors: Qinghua Liu , Sam Heshmati , Zheda Mai , Zubin Abraham , John Paparrizos , Liu Ren
URL: https://arxiv.org/abs/2510.07513
Abstract:

Effective analysis of time series data presents significant challenges due to the complex temporal dependencies and cross-channel interactions in multivariate data. Inspired by the way human analysts visually inspect time series to uncover hidden patterns, we ask: can incorporating visual representations enhance automated time-series analysis? Recent advances in multimodal large language models have demonstrated impressive generalization and visual understanding capability, yet their application to time series remains constrained by the modality gap between continuous numerical data and discrete natural language. To bridge this gap, we introduce MLLM4TS, a novel framework that leverages multimodal large language models for general time-series analysis by integrating a dedicated vision branch. Each time-series channel is rendered as a horizontally stacked color-coded line plot in one composite image to capture spatial dependencies across channels, and a temporal-aware visual patch alignment strategy then aligns visual patches with their corresponding time segments. MLLM4TS fuses fine-grained temporal details from the numerical data with global contextual information derived from the visual representation, providing a unified foundation for multimodal time-series analysis. Extensive experiments on standard benchmarks demonstrate the effectiveness of MLLM4TS across both predictive tasks (e.g., classification) and generative tasks (e.g., anomaly detection and forecasting). These results underscore the potential of integrating visual modalities with pretrained language models to achieve robust and generalizable time-series analysis.

202. When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

Authors: Soyeong Jeong , Taehee Jung , Sung Ju Hwang , Joo-Kyung Kim , Dongyeop Kang
URL: https://arxiv.org/abs/2510.07499
Abstract:

Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).

203. Can Speech LLMs Think while Listening?

Authors: Yi-Jen Shih , Desh Raj , Chunyang Wu , Wei Zhou , SK Bong , Yashesh Gaur , Jay Mahadeokar , Ozlem Kalinli , Mike Seltzer
URL: https://arxiv.org/abs/2510.07497
Abstract:

Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of “thinking while listening,” we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, “question completeness,” which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency Pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.

204. A Denoising Framework for Real-World Ultra-Low Dose Lung CT Images Based on an Image Purification Strategy

Authors: Guoliang Gong , Man Yu
URL: https://arxiv.org/abs/2510.07492
Abstract:

Ultra-low dose CT (uLDCT) significantly reduces radiation exposure but introduces severe noise and artifacts. It also leads to substantial spatial misalignment between uLDCT and normal dose CT (NDCT) image pairs. This poses challenges for directly applying existing denoising networks trained on synthetic noise or aligned data. To address this core challenge in uLDCT denoising, this paper proposes an innovative denoising framework based on an Image Purification (IP) strategy. First, we construct a real clinical uLDCT lung dataset. Then, we propose an Image Purification strategy that generates structurally aligned uLDCT-NDCT image pairs, providing a high-quality data foundation for network training. Building upon this, we propose a Frequency-domain Flow Matching (FFM) model, which works synergistically with the IP strategy to excellently preserve the anatomical structure integrity of denoised images. Experiments on the real clinical dataset demonstrate that our IP strategy significantly enhances the performance of multiple mainstream denoising models on the uLDCT task. Notably, our proposed FFM model combined with the IP strategy achieves state-of-the-art (SOTA) results in anatomical structure preservation. This study provides an effective solution to the data mismatch problem in real-world uLDCT denoising. Code and dataset are available at this https URL .

205. Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics

Authors: Rasika Muralidharan , Jaewoon Kwak , Jisun An
URL: https://arxiv.org/abs/2510.07488
Abstract:

Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet few studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.

206. HEMERA: A Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data

Authors: Maria Mahbub , Robert J. Klein , Myvizhi Esai Selvan , Rowena Yip , Claudia Henschke , Providencia Morales , Ian Goethert , Olivera Kotevska , Mayanka Chandra Shekar , Sean R. Wilkinson , Eileen McAllister , Samuel M. Aguayo , Zeynep H. Gümüş , Ioana Danciu , VA Million Veteran Program
URL: https://arxiv.org/abs/2510.07477
Abstract:

Lung cancer (LC) is the third most common cancer and the leading cause of cancer deaths in the US. Although smoking is the primary risk factor, the occurrence of LC in never-smokers and familial aggregation studies highlight a genetic component. Genetic biomarkers identified through genome-wide association studies (GWAS) are promising tools for assessing LC risk. We introduce HEMERA (Human-Explainable Transformer Model for Estimating Lung Cancer Risk using GWAS Data), a new framework that applies explainable transformer-based deep learning to GWAS data of single nucleotide polymorphisms (SNPs) for predicting LC risk. Unlike prior approaches, HEMERA directly processes raw genotype data without clinical covariates, introducing additive positional encodings, neural genotype embeddings, and refined variant filtering. A post hoc explainability module based on Layer-wise Integrated Gradients enables attribution of model predictions to specific SNPs, aligning strongly with known LC risk loci. Trained on data from 27,254 Million Veteran Program participants, HEMERA achieved a >99% AUC (area under the receiver operating characteristic curve) score. These findings support transparent, hypothesis-generating models for personalized LC risk assessment and early intervention.

207. MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting

Authors: Yoli Shavit , Jacob Goldberger
URL: https://arxiv.org/abs/2510.07459
Abstract:

We introduce Mixture-of-Gaussians with Uncertainty-based Gating (MoGU), a novel Mixture-of-Experts (MoE) framework designed for regression tasks and applied to time series forecasting. Unlike conventional MoEs that provide only point estimates, MoGU models each expert’s output as a Gaussian distribution. This allows it to directly quantify both the forecast (the mean) and its inherent uncertainty (variance). MoGU’s core innovation is its uncertainty-based gating mechanism, which replaces the traditional input-based gating network by using each expert’s estimated variance to determine its contribution to the final prediction. Evaluated across diverse time series forecasting benchmarks, MoGU consistently outperforms single-expert models and traditional MoE setups. It also provides well-quantified, informative uncertainties that directly correlate with prediction errors, enhancing forecast reliability. Our code is available from: this https URL
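
The idea of gating on each expert's own variance can be sketched as follows. The abstract does not give the gating function, so normalized inverse-variance weights are an assumption made for illustration, and the function name and numbers are hypothetical.

```python
def variance_gated_mixture(means, variances):
    """Combine per-expert Gaussian forecasts using inverse-variance gate
    weights (an assumed gating rule, not necessarily MoGU's)."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    gates = [p / total for p in precisions]
    mean = sum(g * m for g, m in zip(gates, means))
    # Mixture variance = expected expert variance + spread of expert means.
    second_moment = sum(g * (v + m * m) for g, m, v in zip(gates, means, variances))
    return gates, mean, second_moment - mean * mean

# Two hypothetical experts: the first is more confident (lower variance).
gates, mix_mean, mix_var = variance_gated_mixture([10.0, 14.0], [1.0, 4.0])
```

Under this assumed rule, the more confident expert receives the larger gate weight, and the mixture variance grows when the experts disagree.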

208. Minimizing the Value-at-Risk of Loan Portfolio via Deep Neural Networks

Authors: Albert Di Wang , Ye Du
URL: https://arxiv.org/abs/2510.07444
Abstract:

Risk management is a prominent issue in peer-to-peer lending. An investor may naturally reduce his risk exposure by diversifying instead of putting all his money on one loan. In that case, an investor may want to minimize the Value-at-Risk (VaR) or Conditional Value-at-Risk (CVaR) of his loan portfolio. We propose a low degree of freedom deep neural network model, DeNN, as well as a high degree of freedom model, DSNN, to tackle the problem. In particular, our models predict not only the default probability of a loan but also the time when it will default. The experiments demonstrate that both models can significantly reduce the portfolio VaRs at different confidence levels, compared to benchmarks. More interestingly, the low degree of freedom model, DeNN, outperforms DSNN in most scenarios.
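
For readers unfamiliar with the two risk measures, a minimal sketch of their empirical estimates from a sample of portfolio losses follows. These are common textbook estimators rather than the paper's DeNN or DSNN models, and the function names and loss values are hypothetical.

```python
import math

def value_at_risk(losses, alpha):
    """Empirical VaR: the smallest sample loss such that at least a fraction
    alpha of the sample losses do not exceed it."""
    ordered = sorted(losses)
    k = math.ceil(alpha * len(ordered)) - 1
    return ordered[max(k, 0)]

def conditional_value_at_risk(losses, alpha):
    """Mean of the sample losses at or beyond the VaR threshold."""
    threshold = value_at_risk(losses, alpha)
    tail = [loss for loss in losses if loss >= threshold]
    return sum(tail) / len(tail)

# Hypothetical losses of a loan portfolio across ten simulated scenarios.
losses = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 20.0, 50.0]
var_80 = value_at_risk(losses, 0.8)
cvar_80 = conditional_value_at_risk(losses, 0.8)
```

VaR reports a single loss threshold at the chosen confidence level, while CVaR averages the losses in the tail beyond it and is therefore sensitive to how bad the worst scenarios are.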

209. LASER: An LLM-based ASR Scoring and Evaluation Rubric

Authors: Amruta Parulekar , Preethi Jyothi
URL: https://arxiv.org/abs/2510.07437
Abstract:

Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric LASER that leverages state-of-the-art LLMs’ in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.
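
As a reference point for the metric LASER is contrasted with, WER can be sketched as word-level edit distance divided by the reference length. The function name and the English example sentence are hypothetical, and the sketch assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# A small morphological difference costs a full word error.
wer = word_error_rate("she is going to school", "she is go to school")
```

Every mismatched word counts the same regardless of how much it changes the meaning, which is the behavior the abstract argues is unfair to morphological and syntactic variation.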

210. Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Authors: Mufei Li , Dongqi Fu , Limei Wang , Si Zhang , Hanqing Zeng , Kaan Sancak , Ruizhong Qiu , Haoyu Wang , Xiaoxin He , Xavier Bresson , Yinglong Xia , Chonglin Sun , Pan Li
URL: https://arxiv.org/abs/2510.07414
Abstract:

Modern long-context large language models (LLMs) perform well on synthetic “needle-in-a-haystack” (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors – distraction from heterogeneous biased retrievers and cascading errors in agentic workflows – to test models’ long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.

211. Quantum Grid Path Planning Using Parallel QAOA Circuits Based on Minimum Energy Principle

Authors: Jun Liu
URL: https://arxiv.org/abs/2510.07413
Abstract:

To overcome the bottleneck of classical path planning schemes in solving NP problems and address the predicament faced by current mainstream quantum path planning frameworks in the Noisy Intermediate-Scale Quantum (NISQ) era, this study attempts to construct a quantum path planning solution based on parallel Quantum Approximate Optimization Algorithm (QAOA) architecture. Specifically, the grid path planning problem is mapped to the problem of finding the minimum quantum energy state. Two parallel QAOA circuits are built to simultaneously execute two solution processes, namely connectivity energy calculation and path energy calculation. A classical algorithm is employed to filter out unreasonable solutions of connectivity energy, and finally, the approximate optimal solution to the path planning problem is obtained by merging the calculation results of the two parallel circuits. The research findings indicate that by setting appropriate filter parameters, quantum states corresponding to position points with extremely low occurrence probabilities can be effectively filtered out, thereby increasing the probability of obtaining the target quantum state. Even when the circuit layer number p is only 1, the theoretical solution of the optimal path coding combination can still be found by leveraging the critical role of the filter. Compared with serial circuits, parallel circuits exhibit a significant advantage, as they can find the optimal feasible path coding combination with the highest probability.

212. Attention to Order: Transformers Discover Phase Transitions via Learnability

Authors: Şener Özönder
URL: https://arxiv.org/abs/2510.07401
Abstract:

Phase transitions mark qualitative reorganizations of collective behavior, yet identifying their boundaries remains challenging whenever analytic solutions are absent and conventional simulations fail. Here we introduce learnability as a universal criterion, defined as the ability of a transformer model with an attention mechanism to extract structure from microscopic states. Using self-supervised learning and Monte Carlo generated configurations of the two-dimensional Ising model, we show that ordered phases correspond to enhanced learnability, manifested in both reduced training loss and structured attention patterns, while disordered phases remain resistant to learning. Two unsupervised diagnostics, the sharp jump in training loss and the rise in attention entropy, recover the critical temperature in excellent agreement with the exact value. Our results establish learnability as a data-driven marker of phase transitions and highlight deep parallels between long-range order in condensed matter and the emergence of structure in modern language models.
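As background for the data-generation step, here is a minimal sketch of Metropolis Monte Carlo sampling for the two-dimensional Ising model on a periodic square lattice, together with the exact critical temperature T_c = 2 / ln(1 + sqrt(2)) in units where J = k_B = 1. The abstract does not specify which Monte Carlo scheme was used, so Metropolis single-spin flips are an assumption, and the lattice size and sweep counts are arbitrary.

```python
import math
import random

# Exact critical temperature of the 2D square-lattice Ising model (J = k_B = 1).
T_CRITICAL = 2.0 / math.log(1.0 + math.sqrt(2.0))


def metropolis_sweep(spins, temperature, rng):
    """Attempt one single-spin flip per lattice site, on average."""
    n = len(spins)
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        # Sum of the four nearest neighbors with periodic boundaries.
        neighbors = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                     + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
        delta_e = 2.0 * spins[i][j] * neighbors  # energy change if this spin flips
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
            spins[i][j] = -spins[i][j]


def sample_configuration(size, temperature, sweeps, seed=0):
    """Start from the all-up state and run the given number of Metropolis sweeps."""
    rng = random.Random(seed)
    spins = [[1] * size for _ in range(size)]
    for _ in range(sweeps):
        metropolis_sweep(spins, temperature, rng)
    return spins


def magnetization(spins):
    """Mean spin value, between -1 and 1."""
    n = len(spins)
    return sum(sum(row) for row in spins) / (n * n)


if __name__ == "__main__":
    for t in (1.0, T_CRITICAL, 4.0):
        config = sample_configuration(size=16, temperature=t, sweeps=200)
        print(f"T = {t:.3f}  |m| = {abs(magnetization(config)):.3f}")
```

Well below T_CRITICAL the lattice stays ordered, and well above it the magnetization decays toward zero. Configurations of this kind are what a model would be trained on.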

213. Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts

Authors: Yeskendir Koishekenov , Aldo Lipani , Nicola Cancedda
URL: https://arxiv.org/abs/2510.07358
Abstract:

Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of thought. Motivated by interpretability studies showing that the crucial computation required for reasoning tasks is concentrated in a limited range of layers, we introduce Encode-Think-Decode (ETD), a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD amplifies latent reasoning while preserving the original architecture, parameter count, hyperparameters, and training data composition. When iterating on the selected layers at inference time, ETD models yield substantial gains on 17 reasoning benchmarks, including +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model. We also explore an adaptive depth strategy that adjusts the computation per input token. Our results show that recursive latent reasoning offers a simple and effective path to stronger LLM reasoning.
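The control flow described above can be sketched with plain callables standing in for transformer layers: run the early layers once, loop over the selected middle block several times, then run the remaining layers once. The names `encoder`, `thinker`, and `decoder`, the toy arithmetic layers, and the fixed iteration count are all illustrative assumptions. The paper's layer selection, mid-training procedure, and adaptive depth strategy are not reproduced here.

```python
def encode_think_decode(x, encoder, thinker, decoder, num_iterations):
    """Apply encoder layers once, the thinker block num_iterations times, then decoder layers once."""
    if num_iterations < 1:
        raise ValueError("num_iterations must be at least 1")
    for layer in encoder:
        x = layer(x)
    # The same middle layers (same weights) are reused on every pass.
    for _ in range(num_iterations):
        for layer in thinker:
            x = layer(x)
    for layer in decoder:
        x = layer(x)
    return x


if __name__ == "__main__":
    # Toy stand-ins for layers: simple functions on a number.
    encoder = [lambda v: v + 1]
    thinker = [lambda v: v * 2]
    decoder = [lambda v: v - 3]
    for k in (1, 2, 3):
        print(k, encode_think_decode(1, encoder, thinker, decoder, k))
```

Because the middle block is reused rather than duplicated, extra iterations add compute without adding parameters, which matches the abstract's claim that the parameter count is preserved.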

214. Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model

Authors: Danush Kumar Venkatesh , Adam Schmidt , Muhammad Abdullah Jamal , Omid Mohareri
URL: https://arxiv.org/abs/2510.07345
Abstract:

Surgical video datasets are essential for scene understanding, enabling procedural modeling and intra-operative support. However, these datasets are often heavily imbalanced, with rare actions and tools under-represented, which limits the robustness of downstream models. We address this challenge with SurgiFlowVid, a sparse and controllable video diffusion framework for generating surgical videos of under-represented classes. Our approach introduces a dual-prediction diffusion module that jointly denoises RGB frames and optical flow, providing temporal inductive biases to improve motion modeling from limited samples. In addition, a sparse visual encoder conditions the generation process on lightweight signals (e.g., sparse segmentation masks or RGB frames), enabling controllability without dense annotations. We validate our approach on three surgical datasets across tasks including action recognition, tool presence detection, and laparoscope motion prediction. Synthetic data generated by our method yields consistent gains of 10-20% over competitive baselines, establishing SurgiFlowVid as a promising strategy to mitigate data imbalance and advance surgical video understanding methods.

215. Local MAP Sampling for Diffusion Models

Authors: Shaorong Zhang , Rob Brekelmans , Greg Ver Steeg
URL: https://arxiv.org/abs/2510.07343
Abstract:

Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. However, in practice, the goal of inverse problem solving is not to cover the posterior but to recover the most accurate reconstruction, where optimization-based diffusion solvers often excel despite lacking a clear probabilistic foundation. We introduce Local MAP Sampling (LMAPS), a new inference framework that iteratively solves local MAP subproblems along the diffusion trajectory. This perspective clarifies their connection to global MAP estimation and DPS, offering a unified probabilistic interpretation for optimization-based methods. Building on this foundation, we develop practical algorithms with a probabilistically interpretable covariance approximation, a reformulated objective for stability and interpretability, and a gradient approximation for non-differentiable operators. Across a broad set of image restoration and scientific tasks, LMAPS achieves state-of-the-art performance, including $\geq 2$ dB gains on motion deblurring, JPEG restoration, and quantization, and $>1.5$ dB improvements on inverse scattering benchmarks.

216. MultiFair: Multimodal Balanced Fairness-Aware Medical Classification with Dual-Level Gradient Modulation

Authors: Md Zubair , Hao Zheng , Nussdorf Jonathan , Grayson W. Armstrong , Lucy Q. Shen , Gabriela Wilson , Yu Tian , Xingquan Zhu , Min Shi
URL: https://arxiv.org/abs/2510.07328
Abstract:

Medical decision systems increasingly rely on data from multiple sources to ensure reliable and unbiased diagnosis. However, existing multimodal learning models fail to achieve this goal because they often ignore two critical challenges. First, various data modalities may learn unevenly, thereby converging to a model biased towards certain modalities. Second, the model may emphasize learning on certain demographic groups causing unfair performances. The two aspects can influence each other, as different data modalities may favor respective groups during optimization, leading to both imbalanced and unfair multimodal learning. This paper proposes a novel approach called MultiFair for multimodal medical classification, which addresses these challenges with a dual-level gradient modulation process. MultiFair dynamically modulates training gradients regarding the optimization direction and magnitude at both data modality and group levels. We conduct extensive experiments on two multimodal medical datasets with different demographic groups. The results show that MultiFair outperforms state-of-the-art multimodal learning and fairness learning methods.

217. Deep Learning Based Approach to Enhanced Recognition of Emotions and Behavioral Patterns of Autistic Children

Authors: Nelaka K.A.R , Peiris M.K.V , Liyanage R.P.B
URL: https://arxiv.org/abs/2510.07320
Abstract:

Autism Spectrum Disorder significantly influences the communication abilities, learning processes, behavior, and social interactions of individuals. Although early intervention and customized educational strategies are critical to improving outcomes, there is a pivotal gap in understanding and addressing nuanced behavioral patterns and emotional identification in autistic children prior to skill development. This extended research delves into the foundational step of recognizing and mapping these patterns as a prerequisite to improving learning and soft skills. Using a longitudinal approach to monitor emotions and behaviors, this study aims to establish a baseline understanding of the unique needs and challenges faced by autistic students, particularly in the Information Technology domain, where opportunities are markedly limited. Through a detailed analysis of behavioral trends over time, we propose a targeted framework for developing applications and technical aids designed to meet these identified needs. Our research underscores the importance of a sequential and evidence-based intervention approach that prioritizes a deep understanding of each child’s behavioral and emotional landscape as the basis for effective skill development. By shifting the focus toward early identification of behavioral patterns, we aim to foster a more inclusive and supportive learning environment that can significantly improve the educational and developmental trajectory of children with ASD.

218. DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning

Authors: Md. Saiful Bari Siddiqui , Md Mohaiminul Islam , Md. Golam Rabiul Alam
URL: https://arxiv.org/abs/2411.15876
Abstract:

Overfitting remains a significant challenge in deep learning, often arising from data outliers, noise, and limited training data. To address this, the Divide2Conquer (D2C) method was previously proposed, which partitions training data into multiple subsets and trains identical models independently on each. This strategy enables learning more consistent patterns while minimizing the influence of individual outliers and noise. However, D2C’s standard aggregation typically treats all subset models equally or based on fixed heuristics (like data size), potentially underutilizing information about their varying generalization capabilities. Building upon this foundation, we introduce Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C), an advanced technique that refines the aggregation process. DUA-D2C dynamically weights the contributions of subset models based on their performance on a shared validation set, considering both accuracy and prediction uncertainty. This intelligent aggregation allows the central model to preferentially learn from subsets yielding more generalizable and confident edge models, thereby more effectively combating overfitting. Empirical evaluations on benchmark datasets spanning multiple domains demonstrate that DUA-D2C significantly improves generalization. Our analysis includes evaluations of decision boundaries, loss curves, and other performance metrics, highlighting the effectiveness of DUA-D2C. This study demonstrates that DUA-D2C improves generalization performance even when applied on top of other regularization methods, establishing it as a theoretically grounded and effective approach to combating overfitting in modern deep learning. Our code is publicly available at: this https URL.

All AI Papers - 2025-10-10

1. How to Teach Large Multimodal Models New Skills

2. Agent Learning via Early Experience

3. FlowSearch: Advancing deep research with dynamic structured knowledge flow

4. CaRT: Teaching LLM Agents to Know When They Know Enough

5. AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents

6. Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

7. Revisiting Hallucination Detection with Effective Rank-based Uncertainty

8. QAgent: A modular Search Agent with Interactive Query Understanding

9. LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings

10. Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

11. First Try Matters: Revisiting the Role of Reflection in Reasoning Models

12. Symmetry-Aware Fully-Amortized Optimization with Scale Equivariant Graph Metanetworks

13. Co-TAP: Three-Layer Agent Interaction Protocol Technical Report

14. Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness

15. Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens

16. DODO: Causal Structure Learning with Budgeted Interventions

17. The Tournament Tree Method for preference elicitation in Multi-criteria decision-making

18. Measuring What Matters: The AI Pluralism Index

19. R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

20. Prepared mind, fast response: A temporal decoupling framework for adaptive knowledge orchestration in open-domain dialogue

21. Can Risk-taking AI-Assistants suitably represent entities

22. From Ethical Declarations to Provable Independence: An Ontology-Driven Optimal-Transport Framework for Certifiably Fair AI Systems

23. AutoQual: An LLM Agent for Automated Discovery of Interpretable Features for Review Quality Assessment

24. Multi-Condition Conformal Selection

25. LinguaSim: Interactive Multi-Vehicle Testing Scenario Generation via Natural Language Instruction Based on Large Language Models

26. AILoRA: Function-Aware Asymmetric Initialization for Low-Rank Adaptation of Large Language Models

27. PEAR: Phase Entropy Aware Reward for Efficient Reasoning

28. Language Models Do Not Embed Numbers Continuously

29. ReInAgent: A Context-Aware GUI Agent Enabling Human-in-the-Loop Mobile Task Navigation

30. VoiceAgentBench: Are Voice Assistants ready for agentic tasks?

31. TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

32. Agent-Based Genetic Algorithm for Crypto Trading Strategy Optimization

33. Enabling Personalized Long-term Interactions in LLM-based Agents through Persistent Memory and User Profiles

34. Profit Mirage: Revisiting Information Leakage in LLM-based Financial Agents

35. Towards Meaningful Transparency in Civic AI Systems

36. Understanding DeepResearch via Reports

37. Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models

38. FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning

39. An LLM-Powered Cooperative Framework for Large-Scale Multi-Vehicle Navigation

40. Strategic Communication under Threat: Learning Information Trade-offs in Pursuit-Evasion Games

41. GCPO: When Contrast Fails, Go Gold

42. An approach for systematic decomposition of complex llm tasks

43. From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation

44. Haibu Mathematical-Medical Intelligent Agent:Enhancing Large Language Model Reliability in Medical Tasks via Verifiable Reasoning Chains

45. SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation

46. oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

47. Control Synthesis of Cyber-Physical Systems for Real-Time Specifications through Causation-Guided Reinforcement Learning

48. Multimodal Safety Evaluation in Generative Agent Social Simulations

49. Safely Exploring Novel Actions in Recommender Systems via Deployment-Efficient Policy Learning

50. Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

51. A Case for Leveraging Generative AI to Expand and Enhance Training in the Provision of Mental Health Services

52. Traceability and Accountability in Role-Specialized Multi-Agent LLM Pipelines

53. AgentAsk: Multi-Agent Systems Need to Ask

54. Benchmarking is Broken - Don’t Let AI be its Own Judge

55. An Evaluation Study of Hybrid Methods for Multilingual PII Detection

56. Measuring and Mitigating Identity Bias in Multi-Agent Debate via Anonymization

57. CompassLLM: A Multi-Agent Approach toward Geo-Spatial Reasoning for Popular Path Query

58. Optimizing Ethical Risk Reduction for Medical Intelligent Systems with Constraint Programming

59. Evaluation of LLMs for Process Model Analysis and Optimization

60. ExpertAgent: Enhancing Personalized Education through Dynamic Planning and Retrieval-Augmented Long-Chain Reasoning

61. TS-Agent: A Time Series Reasoning Agent with Iterative Statistical Insight Gathering

62. Less is More: Strategic Expert Selection Outperforms Ensemble Complexity in Traffic Forecasting

63. ProSEA: Problem Solving via Exploration Agents

64. Position: AI Will Transform Neuropsychology Through Mental Health Digital Twins for Dynamic Mental Health Care, Especially for ADHD

65. Base Models Know How to Reason, Thinking Models Learn When

66. L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)

67. Truth-Aware Decoding: A Program-Logic Approach to Factual Language Generation

68. BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation

69. ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation

70. NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos

71. MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

72. SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

73. Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

74. VideoNorms: Benchmarking Cultural Awareness of Video Language Models

75. On the optimization dynamics of RLVR: Gradient gap and step size thresholds

76. Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing

77. SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

78. CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

79. To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models