전체 AI 논문 - 2025-10-03
1. BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
- Authors: Chenqi Li , Yu Liu , Timothy Denison , Tingting Zhu
- URL: https://arxiv.org/abs/2510.02276
- Abstract:
Biosignals offer valuable insights into the physiological states of the human body. Although biosignal modalities differ in functionality, signal fidelity, sensor comfort, and cost, they are often intercorrelated, reflecting the holistic and interconnected nature of human physiology. This opens up the possibility of performing the same tasks using alternative biosignal modalities, thereby improving the accessibility, usability, and adaptability of health monitoring systems. However, the limited availability of large labeled datasets presents challenges for training models tailored to specific tasks and modalities of interest. Unsupervised cross-modal knowledge transfer offers a promising solution by leveraging knowledge from an existing modality to support model training for a new modality. Existing methods are typically based on knowledge distillation, which requires running a teacher model alongside student model training, resulting in high computational and memory overhead. This challenge is further exacerbated by the recent development of foundation models that demonstrate superior performance and generalization across tasks at the cost of large model sizes. To this end, we explore a new framework for unsupervised cross-modal knowledge transfer of biosignals by training a lightweight bridge network to align the intermediate representations and enable information flow between foundation models and across modalities. Specifically, we introduce an efficient strategy for selecting alignment positions where the bridge should be constructed, along with a flexible prototype network as the bridge architecture. Extensive experiments across multiple biosignal modalities, tasks, and datasets show that BioX-Bridge reduces the number of trainable parameters by 88–99\% while maintaining or even improving transfer performance compared to state-of-the-art methods.
2. RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
- Authors: Yuxiao Qu , Anikait Singh , Yoonho Lee , Amrith Setlur , Ruslan Salakhutdinov , Chelsea Finn , Aviral Kumar
- URL: https://arxiv.org/abs/2510.02263
- Abstract:
Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement “algorithmic procedures” that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To address more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to be capable of proposing multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.
3. The Unreasonable Effectiveness of Scaling Agents for Computer Use
- Authors: Gonzalo Gonzalez-Pumariega , Vincent Tu , Chih-Lun Lee , Jiachen Yang , Ang Li , Xin Eric Wang
- URL: https://arxiv.org/abs/2510.02250
- Abstract:
Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their unreliability and high variance hinder their application to long-horizon, complex tasks. We introduce Behavior Best-of-N (bBoN), a method that scales over agents by generating multiple rollouts and selecting among them using behavior narratives that describe the agents’ rollouts. It enables both wide exploration and principled trajectory selection, substantially improving robustness and success rates. On OSWorld, our bBoN scaling method establishes a new state of the art (SoTA) at 69.9%, significantly outperforming prior methods and approaching human-level performance at 72%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the unreasonable effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and bBoN provides a practical framework to achieve this.
4. The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
- Authors: Phuc Minh Nguyen , Chinh D. La , Duy M. H. Nguyen , Nitesh V. Chawla , Binh T. Nguyen , Khoa D. Doan
- URL: https://arxiv.org/abs/2510.02230
- Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models’ reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates the shrinkage issue of RLVR by analyzing its learning dynamics and reveals two critical phenomena that explain this failure. First, we expose negative interference in RLVR, where learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to the decline of Pass@$k$ performance, or the probability of generating a correct solution within $k$ attempts. Second, we uncover the winner-take-all phenomenon: RLVR disproportionately reinforces problems with high likelihood, correct solutions, under the base model, while suppressing other initially low-likelihood ones. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvement in Pass@$k$ performance. Our code is available at this https URL .
5. UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models
- Authors: Yuhao Sun , Zhuoer Xu , Shiwen Cui , Kun Yang , Lingyun Yu , Yongdong Zhang , Hongtao Xie
- URL: https://arxiv.org/abs/2510.02194
- Abstract:
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques – including external guardrails, inference-time guidance, and post-training alignment – each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base model, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
6. A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports
- Authors: Yang Yao , Yixu Wang , Yuxuan Zhang , Yi Lu , Tianle Gu , Lingyu Li , Dingyi Zhao , Keming Wu , Haozhe Wang , Ping Nie , Yan Teng , Yingchun Wang
- URL: https://arxiv.org/abs/2510.02190
- Abstract:
Artificial intelligence is undergoing the paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) systematically exhibit the capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
7. FlexDoc: Parameterized Sampling for Diverse Multilingual Synthetic Documents for Training Document Understanding Models
- Authors: Karan Dua , Hitesh Laxmichand Patel , Puneet Mittal , Ranjeet Gupta , Amit Agarwal , Praneet Pabolu , Srikant Panda , Hansa Meghwani , Graham Horwood , Fahad Shah
- URL: https://arxiv.org/abs/2510.02133
- Abstract:
Developing document understanding models at enterprise scale requires large, diverse, and well-annotated datasets spanning a wide range of document types. However, collecting such data is prohibitively expensive due to privacy constraints, legal restrictions, and the sheer volume of manual annotation needed - costs that can scale into millions of dollars. We introduce FlexDoc, a scalable synthetic data generation framework that combines Stochastic Schemas and Parameterized Sampling to produce realistic, multilingual semi-structured documents with rich annotations. By probabilistically modeling layout patterns, visual structure, and content variability, FlexDoc enables the controlled generation of diverse document variants at scale. Experiments on Key Information Extraction (KIE) tasks demonstrate that FlexDoc-generated data improves the absolute F1 Score by up to 11% when used to augment real datasets, while reducing annotation effort by over 90% compared to traditional hard-template methods. The solution is in active deployment, where it has accelerated the development of enterprise-grade document understanding models while significantly reducing data acquisition and annotation costs.
8. Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
- Authors: Claas Beger , Ryan Yi , Shuhao Fu , Arseny Moskvichev , Sarah W. Tsai , Sivasankaran Rajamanickam , Melanie Mitchell
- URL: https://arxiv.org/abs/2510.02125
- Abstract:
OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions that the task creators intended? We investigate models’ abstraction abilities on ConceptARC. We evaluate models under settings that vary the input modality (textual vs. visual), whether the model is permitted to use external Python tools, and, for reasoning models, the amount of reasoning effort. In addition to measuring output accuracy, we perform fine-grained evaluation of the natural-language rules that models generate to explain their solutions. This dual evaluation lets us assess whether models solve tasks using the abstractions ConceptARC was designed to elicit, rather than relying on surface-level patterns. Our results show that, while some models using text-based representations match human output accuracy, the best models’ rules are often based on surface-level ``shortcuts’’ and capture intended abstractions far less often than humans. Thus their capabilities for general abstract reasoning may be overestimated by evaluations based on accuracy alone. In the visual modality, AI models’ output accuracy drops sharply, yet our rule-level analysis reveals that models might be underestimated, as they still exhibit a substantial share of rules that capture intended abstractions, but are often unable to correctly apply these rules. In short, our results show that models still lag humans in abstract reasoning, and that using accuracy alone to evaluate abstract reasoning on ARC-like tasks may overestimate abstract-reasoning capabilities in textual modalities and underestimate it in visual modalities. We believe that our evaluation framework offers a more faithful picture of multimodal models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.
9. Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning
- Authors: Xinyuan Song , Keyu Wang , PengXiang Li , Lu Yin , Shiwei Liu
- URL: https://arxiv.org/abs/2510.02091
- Abstract:
Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers – yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
10. ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection
- Authors: Sanghyu Yoon , Dongmin Kim , Suhee Yoon , Ye Seul Sim , Seungdong Yoa , Hye-Seung Cho , Soonyoung Lee , Hankook Lee , Woohyung Lim
- URL: https://arxiv.org/abs/2510.02060
- Abstract:
In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domain-specific context. However, existing benchmarks provide only raw data points without semantic context, overlooking rich textual metadata such as feature descriptions and domain knowledge that experts rely on in practice. This limitation restricts research flexibility and prevents models from fully leveraging domain knowledge for detection. ReTabAD addresses this gap by restoring textual semantics to enable context-aware tabular AD research. We provide (1) 20 carefully curated tabular datasets enriched with structured textual metadata, together with implementations of state-of-the-art AD algorithms including classical, deep learning, and LLM-based approaches, and (2) a zero-shot LLM framework that leverages semantic context without task-specific training, establishing a strong baseline for future research. Furthermore, this work provides insights into the role and utility of textual metadata in AD through experiments and analysis. Results show that semantic context improves detection performance and enhances interpretability by supporting domain-aware reasoning. These findings establish ReTabAD as a benchmark for systematic exploration of context-aware AD.
11. Zero-shot reasoning for simulating scholarly peer-review
- Authors: Khalid M. Saqr
- URL: https://arxiv.org/abs/2510.02027
- Abstract:
The scholarly publishing ecosystem faces a dual crisis of unmanageable submission volumes and unregulated AI, creating an urgent need for new governance models to safeguard scientific integrity. The traditional human-only peer review regime lacks a scalable, objective benchmark, making editorial processes opaque and difficult to audit. Here we investigate a deterministic simulation framework that provides the first stable, evidence-based standard for evaluating AI-generated peer review reports. Analyzing 352 peer-review simulation reports, we identify consistent system state indicators that demonstrate its reliability. First, the system is able to simulate calibrated editorial judgment, with ‘Revise’ decisions consistently forming the majority outcome (>50%) across all disciplines, while ‘Reject’ rates dynamically adapt to field-specific norms, rising to 45% in Health Sciences. Second, it maintains unwavering procedural integrity, enforcing a stable 29% evidence-anchoring compliance rate that remains invariant across diverse review tasks and scientific domains. These findings demonstrate a system that is predictably rule-bound, mitigating the stochasticity of generative AI. For the scientific community, this provides a transparent tool to ensure fairness; for publishing strategists, it offers a scalable instrument for auditing workflows, managing integrity risks, and implementing evidence-based governance. The framework repositions AI as an essential component of institutional accountability, providing the critical infrastructure to maintain trust in scholarly communication.
12. To Mask or to Mirror: Human-AI Alignment in Collective Reasoning
- Authors: Crystal Qian , Aaron Parisi , Clémentine Bouleau , Vivian Tsai , Maël Lebreton , Lucas Dixon
- URL: https://arxiv.org/abs/2510.01924
- Abstract:
As large language models (LLMs) are increasingly used to model and augment collective decision-making, it is critical to examine their alignment with human social reasoning. We present an empirical framework for assessing collective alignment, in contrast to prior work on the individual level. Using the Lost at Sea social psychology task, we conduct a large-scale online experiment (N=748), randomly assigning groups to leader elections with either visible demographic attributes (e.g. name, gender) or pseudonymous aliases. We then simulate matched LLM groups conditioned on the human data, benchmarking Gemini 2.5, GPT 4.1, Claude Haiku 3.5, and Gemma 3. LLM behaviors diverge: some mirror human biases; others mask these biases and attempt to compensate for them. We empirically demonstrate that human-AI alignment in collective reasoning depends on context, cues, and model-specific inductive biases. Understanding how LLMs align with collective human behavior is critical to advancing socially-aligned AI, and demands dynamic benchmarks that capture the complexities of collective reasoning.
13. Constrained Adaptive Rejection Sampling
- Authors: Paweł Parys , Sairam Vaidya , Taylor Berg-Kirkpatrick , Loris D’Antoni
- URL: https://arxiv.org/abs/2510.01902
- Abstract:
Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding methods enforce validity during decoding but distort the LM’s distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains – e.g., program fuzzing and molecular generation – CARS consistently achieves higher efficiency – measured in the number of LM forward passes per valid sample – while also producing stronger sample diversity than both GCD and methods that approximate the LM’s distribution.
14. Learning a Dense Reasoning Reward Model from Expert Demonstration via Inverse Reinforcement Learning
- Authors: Claudio Fanconi , Nicolás Astorga , Mihaela van der Schaar
- URL: https://arxiv.org/abs/2510.01857
- Abstract:
We reframe and operationalise adversarial inverse reinforcement learning (IRL) to large language model reasoning, learning a dense, token-level reward model for process supervision directly from expert demonstrations rather than imitating style via supervised fine-tuning. The learned reasoning reward serves two complementary roles: (i) it provides step-level feedback to optimise a reasoning policy during training; and (ii) it functions at inference as a critic to rerank sampled traces under fixed compute budgets. We demonstrate that our approach prioritises correctness over surface form, yielding scores that correlate with eventual answer validity and enabling interpretable localisation of errors within a trace. Empirically, on GSM8K with Llama3 and Qwen2.5 backbones, we demonstrate: (i) dense reasoning rewards can be used as a learning signal to elicit reasoning, and (ii) predictive performance is improved from reward-guided reranking (notably for Llama-based policies). By unifying training signals, inference-time selection, and token-level diagnostics into a single reasoning reward, this work suggests reusable process-level rewards with broad potential to enhance multi-step reasoning in language models.
15. Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
- Authors: Zhihao Dou , Qinjian Zhao , Zhongwei Wan , Dinggen Zhang , Weida Wang , Towsif Raiyan , Benteng Chen , Qingtao Pan , Yang Ouyang , Zhiqiang Gao , Shufei Zhang , Sumon Biswas
- URL: https://arxiv.org/abs/2510.01833
- Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning abilities in complex tasks, often relying on Chain-of-Thought (CoT) reasoning. However, due to their autoregressive token-level generation, the reasoning process is largely constrained to local decision-making and lacks global planning. This limitation frequently results in redundant, incoherent, or inaccurate reasoning, which significantly degrades overall performance. Existing approaches, such as tree-based algorithms and reinforcement learning (RL), attempt to address this issue but suffer from high computational costs and often fail to produce optimal reasoning trajectories. To tackle this challenge, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization PTA-GRPO, a two-stage framework designed to improve both high-level planning and fine-grained CoT reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning (SFT). In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance, thereby enhancing reasoning effectiveness. We conduct extensive experiments on multiple mathematical reasoning benchmarks, including MATH, AIME2024, AIME2025, and AMC, across diverse base models such as Qwen2.5-7B-Instruct, Qwen3-8B, Qwen3-14B, and LLaMA3.2-3B. Experimental results demonstrate that PTA-GRPO consistently achieves stable and significant improvements across different models and tasks, validating its effectiveness and generalization.
16. Human-AI Teaming Co-Learning in Military Operations
- Authors: Clara Maathuis , Kasper Cools
- URL: https://arxiv.org/abs/2510.01815
- Abstract:
In a time of rapidly evolving military threats and increasingly complex operational environments, the integration of AI into military operations proves significant advantages. At the same time, this implies various challenges and risks regarding building and deploying human-AI teaming systems in an effective and ethical manner. Currently, understanding and coping with them are often tackled from an external perspective considering the human-AI teaming system as a collective agent. Nevertheless, zooming into the dynamics involved inside the system assures dealing with a broader palette of relevant multidimensional responsibility, safety, and robustness aspects. To this end, this research proposes the design of a trustworthy co-learning model for human-AI teaming in military operations that encompasses a continuous and bidirectional exchange of insights between the human and AI agents as they jointly adapt to evolving battlefield conditions. It does that by integrating four dimensions. First, adjustable autonomy for dynamically calibrating the autonomy levels of agents depending on aspects like mission state, system confidence, and environmental uncertainty. Second, multi-layered control which accounts continuous oversight, monitoring of activities, and accountability. Third, bidirectional feedback with explicit and implicit feedback loops between the agents to assure a proper communication of reasoning, uncertainties, and learned adaptations that each of the agents has. And fourth, collaborative decision-making which implies the generation, evaluation, and proposal of decisions associated with confidence levels and rationale behind them. The model proposed is accompanied by concrete exemplifications and recommendations that contribute to further developing responsible and trustworthy human-AI teaming systems in military operations.
17. REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing
- Authors: Thanh Ma , Tri-Tam La , Lam-Thu Le Huu , Minh-Nghi Nguyen , Khanh-Van Pham Luu , Huu-Hoa Nguyen
- URL: https://arxiv.org/abs/2510.01800
- Abstract:
Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective systems requires domain specific regulatory resources. To address this challenge, we propose REBot, an LLM enhanced advisory chatbot powered by CatRAG, a hybrid retrieval reasoning framework that integrates retrieval augmented generation with graph based reasoning. CatRAG unifies dense retrieval and graph reasoning, supported by a hierarchical, category labeled knowledge graph enriched with semantic features for domain alignment. A lightweight intent classifier routes queries to the appropriate retrieval modules, ensuring both factual accuracy and contextual depth. We construct a regulation specific dataset and evaluate REBot on classification and question answering tasks, achieving state of the art performance with an F1 score of 98.89%. Finally, we implement a web application that demonstrates the practical value of REBot in real world academic advising scenarios.
18. A cybersecurity AI agent selection and decision support framework
- Authors: Masike Malatji
- URL: https://arxiv.org/abs/2510.01751
- Abstract:
This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures, reactive, cognitive, hybrid, and learning, with the comprehensive National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real-time responsiveness to each subcategory’s security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real-world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long-term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi-agent systems that adhere to industry standards.
19. MetaboT: AI-based agent for natural language-based interaction with metabolomics knowledge graphs
- Authors: Madina Bekbergenova (ICN), Lucas Pradi (ICN), Benjamin Navet (ICN), Emma Tysinger (ICN), Franck Michel (WIMMICS), Matthieu Feraud (ICN), Yousouf Taghzouti (ICN, WIMMICS), Yan Zhou Chen , Olivier Kirchhoffer (UNIGE), Florence Mehl (SIB), Martin Legrand (ICN), Tao Jiang (ICN), Marco Pagni (SIB), Soha Hassoun , Jean-Luc Wolfender (UNIGE), Wout Bittremieux (UA), Fabien Gandon (WIMMICS, Laboratoire I3S - SPARKS), Louis-Félix Nothias (CNRS, UniCA, ICN)
- URL: https://arxiv.org/abs/2510.01724
- Abstract:
Mass spectrometry metabolomics generates vast amounts of data requiring advanced methods for interpretation. Knowledge graphs address these challenges by structuring mass spectrometry data, metabolite information, and their relationships into a connected network (Gaudry et al. 2024). However, effective use of a knowledge graph demands an in-depth understanding of its ontology and its query language syntax. To overcome this, we designed MetaboT, an AI system utilizing large language models (LLMs) to translate user questions into SPARQL semantic query language for operating on knowledge graphs (Steve Harris 2013). We demonstrate its effectiveness using the Experimental Natural Products Knowledge Graph (ENPKG), a large-scale public knowledge graph for plant natural products (Gaudry et al. 2024).MetaboT employs specialized AI agents for handling user queries and interacting with the knowledge graph by breaking down complex tasks into discrete components, each managed by a specialised agent (Fig. 1a). The multi-agent system is constructed using the LangChain and LangGraph libraries, which facilitate the integration of LLMs with external tools and information sources (LangChain, n.d.). The query generation process follows a structured workflow. First, the Entry Agent determines if the question is new or a follow-up to previous interactions. New questions are forwarded to the Validator Agent, which verifies if the question is related to the knowledge graph. Then, the valid question is sent to the Supervisor Agent, which identifies if the question requires chemical conversions or standardized identifiers. In this case it delegates the question to the Knowledge Graph Agent, which can use tools to extract necessary details, such as URIs or taxonomies of chemical names, from the user query. Finally, an agent responsible for crafting the SPARQL queries equipped with the ontology of the knowledge graph uses the provided identifiers to generate the query. Then, the system executes the generated query against the metabolomics knowledge graph and returns structured results to the user (Fig. 1b). To assess the performance of MetaboT we have curated 50 metabolomics-related questions and their expected answers. In addition to submitting these questions to MetaboT, we evaluated a baseline by submitting them to a standard LLM (GPT-4o) with a prompt that incorporated the knowledge graph ontology but did not provide specific entity IDs. This baseline achieved only 8.16% accuracy, compared to MetaboT’s 83.67%, underscoring the necessity of our multi-agent system for accurately retrieving entities and generating correct SPARQL queries. MetaboT demonstrates promising performance as a conversational question-answering assistant, enabling researchers to retrieve structured metabolomics data through natural language queries. By automating the generation and execution of SPARQL queries, it removes technical barriers that have traditionally hindered access to knowledge graphs. Importantly, MetaboT leverages the capabilities of LLMs while maintaining experimentally grounded query generation, ensuring that outputs remain aligned with domain-specific standards and data structures. This approach facilitates data-driven discoveries by bridging the gap between complex semantic technologies and user-friendly interaction. MetaboT is accessible at [ this https URL ], and its source code is available at [ this https URL ].
20. VaPR – Vision-language Preference alignment for Reasoning
- Authors: Rohan Wadhawan , Fabrice Y Harel-Canada , Zi-Yi Dou , Suhaila Shakiah , Robinson Piramuthu , Nanyun Peng
- URL: https://arxiv.org/abs/2510.01700
- Abstract:
Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing, that produces rejected responses with targeted errors, maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL & Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer “Yes” in binary questions - addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on \name, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page this https URL
21. Improving AGI Evaluation: A Data Science Perspective
- Authors: John Hawkins
- URL: https://arxiv.org/abs/2510.01687
- Abstract:
Evaluation of potential AGI systems and methods is difficult due to the breadth of the engineering goal. We have no methods for perfect evaluation of the end state, and instead measure performance on small tests designed to provide directional indication that we are approaching AGI. In this work we argue that AGI evaluation methods have been dominated by a design philosophy that uses our intuitions of what intelligence is to create synthetic tasks, that have performed poorly in the history of AI. Instead we argue for an alternative design philosophy focused on evaluating robust task execution that seeks to demonstrate AGI through competence. This perspective is developed from common practices in data science that are used to show that a system can be reliably deployed. We provide practical examples of what this would mean for AGI evaluation.
22. A Locally Executable AI System for Improving Preoperative Patient Communication: A Multi-Domain Clinical Evaluation
- Authors: Motoki Sato (Nagasaki University, Japan), Yuki Matsushita (Nagasaki University, Japan), Hidekazu Takahashi (Boston Medical Sciences, Tokyo, Japan), Tomoaki Kakazu (Showa Medical University Koto Toyosu Hospital, Japan), Sou Nagata (Nagasaki University, Japan), Mizuho Ohnuma (Nagasaki University, Japan), Atsushi Yoshikawa (Kanto Gakuin University, Japan), Masayuki Yamamura (Institute of Science Tokyo, Japan)
- URL: https://arxiv.org/abs/2510.01671
- Abstract:
Patients awaiting invasive procedures often have unanswered pre-procedural questions; however, time-pressured workflows and privacy constraints limit personalized counseling. We present LENOHA (Low Energy, No Hallucination, Leave No One Behind Architecture), a safety-first, local-first system that routes inputs with a high-precision sentence-transformer classifier and returns verbatim answers from a clinician-curated FAQ for clinical queries, eliminating free-text generation in the clinical path. We evaluated two domains (tooth extraction and gastroscopy) using expert-reviewed validation sets (n=400/domain) for thresholding and independent test sets (n=200/domain). Among the four encoders, E5-large-instruct (560M) achieved an overall accuracy of 0.983 (95% CI 0.964-0.991), AUC 0.996, and seven total errors, which were statistically indistinguishable from GPT-4o on this task; Gemini made no errors on this test set. Energy logging shows that the non-generative clinical path consumes ~1.0 mWh per input versus ~168 mWh per small-talk reply from a local 8B SLM, a ~170x difference, while maintaining ~0.10 s latency on a single on-prem GPU. These results indicate that near-frontier discrimination and generation-induced errors are structurally avoided in the clinical path by returning vetted FAQ answers verbatim, supporting privacy, sustainability, and equitable deployment in bandwidth-limited environments.
23. Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness
- Authors: Erfan Shayegani , Keegan Hines , Yue Dong , Nael Abu-Ghazaleh , Roman Lutz , Spencer Whitehead , Vidhisha Balachandran , Besmira Nushi , Vibhav Vineet
- URL: https://arxiv.org/abs/2510.01670
- Abstract:
Computer-Use Agents (CUAs) are an increasingly deployed class of agents that take actions on GUIs to accomplish user goals. In this paper, we show that CUAs consistently exhibit Blind Goal-Directedness (BGD): a bias to pursue goals regardless of feasibility, safety, reliability, or context. We characterize three prevalent patterns of BGD: (i) lack of contextual reasoning, (ii) assumptions and decisions under ambiguity, and (iii) contradictory or infeasible goals. We develop BLIND-ACT, a benchmark of 90 tasks capturing these three patterns. Built on OSWorld, BLIND-ACT provides realistic environments and employs LLM-based judges to evaluate agent behavior, achieving 93.75% agreement with human annotations. We use BLIND-ACT to evaluate nine frontier models, including Claude Sonnet and Opus 4, Computer-Use-Preview, and GPT-5, observing high average BGD rates (80.8%) across them. We show that BGD exposes subtle risks that arise even when inputs are not directly harmful. While prompting-based interventions lower BGD levels, substantial risk persists, highlighting the need for stronger training- or inference-time interventions. Qualitative analysis reveals observed failure modes: execution-first bias (focusing on how to act over whether to act), thought-action disconnect (execution diverging from reasoning), and request-primacy (justifying actions due to user request). Identifying BGD and introducing BLIND-ACT establishes a foundation for future research on studying and mitigating this fundamental risk and ensuring safe CUA deployment.
24. GuruAgents: Emulating Wise Investors with Prompt-Guided LLM Agents
- Authors: Yejin Kim , Youngbin Lee , Juhyeong Kim , Yongjae Lee
- URL: https://arxiv.org/abs/2510.01664
- Abstract:
This study demonstrates that GuruAgents, prompt-guided AI agents, can systematically operationalize the strategies of legendary investment gurus. We develop five distinct GuruAgents, each designed to emulate an iconic investor, by encoding their distinct philosophies into LLM prompts that integrate financial tools and a deterministic reasoning pipeline. In a backtest on NASDAQ-100 constituents from Q4 2023 to Q2 2025, the GuruAgents exhibit unique behaviors driven by their prompted personas. The Buffett GuruAgent achieves the highest performance, delivering a 42.2\% CAGR that significantly outperforms benchmarks, while other agents show varied results. These findings confirm that prompt engineering can successfully translate the qualitative philosophies of investment gurus into reproducible, quantitative strategies, highlighting a novel direction for automated systematic investing. The source code and data are available at this https URL .
25. Understanding the Geospatial Reasoning Capabilities of LLMs: A Trajectory Recovery Perspective
- Authors: Thinh Hung Truong , Jey Han Lau , Jianzhong Qi
- URL: https://arxiv.org/abs/2510.01639
- Abstract:
We explore the geospatial reasoning capabilities of Large Language Models (LLMs), specifically, whether LLMs can read road network maps and perform navigation. We frame trajectory recovery as a proxy task, which requires models to reconstruct masked GPS traces, and introduce GLOBALTRACE, a dataset with over 4,000 real-world trajectories across diverse regions and transportation modes. Using road network as context, our prompting framework enables LLMs to generate valid paths without accessing any external navigation tools. Experiments show that LLMs outperform off-the-shelf baselines and specialized trajectory recovery models, with strong zero-shot generalization. Fine-grained analysis shows that LLMs have strong comprehension of the road network and coordinate systems, but also pose systematic biases with respect to regions and transportation modes. Finally, we demonstrate how LLMs can enhance navigation experiences by reasoning over maps in flexible ways to incorporate user preferences.
26. Learning to Decide with Just Enough: Information-Theoretic Context Summarization for CDMPs
- Authors: Peidong Liu , Junjiang Lin , Shaowen Wang , Yao Xu , Haiqing Li , Xuhao Xie , Siyi Wu , Hao Li
- URL: https://arxiv.org/abs/2510.01620
- Abstract:
Contextual Markov Decision Processes (CMDPs) offer a framework for sequential decision-making under external signals, but existing methods often fail to generalize in high-dimensional or unstructured contexts, resulting in excessive computation and unstable performance. We propose an information-theoretic summarization approach that uses large language models (LLMs) to compress contextual inputs into low-dimensional, semantically rich summaries. These summaries augment states by preserving decision-critical cues while reducing redundancy. Building on the notion of approximate context sufficiency, we provide, to our knowledge, the first regret bounds and a latency-entropy trade-off characterization for CMDPs. Our analysis clarifies how informativeness impacts computational cost. Experiments across discrete, continuous, visual, and recommendation benchmarks show that our method outperforms raw-context and non-context baselines, improving reward, success rate, and sample efficiency, while reducing latency and memory usage. These findings demonstrate that LLM-based summarization offers a scalable and interpretable solution for efficient decision-making in context-rich, resource-constrained environments.
27. PychoBench: Evaluating the Psychology Intelligence of Large Language Models
- Authors: Min Zeng
- URL: https://arxiv.org/abs/2510.01611
- Abstract:
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of industries, primarily due to their impressive generative abilities. Yet, their potential in applications requiring cognitive abilities, such as psychological counseling, remains largely untapped. This paper investigates the key question: Can LLMs be effectively applied to psychological counseling? To determine whether an LLM can effectively take on the role of a psychological counselor, the first step is to assess whether it meets the qualifications required for such a role, namely the ability to pass the U.S. National Counselor Certification Exam (NCE). This is because, just as a human counselor must pass a certification exam to practice, an LLM must demonstrate sufficient psychological knowledge to meet the standards required for such a role. To address this, we introduce PsychoBench, a benchmark grounded in this http URL counselor examinations, a licensure test for professional counselors that requires about 70% accuracy to pass. PsychoBench comprises approximately 2,252 carefully curated single-choice questions, crafted to require deep understanding and broad enough to cover various sub-disciplines of psychology. This benchmark provides a comprehensive assessment of an LLM’s ability to function as a counselor. Our evaluation shows that advanced models such as GPT-4o, Llama3.3-70B, and Gemma3-27B achieve well above the passing threshold, while smaller open-source models (e.g., Qwen2.5-7B, Mistral-7B) remain far below it. These results suggest that only frontier LLMs are currently capable of meeting counseling exam standards, highlighting both the promise and the challenges of developing psychology-oriented LLMs.
28. AgentRec: Next-Generation LLM-Powered Multi-Agent Collaborative Recommendation with Adaptive Intelligence
- Authors: Bo Ma , Hang Li , ZeHua Hu , XiaoFan Gui , LuYao Liu , Simon Lau
- URL: https://arxiv.org/abs/2510.01609
- Abstract:
Interactive conversational recommender systems have gained significant attention for their ability to capture user preferences through natural language interactions. However, existing approaches face substantial challenges in handling dynamic user preferences, maintaining conversation coherence, and balancing multiple ranking objectives simultaneously. This paper introduces AgentRec, a next-generation LLM-powered multi-agent collaborative recommendation framework that addresses these limitations through hierarchical agent networks with adaptive intelligence. Our approach employs specialized LLM-powered agents for conversation understanding, preference modeling, context awareness, and dynamic ranking, coordinated through an adaptive weighting mechanism that learns from interaction patterns. We propose a three-tier learning strategy combining rapid response for simple queries, intelligent reasoning for complex preferences, and deep collaboration for challenging scenarios. Extensive experiments on three real-world datasets demonstrate that AgentRec achieves consistent improvements over state-of-the-art baselines, with 2.8\% enhancement in conversation success rate, 1.9\% improvement in recommendation accuracy (NDCG@10), and 3.2\% better conversation efficiency while maintaining comparable computational costs through intelligent agent coordination.
29. AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning
- Authors: Zhenyu Pan , Yiting Zhang , Zhuo Liu , Yolo Yunlong Tang , Zeliang Zhang , Haozheng Luo , Yuwei Han , Jianshu Zhang , Dennis Wu , Hong-Yu Chen , Haoran Lu , Haoyang Fang , Manling Li , Chenliang Xu , Philip S. Yu , Han Liu
- URL: https://arxiv.org/abs/2510.01586
- Abstract:
LLM-based multi-agent systems excel at planning, tool use, and role coordination, but their openness and interaction complexity also expose them to jailbreak, prompt-injection, and adversarial collaboration. Existing defenses fall into two lines: (i) self-verification that asks each agent to pre-filter unsafe instructions before execution, and (ii) external guard modules that police behaviors. The former often underperforms because a standalone agent lacks sufficient capacity to detect cross-agent unsafe chains and delegation-induced risks; the latter increases system overhead and creates a single-point-of-failure-once compromised, system-wide safety collapses, and adding more guards worsens cost and complexity. To solve these challenges, we propose AdvEvo-MARL, a co-evolutionary multi-agent reinforcement learning framework that internalizes safety into task agents. Rather than relying on external guards, AdvEvo-MARL jointly optimizes attackers (which synthesize evolving jailbreak prompts) and defenders (task agents trained to both accomplish their duties and resist attacks) in adversarial learning environments. To stabilize learning and foster cooperation, we introduce a public baseline for advantage estimation: agents within the same functional group share a group-level mean-return baseline, enabling lower-variance updates and stronger intra-group coordination. Across representative attack scenarios, AdvEvo-MARL consistently keeps attack-success rate (ASR) below 20%, whereas baselines reach up to 38.33%, while preserving-and sometimes improving-task accuracy (up to +3.67% on reasoning tasks). These results show that safety and utility can be jointly improved without relying on extra guard agents or added system overhead.
30. InvThink: Towards AI Safety via Inverse Reasoning
- Authors: Yubin Kim , Taehan Kim , Eugene Park , Chunjong Park , Cynthia Breazeal , Daniel McDuff , Hae Won Park
- URL: https://arxiv.org/abs/2510.01569
- Abstract:
We present InvThink, a simple yet powerful approach that gives large language models (LLMs) the capability of inverse thinking: reasoning through failure modes before generating responses. Unlike existing safety alignment methods that optimize directly for safe response, InvThink instructs models to 1) enumerate potential harms, 2) analyze their consequences, and 3) generate safe outputs that proactively avoid these risks. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods. (ii) InvThink mitigates safety tax; by training models to systematically consider failure modes, it preserves general reasoning capabilities on standard benchmarks. (iii) beyond general safety tasks, InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses compared to baseline methods like SafetyPrompt. We further implement InvThink via supervised fine-tuning, and reinforcement learning across three LLM families. These results suggest that inverse reasoning provides a scalable and generalizable path toward safer, more capable language models.
31. Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models
- Authors: Shaoan Xie , Lingjing Kong , Xiangchen Song , Xinshuai Dong , Guangyi Chen , Eric P.Xing , Kun Zhang
- URL: https://arxiv.org/abs/2510.01544
- Abstract:
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation, yet training them for complex reasoning remains a key challenge. Current reinforcement learning approaches often rely on sparse, outcome-based rewards, which can reinforce flawed reasoning paths that lead to coincidentally correct answers. We argue that this stems from a fundamental mismatch with the natural structure of reasoning. We first propose a theoretical framework that formalizes complex problem solving as a hierarchical selection process, where an intractable global constraint is decomposed into a series of simpler, localized logical steps. This framework provides a principled foundation for algorithm design, including theoretical insights into the identifiability of this latent reasoning structure. Motivated by this theory, we identify unstructured refinement – a failure mode where a model’s iterative steps do not contribute meaningfully to the solution – as a core deficiency in existing methods. We then introduce Step-Aware Policy Optimization (SAPO), a novel RL algorithm that aligns the dLLM’s denoising process with the latent reasoning hierarchy. By using a process-based reward function that encourages incremental progress, SAPO guides the model to learn structured, coherent reasoning paths. Our empirical results show that this principled approach significantly improves performance on challenging reasoning benchmarks and enhances the interpretability of the generation process.
32. Information Seeking for Robust Decision Making under Partial Observability
- Authors: Djengo Cyun-Jyun Fang , Tsung-Wei Ke
- URL: https://arxiv.org/abs/2510.01531
- Abstract:
Explicit information seeking is essential to human problem-solving in practical environments characterized by incomplete information and noisy dynamics. When the true environmental state is not directly observable, humans seek information to update their internal dynamics and inform future decision-making. Although existing Large Language Model (LLM) planning agents have addressed observational uncertainty, they often overlook discrepancies between their internal dynamics and the actual environment. We introduce Information Seeking Decision Planner (InfoSeeker), an LLM decision-making framework that integrates task-oriented planning with information seeking to align internal dynamics and make optimal decisions under uncertainty in both agent observations and environmental dynamics. InfoSeeker prompts an LLM to actively gather information by planning actions to validate its understanding, detect environmental changes, or test hypotheses before generating or revising task-oriented plans. To evaluate InfoSeeker, we introduce a novel benchmark suite featuring partially observable environments with incomplete observations and uncertain dynamics. Experiments demonstrate that InfoSeeker achieves a 74% absolute performance gain over prior methods without sacrificing sample efficiency. Moreover, InfoSeeker generalizes across LLMs and outperforms baselines on established benchmarks such as robotic manipulation and web navigation. These findings underscore the importance of tightly integrating planning and information seeking for robust behavior in partially observable environments. The project page is available at this https URL
33. LOGicalThought: Logic-Based Ontological Grounding of LLMs for High-Assurance Reasoning
- Authors: Navapat Nananukul , Yue Zhang , Ryan Lee , Eric Boxer , Jonathan May , Vibhav Giridhar Gogate , Jay Pujara , Mayank Kejriwal
- URL: https://arxiv.org/abs/2510.01530
- Abstract:
High-assurance reasoning, particularly in critical domains such as law and medicine, requires conclusions that are accurate, verifiable, and explicitly grounded in evidence. This reasoning relies on premises codified from rules, statutes, and contracts, inherently involving defeasible or non-monotonic logic due to numerous exceptions, where the introduction of a single fact can invalidate general rules, posing significant challenges. While large language models (LLMs) excel at processing natural language, their capabilities in standard inference tasks do not translate to the rigorous reasoning required over high-assurance text guidelines. Core reasoning challenges within such texts often manifest specific logical structures involving negation, implication, and, most critically, defeasible rules and exceptions. In this paper, we propose a novel neurosymbolically-grounded architecture called LOGicalThought (LogT) that uses an advanced logical language and reasoner in conjunction with an LLM to construct a dual symbolic graph context and logic-based context. These two context representations transform the problem from inference over long-form guidelines into a compact grounded evaluation. Evaluated on four multi-domain benchmarks against four baselines, LogT improves overall performance by 11.84% across all LLMs. Performance improves significantly across all three modes of reasoning: by up to +10.2% on negation, +13.2% on implication, and +5.5% on defeasible reasoning compared to the strongest baseline.
34. Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation
- Authors: Daniel Zhao , Abhilash Shankarampeta , Lanxiang Hu , Tajana Rosing , Hao Zhang
- URL: https://arxiv.org/abs/2510.01528
- Abstract:
We propose a novel method that leverages sparse autoencoders (SAEs) and clustering techniques to analyze the internal token representations of large language models (LLMs) and guide generations in mathematical reasoning tasks. Our approach first trains an SAE to generate sparse vector representations for training tokens, then applies k-means clustering to construct a graph where vertices represent token clusters and weighted edges capture sequential token transitions. Using this graph, we define an edge-weight based reward function to quantify adherence to established reasoning traces, thereby identifying exploitative reasoning trajectories. Additionally, we measure generation diversity from clustering to assess the extent of exploration. Our findings indicate that balancing both exploitation and exploration is crucial for achieving high accuracy in mathematical reasoning tasks. During generation, the SAE can serve as a scalable reward model to guide generations, ensuring a balanced trade-off between exploitation and exploration. This prevents extreme behaviors in either direction, ultimately fostering a higher-quality reasoning process in LLMs.
35. Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates
- Authors: Abhinav Madahar
- URL: https://arxiv.org/abs/2510.01500
- Abstract:
Modern deployments increasingly allocate large test-time compute (thousands of tokens or many node expansions) to boost reliability. Under such budgets, standard Tree-of-Thoughts-style search exhibits two pathologies: breadth saturation (additional samples mostly produce near-duplicates, so width stops growing) and depth myopia (noisy short-horizon utilities prune branches whose payoff appears after a few more steps). We propose Lateral Tree-of-Thoughts (LToT), a drop-in controller that separates utility from logical consistency and treats low-utility but consistent candidates as assets rather than waste. The frontier is split into mainlines (high-utility candidates used for exploitation) and laterals (consistent, initially low-utility candidates that receive short, cheap probes before judgment). LToT explores laterals via Lateral Racing with Short-Circuit (LR–SC): a capped successive-halving race that spreads tiny probes across a very wide lateral set, uses width-aware thresholds with repeat-to-confirm, and immediately promotes a branch once its envelope clears the mainline bar; mainlines are kept intentionally narrow so surplus compute is invested where width is cheap. We prove a pseudolinear lateral cost $\Theta(N_0 \log_{\eta} N_0)$ with logarithmically many rungs (initial lateral width $N_0$; culling factor $\eta>1$), in contrast to the exponential growth of uncapped mainlines. Empirical evaluations on benchmark tasks are in preparation and will be added in a future revision. In short, LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.
36. AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance
- Authors: Bill Marino , Rosco Hunter , Zubair Jamali , Marinos Emmanouil Kalpakos , Mudra Kashyap , Isaiah Hinton , Alexa Hanson , Maahum Nazir , Christoph Schnabl , Felix Steffek , Hongkai Wen , Nicholas D. Lane
- URL: https://arxiv.org/abs/2510.01474
- Abstract:
As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts’ compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at this https URL .
37. VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal Reasoning
- Authors: Rui Liu , Dian Yu , Tong Zheng , Runpeng Dai , Zongxia Li , Wenhao Yu , Zhenwen Liang , Linfeng Song , Haitao Mi , Pratap Tokekar , Dong Yu
- URL: https://arxiv.org/abs/2510.01444
- Abstract:
Reinforcement learning with verifiable rewards (RLVR) improves reasoning in large language models (LLMs) but struggles with exploration, an issue that still persists for multimodal LLMs (MLLMs). Current methods treat the visual input as a fixed, deterministic condition, overlooking a critical source of ambiguity and struggling to build policies robust to plausible visual variations. We introduce $\textbf{VOGUE (Visual Uncertainty Guided Exploration)}$, a novel method that shifts exploration from the output (text) to the input (visual) space. By treating the image as a stochastic context, VOGUE quantifies the policy’s sensitivity to visual perturbations using the symmetric KL divergence between a “raw” and “noisy” branch, creating a direct signal for uncertainty-aware exploration. This signal shapes the learning objective via an uncertainty-proportional bonus, which, combined with a token-entropy bonus and an annealed sampling schedule, effectively balances exploration and exploitation. Implemented within GRPO on two model scales (Qwen2.5-VL-3B/7B), VOGUE boosts pass@1 accuracy by an average of 2.6% on three visual math benchmarks and 3.7% on three general-domain reasoning benchmarks, while simultaneously increasing pass@4 performance and mitigating the exploration decay commonly observed in RL fine-tuning. Our work shows that grounding exploration in the inherent uncertainty of visual inputs is an effective strategy for improving multimodal reasoning.
38. On the Role of Domain Experts in Creating Effective Tutoring Systems
- Authors: Sarath Sreedharan , Kelsey Sikes , Nathaniel Blanchard , Lisa Mason , Nikhil Krishnaswamy , Jill Zarestky
- URL: https://arxiv.org/abs/2510.01432
- Abstract:
The role that highly curated knowledge, provided by domain experts, could play in creating effective tutoring systems is often overlooked within the AI for education community. In this paper, we highlight this topic by discussing two ways such highly curated expert knowledge could help in creating novel educational systems. First, we will look at how one could use explainable AI (XAI) techniques to automatically create lessons. Most existing XAI methods are primarily aimed at debugging AI systems. However, we will discuss how one could use expert specified rules about solving specific problems along with novel XAI techniques to automatically generate lessons that could be provided to learners. Secondly, we will see how an expert specified curriculum for learning a target concept can help develop adaptive tutoring systems, that can not only provide a better learning experience, but could also allow us to use more efficient algorithms to create these systems. Finally, we will highlight the importance of such methods using a case study of creating a tutoring system for pollinator identification, where such knowledge could easily be elicited from experts.
39. A Tale of LLMs and Induced Small Proxies: Scalable Agents for Knowledge Mining
- Authors: Sipeng Zhang , Longfei Yun , Zilong Wang , Jingbo Shang , Letian Peng
- URL: https://arxiv.org/abs/2510.01427
- Abstract:
At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response to user instructions. Large language models (LLMs) excel at interpreting such instructions but are prohibitively expensive to deploy at scale, while traditional pipelines of classifiers and extractors remain efficient yet brittle and unable to generalize to new tasks. We introduce Falconer, a collaborative framework that combines the agentic reasoning of LLMs with lightweight proxy models for scalable knowledge mining. In Falconer, LLMs act as planners, decomposing user instructions into executable pipelines, and as annotators, generating supervision to train small proxies. The framework unifies classification and extraction into two atomic operations, get label and get span, enabling a single instruction-following model to replace multiple task-specific components. To evaluate the consistency between proxy models incubated by Falconer and annotations provided by humans and large models, we construct new benchmarks covering both planning and end-to-end execution. Experiments show that Falconer closely matches state-of-the-art LLMs in instruction-following accuracy while reducing inference cost by up to 90% and accelerating large-scale knowledge mining by more than 20x, offering an efficient and scalable foundation for Deep Research.
40. OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models
- Authors: Luca Cotti , Idilio Drago , Anisa Rula , Devis Bianchini , Federico Cerutti
- URL: https://arxiv.org/abs/2510.01409
- Abstract:
System logs represent a valuable source of Cyber Threat Intelligence (CTI), capturing attacker behaviors, exploited vulnerabilities, and traces of malicious activity. Yet their utility is often limited by lack of structure, semantic inconsistency, and fragmentation across devices and sessions. Extracting actionable CTI from logs therefore requires approaches that can reconcile noisy, heterogeneous data into coherent and interoperable representations. We introduce OntoLogX, an autonomous Artificial Intelligence (AI) agent that leverages Large Language Models (LLMs) to transform raw logs into ontology-grounded Knowledge Graphs (KGs). OntoLogX integrates a lightweight log ontology with Retrieval Augmented Generation (RAG) and iterative correction steps, ensuring that generated KGs are syntactically and semantically valid. Beyond event-level analysis, the system aggregates KGs into sessions and employs a LLM to predict MITRE ATT&CK tactics, linking low-level log evidence to higher-level adversarial objectives. We evaluate OntoLogX on both logs from a public benchmark and a real-world honeypot dataset, demonstrating robust KG generation across multiple KGs backends and accurate mapping of adversarial activity to ATT&CK tactics. Results highlight the benefits of retrieval and correction for precision and recall, the effectiveness of code-oriented models in structured log analysis, and the value of ontology-grounded representations for actionable CTI extraction.
41. Automating Data-Driven Modeling and Analysis for Engineering Applications using Large Language Model Agents
- Authors: Yang Liu , Zaid Abulawi , Abhiram Garimidi , Doyeong Lim
- URL: https://arxiv.org/abs/2510.01398
- Abstract:
Modern engineering increasingly relies on vast datasets generated by experiments and simulations, driving a growing demand for efficient, reliable, and broadly applicable modeling strategies. There is also heightened interest in developing data-driven approaches, particularly neural network models, for effective prediction and analysis of scientific datasets. Traditional data-driven methods frequently involve extensive manual intervention, limiting their ability to scale effectively and generalize to diverse applications. In this study, we propose an innovative pipeline utilizing Large Language Model (LLM) agents to automate data-driven modeling and analysis, with a particular emphasis on regression tasks. We evaluate two LLM-agent frameworks: a multi-agent system featuring specialized collaborative agents, and a single-agent system based on the Reasoning and Acting (ReAct) paradigm. Both frameworks autonomously handle data preprocessing, neural network development, training, hyperparameter optimization, and uncertainty quantification (UQ). We validate our approach using a critical heat flux (CHF) prediction benchmark, involving approximately 25,000 experimental data points from the OECD/NEA benchmark dataset. Results indicate that our LLM-agent-developed model surpasses traditional CHF lookup tables and delivers predictive accuracy and UQ on par with state-of-the-art Bayesian optimized deep neural network models developed by human experts. These outcomes underscore the significant potential of LLM-based agents to automate complex engineering modeling tasks, greatly reducing human workload while meeting or exceeding existing standards of predictive performance.
42. Fine-tuning with RAG for Improving LLM Learning of New Skills
- Authors: Humaid Ibrahim , Nikolai Rozanov , Marek Rei
- URL: https://arxiv.org/abs/2510.01375
- Abstract:
Large language model (LLM) agents deployed for multi-step tasks frequently fail in predictable ways: attempting actions with unmet preconditions, issuing redundant commands, or mishandling environment constraints. While retrieval-augmented generation (RAG) can improve performance by providing runtime guidance, it requires maintaining external knowledge databases and adds computational overhead at every deployment. We propose a simple pipeline that converts inference-time retrieval into learned competence through distillation. Our approach: (1) extracts compact, reusable hints from agent failures, (2) uses these hints to generate improved teacher trajectories via one-shot retrieval at episode start, and (3) trains student models on these trajectories with hint strings removed, forcing internalization rather than memorization. Across two interactive benchmarks, ALFWorld (household tasks) and WebShop (online shopping), distilled students consistently outperform baseline agents, achieving up to 91% success on ALFWorld (vs. 79% for baselines) and improving WebShop scores to 72 (vs. 61 for baselines), while using 10-60% fewer tokens than retrieval-augmented teachers depending on the environment. The approach generalizes across model scales (7B/14B parameters) and agent architectures (ReAct/StateAct), demonstrating that retrieval benefits can be effectively internalized through targeted fine-tuning without permanent runtime dependencies.
43. Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
- Authors: Xinpeng Wang , Nitish Joshi , Barbara Plank , Rico Angell , He He
- URL: https://arxiv.org/abs/2510.01367
- Abstract:
Reward hacking, where a reasoning model exploits loopholes in a reward function to achieve high rewards without solving the intended task, poses a significant threat. This behavior may be explicit, i.e. verbalized in the model’s chain-of-thought (CoT), or implicit, where the CoT appears benign thus bypasses CoT monitors. To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation). Our key observation is that hacking occurs when exploiting the loophole is easier than solving the actual task. This means that the model is using less `effort’ than required to achieve high reward. TRACE quantifies effort by measuring how early a model’s reasoning becomes sufficient to pass a verifier. We progressively truncate a model’s CoT at various lengths, force the model to answer, and measure the verifier-passing rate at each cutoff. A hacking model, which takes a shortcut, will achieve a high passing rate with only a small fraction of its CoT, yielding a large area under the accuracy-vs-length curve. TRACE achieves over 65% gains over our strongest 72B CoT monitor in math reasoning, and over 30% gains over a 32B monitor in coding. We further show that TRACE can discover unknown loopholes during training. Overall, TRACE offers a scalable unsupervised approach for oversight where current monitoring methods prove ineffective.
44. Retrieval-Augmented Framework for LLM-Based Clinical Decision Support
- Authors: Leon Garza , Anantaa Kotal , Michael A. Grasso , Emre Umucu
- URL: https://arxiv.org/abs/2510.01363
- Abstract:
The increasing complexity of clinical decision-making, alongside the rapid expansion of electronic health records (EHR), presents both opportunities and challenges for delivering data-informed care. This paper proposes a clinical decision support system powered by Large Language Models (LLMs) to assist prescribing clinicians. The system generates therapeutic suggestions by analyzing historical EHR data, including patient demographics, presenting complaints, clinical symptoms, diagnostic information, and treatment histories. The framework integrates natural language processing with structured clinical inputs to produce contextually relevant recommendations. Rather than replacing clinician judgment, it is designed to augment decision-making by retrieving and synthesizing precedent cases with comparable characteristics, drawing on local datasets or federated sources where applicable. At its core, the system employs a retrieval-augmented generation (RAG) pipeline that harmonizes unstructured narratives and codified data to support LLM-based inference. We outline the system’s technical components, including representation representation alignment and generation strategies. Preliminary evaluations, conducted with de-identified and synthetic clinical datasets, examine the clinical plausibility and consistency of the model’s outputs. Early findings suggest that LLM-based tools may provide valuable decision support in prescribing workflows when appropriately constrained and rigorously validated. This work represents an initial step toward integration of generative AI into real-world clinical decision-making with an emphasis on transparency, safety, and alignment with established practices.
45. MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
- Authors: Darshan Deshpande , Varun Gangal , Hersh Mehta , Anand Kannappan , Rebecca Qian , Peng Wang
- URL: https://arxiv.org/abs/2510.01353
- Abstract:
Recent works on context and memory benchmarking have primarily focused on conversational instances but the need for evaluating memory in dynamic enterprise environments is crucial for its effective application. We introduce MEMTRACK, a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. MEMTRACK models realistic organizational workflows by integrating asynchronous events across multiple communication and productivity platforms such as Slack, Linear and Git. Each benchmark instance provides a chronologically platform-interleaved timeline, with noisy, conflicting, cross-referring information as well as potential codebase/file-system comprehension and exploration. Consequently, our benchmark tests memory capabilities such as acquistion, selection and conflict resolution. We curate the MEMTRACK dataset through both manual expert driven design and scalable agent based synthesis, generating ecologically valid scenarios grounded in real world software development processes. We introduce pertinent metrics for Correctness, Efficiency, and Redundancy that capture the effectiveness of memory mechanisms beyond simple QA performance. Experiments across SoTA LLMs and memory backends reveal challenges in utilizing memory across long horizons, handling cross-platform dependencies, and resolving contradictions. Notably, the best performing GPT-5 model only achieves a 60\% Correctness score on MEMTRACK. This work provides an extensible framework for advancing evaluation research for memory-augmented agents, beyond existing focus on conversational setups, and sets the stage for multi-agent, multi-platform memory benchmarking in complex organizational settings
46. Aristotle: IMO-level Automated Theorem Proving
- Authors: Tudor Achim , Alex Best , Kevin Der , Mathïs Fédérico , Sergei Gukov , Daniel Halpern-Leister , Kirsten Henningsgard , Yury Kudryashov , Alexander Meiburg , Martin Michelsen , Riley Patterson , Eric Rodriguez , Laura Scharff , Vikram Shanker , Vladmir Sicca , Hari Sowrirajan , Aidan Swope , Matyas Tamas , Vlad Tenev , Jonathan Thomm , Harold Williams , Lawrence Wu
- URL: https://arxiv.org/abs/2510.01346
- Abstract:
We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three main components: a Lean proof search system, an informal reasoning system that generates and formalizes lemmas, and a dedicated geometry solver. Our system demonstrates state-of-the-art performance with favorable scaling properties for automated theorem proving.
47. Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
- Authors: Yu Zeng , Wenxuan Huang , Shiting Huang , Xikun Bao , Yukun Qi , Yiming Zhao , Qiuchen Wang , Lin Chen , Zehui Chen , Huaian Chen , Wanli Ouyang , Feng Zhao
- URL: https://arxiv.org/abs/2510.01304
- Abstract:
Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 $\times$ 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets is available at this https URL .
48. The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation
- Authors: Zarreen Reza
- URL: https://arxiv.org/abs/2510.01295
- Abstract:
As Large Language Models (LLMs) transition from static tools to autonomous agents, traditional evaluation benchmarks that measure performance on downstream tasks are becoming insufficient. These methods fail to capture the emergent social and cognitive dynamics that arise when agents communicate, persuade, and collaborate in interactive environments. To address this gap, we introduce a novel evaluation framework that uses multi-agent debate as a controlled “social laboratory” to discover and quantify these behaviors. In our framework, LLM-based agents, instantiated with distinct personas and incentives, deliberate on a wide range of challenging topics under the supervision of an LLM moderator. Our analysis, enabled by a new suite of psychometric and semantic metrics, reveals several key findings. Across hundreds of debates, we uncover a powerful and robust emergent tendency for agents to seek consensus, consistently reaching high semantic agreement ({\mu} > 0.88) even without explicit instruction and across sensitive topics. We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort, and that the moderators persona can significantly alter debate outcomes by structuring the environment, a key finding for external AI alignment. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols designed for the agentic setting, offering a crucial methodology for understanding and shaping the social behaviors of the next generation of AI agents. We have released the code and results at this https URL .
49. Cyber Academia-Chemical Engineering (CA-ChemE): A Living Digital Town for Self-Directed Research Evolution and Emergent Scientific Discovery
- Authors: Zekun Jiang , Chunming Xu , Tianhang Zhou
- URL: https://arxiv.org/abs/2510.01293
- Abstract:
The rapid advancement of artificial intelligence (AI) has demonstrated substantial potential in chemical engineering, yet existing AI systems remain limited in interdisciplinary collaboration and exploration of uncharted problems. To address these issues, we present the Cyber Academia-Chemical Engineering (CA-ChemE) system, a living digital town that enables self-directed research evolution and emergent scientific discovery through multi-agent collaboration. By integrating domain-specific knowledge bases, knowledge enhancement technologies, and collaboration agents, the system successfully constructs an intelligent ecosystem capable of deep professional reasoning and efficient interdisciplinary collaboration. Our findings demonstrate that knowledge base-enabled enhancement mechanisms improved dialogue quality scores by 10-15% on average across all seven expert agents, fundamentally ensuring technical judgments are grounded in verifiable scientific evidence. However, we observed a critical bottleneck in cross-domain collaboration efficiency, prompting the introduction of a Collaboration Agent (CA) equipped with ontology engineering capabilities. CA’s intervention achieved 8.5% improvements for distant-domain expert pairs compared to only 0.8% for domain-proximate pairs - a 10.6-fold difference - unveiling the “diminished collaborative efficiency caused by knowledge-base gaps” effect. This study demonstrates how carefully designed multi-agent architectures can provide a viable pathway toward autonomous scientific discovery in chemical engineering.
50. Modeling Others’ Minds as Code
- Authors: Kunal Jha , Aydan Yuenan Huang , Eric Ye , Natasha Jaques , Max Kleiman-Weiner
- URL: https://arxiv.org/abs/2510.01272
- Abstract:
Accurate prediction of human behavior is essential for robust and safe human-AI collaboration. However, existing approaches for modeling people are often data-hungry and brittle because they either make unrealistic assumptions about rationality or are too computationally demanding to adapt rapidly. Our key insight is that many everyday social interactions may follow predictable patterns; efficient “scripts” that minimize cognitive load for actors and observers, e.g., “wait for the green light, then go.” We propose modeling these routines as behavioral programs instantiated in computer code rather than policies conditioned on beliefs and desires. We introduce ROTE, a novel algorithm that leverages both large language models (LLMs) for synthesizing a hypothesis space of behavioral programs, and probabilistic inference for reasoning about uncertainty over that space. We test ROTE in a suite of gridworld tasks and a large-scale embodied household simulator. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines – including behavior cloning and LLM-based methods – by as much as 50% in terms of in-sample accuracy and out-of-sample generalization. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real-world.
51. OR-Toolformer: Modeling and Solving Operations Research Problems with Tool Augmented Large Language Models
- Authors: Jianzhang Zhang , Jialong Zhou , Chuang Liu
- URL: https://arxiv.org/abs/2510.01253
- Abstract:
Large language models (LLMs) demonstrate strong mathematical reasoning, but reliance on closed-source APIs for OR tasks raises privacy concerns, and training open-source models from scratch incurs high compute costs. We introduce OR-Toolformer, which fine-tunes Llama-3.1-8B-Instruct with a semi-automatic data synthesis pipeline that generates diverse OR problem-answer pairs and augments the model with external solvers to produce API calls. On three of four standard benchmarks, OR-Toolformer achieves up to 80.1% execution accuracy, exceeding size-matched baselines by over 4.3%. In zero-shot evaluation on two unseen OR problem types, it attains 54% average accuracy, a 21 percentage-point improvement over the strongest baseline. These findings validate the efficacy of tool-augmented fine-tuning LLMs for accurate and generalizable OR problem modeling and solving.
52. NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation
- Authors: Ruozhen He , Moayed Haji-Ali , Ziyan Yang , Vicente Ordonez
- URL: https://arxiv.org/abs/2510.02307
- Abstract:
Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.
53. Diffusion Models and the Manifold Hypothesis: Log-Domain Smoothing is Geometry Adaptive
- Authors: Tyler Farghly , Peter Potaptchik , Samuel Howard , George Deligiannidis , Jakiw Pidstrigach
- URL: https://arxiv.org/abs/2510.02305
- Abstract:
Diffusion models have achieved state-of-the-art performance, demonstrating remarkable generalisation capabilities across diverse domains. However, the mechanisms underpinning these strong capabilities remain only partially understood. A leading conjecture, based on the manifold hypothesis, attributes this success to their ability to adapt to low-dimensional geometric structure within the data. This work provides evidence for this conjecture, focusing on how such phenomena could result from the formulation of the learning problem through score matching. We inspect the role of implicit regularisation by investigating the effect of smoothing minimisers of the empirical score matching objective. Our theoretical and empirical results confirm that smoothing the score function – or equivalently, smoothing in the log-density domain – produces smoothing tangential to the data manifold. In addition, we show that the manifold along which the diffusion model generalises can be controlled by choosing an appropriate smoothing.
54. Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models
- Authors: Runqian Wang , Yilun Du
- URL: https://arxiv.org/abs/2510.02300
- Abstract:
We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256$\times$256. EqM is also theoretically justified to learn and sample from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.
55. Interactive Training: Feedback-Driven Neural Network Optimization
- Authors: Wentao Zhang , Yang Young Lu , Yuntian Deng
- URL: https://arxiv.org/abs/2510.02297
- Abstract:
Traditional neural network training typically follows fixed, predefined optimization recipes, lacking the flexibility to dynamically respond to instabilities or emerging training issues. In this paper, we introduce Interactive Training, an open-source framework that enables real-time, feedback-driven intervention during neural network training by human experts or automated AI agents. At its core, Interactive Training uses a control server to mediate communication between users or agents and the ongoing training process, allowing users to dynamically adjust optimizer hyperparameters, training data, and model checkpoints. Through three case studies, we demonstrate that Interactive Training achieves superior training stability, reduced sensitivity to initial hyperparameters, and improved adaptability to evolving user needs, paving the way toward a future training paradigm where AI agents autonomously monitor training logs, proactively resolve instabilities, and optimize training dynamics.
56. VideoNSA: Native Sparse Attention Scales Video Understanding
- Authors: Enxin Song , Wenhao Chai , Shusheng Yang , Ethan Armand , Xiaojun Shan , Haiyang Xu , Jianwen Xie , Zhuowen Tu
- URL: https://arxiv.org/abs/2510.02295
- Abstract:
Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention help induce dynamic attention sinks.
57. F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
- Authors: Ziyin Zhang , Zihan Liao , Hang Yu , Peng Di , Rui Wang
- URL: https://arxiv.org/abs/2510.02294
- Abstract:
We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.
58. Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
- Authors: Ruohao Guo , Afshin Oroojlooy , Roshan Sridhar , Miguel Ballesteros , Alan Ritter , Dan Roth
- URL: https://arxiv.org/abs/2510.02286
- Abstract:
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
59. Learning to Generate Object Interactions with Physics-Guided Video Diffusion
- Authors: David Romero , Ariana Bermudez , Hao Li , Fabio Pizzati , Ivan Laptev
- URL: https://arxiv.org/abs/2510.02284
- Abstract:
Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
60. Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
- Authors: Justin Cui , Jie Wu , Ming Li , Tao Yang , Xiaojie Li , Rui Wang , Andrew Bai , Yuanhao Ban , Cho-Jui Hsieh
- URL: https://arxiv.org/abs/2510.02283
- Abstract:
Diffusion models have revolutionized image and video generation, achieving unprecedented visual quality. However, their reliance on transformer architectures incurs prohibitively high computational costs, particularly when extending generation to long videos. Recent work has explored autoregressive formulations for long video generation, typically by distilling from short-horizon bidirectional teachers. Nevertheless, given that teacher models cannot synthesize long videos, the extrapolation of student models beyond their training horizon often leads to pronounced quality degradation, arising from the compounding of errors within the continuous latent space. In this paper, we propose a simple yet effective approach to mitigate quality degradation in long-horizon video generation without requiring supervision from long-video teachers or retraining on long video datasets. Our approach centers on exploiting the rich knowledge of teacher models to provide guidance for the student model through sampled segments drawn from self-generated long videos. Our method maintains temporal consistency while scaling video length by up to 20x beyond teacher’s capability, avoiding common issues such as over-exposure and error-accumulation without recomputing overlapping frames like previous methods. When scaling up the computation, our method shows the capability of generating videos up to 4 minutes and 15 seconds, equivalent to 99.9% of the maximum span supported by our base model’s position embedding and more than 50x longer than that of our baseline model. Experiments on standard benchmarks and our proposed improved benchmark demonstrate that our approach substantially outperforms baseline methods in both fidelity and consistency. Our long-horizon videos demo can be found at this https URL
61. Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation
- Authors: Mykyta Ielanskyi , Kajetan Schweighofer , Lukas Aichberger , Sepp Hochreiter
- URL: https://arxiv.org/abs/2510.02279
- Abstract:
Hallucinations are a common issue that undermine the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise due to predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions have substantial disagreement between each other and, consequently, in the ranking of the uncertainty estimation methods. This allows one to inflate the apparent performance of uncertainty estimation methods. We propose using several alternative risk indicators for risk correlation experiments that improve robustness of empirical assessment of UE algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants leads to reducing the evaluation biases. Furthermore, we explore structured tasks as well as out of distribution and perturbation detection tasks which provide robust and controllable risk indicators. Finally, we propose to use an Elo rating of uncertainty estimation methods to give an objective summarization over extensive evaluation settings.
62. Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective
- Authors: Wen Yang , Junhong Wu , Chong Li , Chengqing Zong , Jiajun Zhang
- URL: https://arxiv.org/abs/2510.02272
- Abstract:
Recent advancements in Reinforcement Post-Training (RPT) have significantly enhanced the capabilities of Large Reasoning Models (LRMs), sparking increased interest in the generalization of RL-based reasoning. While existing work has primarily focused on investigating its generalization across tasks or modalities, this study proposes a novel cross-linguistic perspective to investigate reasoning generalization. This raises a crucial question: $\textit{Does the reasoning capability achieved from English RPT effectively transfer to other languages?}$ We address this by systematically evaluating English-centric LRMs on multilingual reasoning benchmarks and introducing a metric to quantify cross-lingual transferability. Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm. Through interventional studies, we find that models with stronger initial English capabilities tend to over-rely on English-specific patterns, leading to diminished cross-lingual generalization. To address this, we conduct a thorough parallel training study. Experimental results yield three key findings: $\textbf{First-Parallel Leap}$, a substantial leap in performance when transitioning from monolingual to just a single parallel language, and a predictable $\textbf{Parallel Scaling Law}$, revealing that cross-lingual reasoning transfer follows a power-law with the number of training parallel languages. Moreover, we identify the discrepancy between actual monolingual performance and the power-law prediction as $\textbf{Monolingual Generalization Gap}$, indicating that English-centric LRMs fail to fully generalize across languages. Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
63. InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
- Authors: Yaxin Du , Yuanshuo Zhang , Xiyuan Yang , Yifan Zhou , Cheng Wang , Gongyi Zou , Xianghe Pang , Wenhao Wang , Menglan Chen , Shuo Tang , Zhiyu Li , Siheng Chen
- URL: https://arxiv.org/abs/2510.02271
- Abstract:
Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools – and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
64. microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification
- Authors: Sathira Silva , Eman Ali , Chetan Arora , Muhammad Haris Khan
- URL: https://arxiv.org/abs/2510.02270
- Abstract:
Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP $\texttt{[CLS]}$ token; however, this approach overlooks spatial precision. We propose $\textbf{microCLIP}$, a self-training framework that jointly refines CLIP’s visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided $\texttt{[FG]}$ token from patch embeddings and fuses it with the global $\texttt{[CLS]}$ token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion’s evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent $2.90\%$ average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at this https URL .
65. How to Combat Reactive and Dynamic Jamming Attacks with Reinforcement Learning
- Authors: Yalin E. Sagduyu , Tugba Erpek , Kemal Davaslioglu , Sastry Kompella
- URL: https://arxiv.org/abs/2510.02265
- Abstract:
This paper studies the problem of mitigating reactive jamming, where a jammer adopts a dynamic policy of selecting channels and sensing thresholds to detect and jam ongoing transmissions. The transmitter-receiver pair learns to avoid jamming and optimize throughput over time (without prior knowledge of channel conditions or jamming strategies) by using reinforcement learning (RL) to adapt transmit power, modulation, and channel selection. Q-learning is employed for discrete jamming-event states, while Deep Q-Networks (DQN) are employed for continuous states based on received power. Through different reward functions and action sets, the results show that RL can adapt rapidly to spectrum dynamics and sustain high rates as channels and jamming policies change over time.
66. Paving the Way Towards Kinematic Assessment Using Monocular Video: A Preclinical Benchmark of State-of-the-Art Deep-Learning-Based 3D Human Pose Estimators Against Inertial Sensors in Daily Living Activities
- Authors: Mario Medrano-Paredes , Carmen Fernández-González , Francisco-Javier Díaz-Pernas , Hichem Saoudi , Javier González-Alonso , Mario Martínez-Zarzuela
- URL: https://arxiv.org/abs/2510.02264
- Abstract:
Advances in machine learning and wearable sensors offer new opportunities for capturing and analyzing human movement outside specialized laboratories. Accurate assessment of human movement under real-world conditions is essential for telemedicine, sports science, and rehabilitation. This preclinical benchmark compares monocular video-based 3D human pose estimation models with inertial measurement units (IMUs), leveraging the VIDIMU dataset containing a total of 13 clinically relevant daily activities which were captured using both commodity video cameras and five IMUs. During this initial study only healthy subjects were recorded, so results cannot be generalized to pathological cohorts. Joint angles derived from state-of-the-art deep learning frameworks (MotionAGFormer, MotionBERT, MMPose 2D-to-3D pose lifting, and NVIDIA BodyTrack) were evaluated against joint angles computed from IMU data using OpenSim inverse kinematics following the Human3.6M dataset format with 17 keypoints. Among them, MotionAGFormer demonstrated superior performance, achieving the lowest overall RMSE ($9.27°\pm 4.80°$) and MAE ($7.86°\pm 4.18°$), as well as the highest Pearson correlation ($0.86 \pm 0.15$) and the highest coefficient of determination $R^{2}$ ($0.67 \pm 0.28$). The results reveal that both technologies are viable for out-of-the-lab kinematic assessment. However, they also highlight key trade-offs between video- and sensor-based approaches including costs, accessibility, and precision. This study clarifies where off-the-shelf video models already provide clinically promising kinematics in healthy adults and where they lag behind IMU-based estimates while establishing valuable guidelines for researchers and clinicians seeking to develop robust, cost-effective, and user-friendly solutions for telehealth and remote patient monitoring.
67. DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
- Authors: Zihan Zhou , Shilin Lu , Shuli Leng , Shaocong Zhang , Zhuming Lian , Xinlei Yu , Adams Wai-Kin Kong
- URL: https://arxiv.org/abs/2510.02253
- Abstract:
Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX’s rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
68. Explore Briefly, Then Decide: Mitigating LLM Overthinking via Cumulative Entropy Regulation
- Authors: Tianyi Jiang , Yi Bin , Yujuan Ding , Kainian Zhu , Fei Ma , Jingkuan Song , Heng Tao Shen
- URL: https://arxiv.org/abs/2510.02249
- Abstract:
Large Language Models (LLMs) have demonstrated remarkable reasoning abilities on complex problems using long Chain-of-Thought (CoT) reasoning. However, they often suffer from overthinking, meaning generating unnecessarily lengthy reasoning steps for simpler problems. This issue may degrade the efficiency of the models and make them difficult to adapt the reasoning depth to the complexity of problems. To address this, we introduce a novel metric Token Entropy Cumulative Average (TECA), which measures the extent of exploration throughout the reasoning process. We further propose a novel reasoning paradigm – Explore Briefly, Then Decide – with an associated Cumulative Entropy Regulation (CER) mechanism. This paradigm leverages TECA to help the model dynamically determine the optimal point to conclude its thought process and provide a final answer, thus achieving efficient reasoning. Experimental results across diverse mathematical benchmarks show that our approach substantially mitigates overthinking without sacrificing problem-solving ability. With our thinking paradigm, the average response length decreases by up to 71% on simpler datasets, demonstrating the effectiveness of our method in creating a more efficient and adaptive reasoning process.
69. ExGRPO: Learning to Reason from Experience
- Authors: Runzhe Zhan , Yafu Li , Zhi Wang , Xiaoye Qu , Dongrui Liu , Jing Shao , Derek F. Wong , Yu Cheng
- URL: https://arxiv.org/abs/2510.02245
- Abstract:
Reinforcement learning from verifiable rewards (RLVR) is an emerging paradigm for improving the reasoning ability of large language models. However, standard on-policy training discards rollout experiences after a single update, leading to computational inefficiency and instability. While prior work on RL has highlighted the benefits of reusing past experience, the role of experience characteristics in shaping learning dynamics of large reasoning models remains underexplored. In this paper, we are the first to investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we propose ExGRPO (Experiential Group Relative Policy Optimization), a framework that organizes and prioritizes valuable experiences, and employs a mixed-policy objective to balance exploration with experience exploitation. Experiments on five backbone models (1.5B-8B parameters) show that ExGRPO consistently improves reasoning performance on mathematical/general benchmarks, with an average gain of +3.5/7.6 points over on-policy RLVR. Moreover, ExGRPO stabilizes training on both stronger and weaker models where on-policy methods fail. These results highlight principled experience management as a key ingredient for efficient and scalable RLVR.
70. RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
- Authors: Sicheng Feng , Kaiwen Tuo , Song Wang , Lingdong Kong , Jianke Zhu , Huan Wang
- URL: https://arxiv.org/abs/2510.02240
- Abstract:
Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse rewards while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
71. More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
- Authors: Xiaoyang Yuan , Yujuan Ding , Yi Bin , Wenqi Shao , Jinyu Cai , Jingkuan Song , Yang Yang , Hengtao Shen
- URL: https://arxiv.org/abs/2510.02227
- Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability in Large Language Models (LLMs). However, prevailing methods primarily rely on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce Adaptive Multi-Guidance Policy Optimization (AMPO), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This “guidance-on-demand” approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths that it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show AMPO substantially outperforms a strong baseline (GRPO), with a 4.3% improvement on mathematical reasoning tasks and 12.2% on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher (e.g., DeepSeek-R1) with more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. Our code is available at this https URL .
72. TempoControl: Temporal Attention Guidance for Text-to-Video Models
- Authors: Shira Schiber , Ofir Lindenbaum , Idan Schwartz
- URL: https://arxiv.org/abs/2510.02226
- Abstract:
Recent advances in generative video models have enabled the creation of high-quality videos based on natural language prompts. However, these models frequently lack fine-grained temporal control, meaning they do not allow users to specify when particular visual elements should appear within a generated sequence. In this work, we introduce TempoControl, a method that allows for temporal alignment of visual concepts during inference, without requiring retraining or additional supervision. TempoControl utilizes cross-attention maps, a key component of text-to-video diffusion models, to guide the timing of concepts through a novel optimization approach. Our method steers attention using three complementary principles: aligning its temporal shape with a control signal (via correlation), amplifying it where visibility is needed (via energy), and maintaining spatial focus (via entropy). TempoControl allows precise control over timing while ensuring high video quality and diversity. We demonstrate its effectiveness across various video generation applications, including temporal reordering for single and multiple objects, as well as action and audio-aligned generation.
73. DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
- Authors: Hanyang Zhao , Dawen Liang , Wenpin Tang , David Yao , Nathan Kallus
- URL: https://arxiv.org/abs/2510.02212
- Abstract:
We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster via reinforcement learning (RL). We first unify the existing baseline approach such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. This naturally motivates a more accurate and informative two-stage likelihood approximation combined with importance sampling correction, which leads to generalized RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction of joint training efficient samplers/controllers of dLLMs policy. Via RL, we incentivize dLLMs’ natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt. By jointly training the sampler, we yield better accuracies with lower number of function evaluations (NFEs) compared to training the model only, obtaining the best performance in improving the Pareto frontier of the inference-time compute of dLLMs. We showcase the effectiveness of our pipeline by training open source large diffusion language models over benchmark math and planning tasks.
74. Detection of Chagas Disease from the ECG: The George B. Moody PhysioNet Challenge 2025
- Authors: Matthew A. Reyna (1), Zuzana Koscova (1), Jan Pavlus (1), Soheil Saghafi (1), James Weigle (1), Andoni Elola (1,2), Salman Seyedi (1), Kiersten Campbell (1), Qiao Li (1), Ali Bahrami Rad (1), Antônio H. Ribeiro (3), Antonio Luiz P. Ribeiro (4,5), Reza Sameni (1,6), Gari D. Clifford (1,6) ((1) Department of Biomedical Informatics, Emory University, Atlanta, USA, (2) Department of Electronic Technology, University of the Basque Country UPV/EHU, Spain, (3) Department of Information Technology, Uppsala University, Uppsala, Sweden, (4) Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, (5) Telehealth Center from Hospital das Clinicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, (6) Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, USA)
- URL: https://arxiv.org/abs/2510.02202
- Abstract:
Objective: Chagas disease is a parasitic infection that is endemic to South America, Central America, and, more recently, the U.S., primarily transmitted by insects. Chronic Chagas disease can cause cardiovascular diseases and digestive problems. Serological testing capacities for Chagas disease are limited, but Chagas cardiomyopathy often manifests in ECGs, providing an opportunity to prioritize patients for testing and treatment. Approach: The George B. Moody PhysioNet Challenge 2025 invites teams to develop algorithmic approaches for identifying Chagas disease from electrocardiograms (ECGs). Main results: This Challenge provides multiple innovations. First, we leveraged several datasets with labels from patient reports and serological testing, provided a large dataset with weak labels and smaller datasets with strong labels. Second, we augmented the data to support model robustness and generalizability to unseen data sources. Third, we applied an evaluation metric that captured the local serological testing capacity for Chagas disease to frame the machine learning problem as a triage task. Significance: Over 630 participants from 111 teams submitted over 1300 entries during the Challenge, representing diverse approaches from academia and industry worldwide.
75. ARUQULA – An LLM based Text2SPARQL Approach using ReAct and Knowledge Graph Exploration Utilities
- Authors: Felix Brei , Lorenz Bühmann , Johannes Frey , Daniel Gerber , Lars-Peter Meyer , Claus Stadler , Kirill Bulert
- URL: https://arxiv.org/abs/2510.02200
- Abstract:
Interacting with knowledge graphs can be a daunting task for people without a background in computer science since the query language that is used (SPARQL) has a high barrier of entry. Large language models (LLMs) can lower that barrier by providing support in the form of Text2SPARQL translation. In this paper we introduce a generalized method based on SPINACH, an LLM backed agent that translates natural language questions to SPARQL queries not in a single shot, but as an iterative process of exploration and execution. We describe the overall architecture and reasoning behind our design decisions, and also conduct a thorough analysis of the agent behavior to gain insights into future areas for targeted improvements. This work was motivated by the Text2SPARQL challenge, a challenge that was held to facilitate improvements in the Text2SPARQL domain.
76. EvolveCaptions: Empowering DHH Users Through Real-Time Collaborative Captioning
- Authors: Liang-Yuan Wu , Dhruv Jain
- URL: https://arxiv.org/abs/2510.02181
- Abstract:
Automatic Speech Recognition (ASR) systems often fail to accurately transcribe speech from Deaf and Hard of Hearing (DHH) individuals, especially during real-time conversations. Existing personalization approaches typically require extensive pre-recorded data and place the burden of adaptation on the DHH speaker. We present EvolveCaptions, a real-time, collaborative ASR adaptation system that supports in-situ personalization with minimal effort. Hearing participants correct ASR errors during live conversations. Based on these corrections, the system generates short, phonetically targeted prompts for the DHH speaker to record, which are then used to fine-tune the ASR model. In a study with 12 DHH and six hearing participants, EvolveCaptions reduced Word Error Rate (WER) across all DHH users within one hour of use, using only five minutes of recording time on average. Participants described the system as intuitive, low-effort, and well-integrated into communication. These findings demonstrate the promise of collaborative, real-time ASR adaptation for more equitable communication.
77. GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning
- Authors: Silvia Sapora , Devon Hjelm , Alexander Toshev , Omar Attia , Bogdan Mazoure
- URL: https://arxiv.org/abs/2510.02180
- Abstract:
Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield “black-box” models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
78. Learning to Reason for Hallucination Span Detection
- Authors: Hsuan Su , Ting-Yao Hu , Hema Swetha Koppula , Kundan Krishna , Hadi Pouransari , Cheng-Yu Hsieh , Cem Koc , Joseph Yitan Cheng , Oncel Tuzel , Raviteja Vemulapalli
- URL: https://arxiv.org/abs/2510.02173
- Abstract:
Large language models (LLMs) often generate hallucinations – unsupported content that undermines reliability. While most prior works frame hallucination detection as a binary task, many real-world applications require identifying hallucinated spans, which is a multi-step decision making process. This naturally raises the question of whether explicit reasoning can help the complex task of detecting hallucination spans. To answer this question, we first evaluate pretrained models with and without Chain-of-Thought (CoT) reasoning, and show that CoT reasoning has the potential to generate at least one correct answer when sampled multiple times. Motivated by this, we propose RL4HS, a reinforcement learning framework that incentivizes reasoning with a span-level reward function. RL4HS builds on Group Relative Policy Optimization and introduces Class-Aware Policy Optimization to mitigate reward imbalance issue. Experiments on the RAGTruth benchmark (summarization, question answering, data-to-text) show that RL4HS surpasses pretrained reasoning models and supervised fine-tuning, demonstrating the necessity of reinforcement learning with span-level rewards for detecting hallucination spans.
79. Go witheFlow: Real-time Emotion Driven Audio Effects Modulation
- Authors: Edmund Dervakos , Spyridon Kantarelis , Vassilis Lyberatos , Jason Liartis , Giorgos Stamou
- URL: https://arxiv.org/abs/2510.02171
- Abstract:
Music performance is a distinctly human activity, intrinsically linked to the performer’s ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.
80. SIEVE: Towards Verifiable Certification for Code-datasets
- Authors: Fatou Ndiaye Mbodji , El-hacen Diallo , Jordan Samhi , Kui Liu , Jacques Klein , Tegawendé F. Bissyande
- URL: https://arxiv.org/abs/2510.02166
- Abstract:
Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static ‘dataset cards’ inform, but they are neither auditable nor do they offer statistical guarantees, making it difficult to attest to dataset quality. Teams build isolated, ad-hoc cleaning pipelines. This fragments effort and raises cost. We present SIEVE, a community-driven framework. It turns per-property checks into Confidence Cards-machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code-datasets.
81. Comparing Contrastive and Triplet Loss in Audio-Visual Embedding: Intra-Class Variance and Greediness Analysis
- Authors: Donghuo Zeng
- URL: https://arxiv.org/abs/2510.02161
- Abstract:
Contrastive loss and triplet loss are widely used objectives in deep metric learning, yet their effects on representation quality remain insufficiently understood. We present a theoretical and empirical comparison of these losses, focusing on intra- and inter-class variance and optimization behavior (e.g., greedy updates). Through task-specific experiments with consistent settings on synthetic data and real datasets-MNIST, CIFAR-10-it is shown that triplet loss preserves greater variance within and across classes, supporting finer-grained distinctions in the learned representations. In contrast, contrastive loss tends to compact intra-class embeddings, which may obscure subtle semantic differences. To better understand their optimization dynamics, By examining loss-decay rate, active ratio, and gradient norm, we find that contrastive loss drives many small updates early on, while triplet loss produces fewer but stronger updates that sustain learning on hard examples. Finally, across both classification and retrieval tasks on MNIST, CIFAR-10, CUB-200, and CARS196 datasets, our results consistently show that triplet loss yields superior performance, which suggests using triplet loss for detail retention and hard-sample focus, and contrastive loss for smoother, broad-based embedding refinement.
82. Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting
- Authors: Shu Zou , Xinyu Tian , Lukas Wesemann , Fabian Waschkowski , Zhaoyuan Yang , Jing Zhang
- URL: https://arxiv.org/abs/2510.02155
- Abstract:
Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.
83. Human-Robo-advisor collaboration in decision-making: Evidence from a multiphase mixed methods experimental study
- Authors: Hasan Mahmuda , Najmul Islam , Satish Krishnan
- URL: https://arxiv.org/abs/2510.02153
- Abstract:
Robo-advisors (RAs) are cost-effective, bias-resistant alternatives to human financial advisors, yet adoption remains limited. While prior research has examined user interactions with RAs, less is known about how individuals interpret RA roles and integrate their advice into decision-making. To address this gap, this study employs a multiphase mixed methods design integrating a behavioral experiment (N = 334), thematic analysis, and follow-up quantitative testing. Findings suggest that people tend to rely on RAs, with reliance shaped by information about RA performance and the framing of advice as gains or losses. Thematic analysis reveals three RA roles in decision-making and four user types, each reflecting distinct patterns of advice integration. In addition, a 2 x 2 typology categorizes antecedents of acceptance into enablers and inhibitors at both the individual and algorithmic levels. By combining behavioral, interpretive, and confirmatory evidence, this study advances understanding of human-RA collaboration and provides actionable insights for designing more trustworthy and adaptive RA systems.
84. How to Find Fantastic Papers: Self-Rankings as a Powerful Predictor of Scientific Impact Beyond Peer Review
- Authors: Buxin Su , Natalie Collina , Garrett Wen , Didong Li , Kyunghyun Cho , Jianqing Fan , Bingxin Zhao , Weijie Su
- URL: https://arxiv.org/abs/2510.02143
- Abstract:
Peer review in academic research aims not only to ensure factual correctness but also to identify work of high scientific potential that can shape future research directions. This task is especially critical in fast-moving fields such as artificial intelligence (AI), yet it has become increasingly difficult given the rapid growth of submissions. In this paper, we investigate an underexplored measure for identifying high-impact research: authors’ own rankings of their multiple submissions to the same AI conference. Grounded in game-theoretic reasoning, we hypothesize that self-rankings are informative because authors possess unique understanding of their work’s conceptual depth and long-term promise. To test this hypothesis, we conducted a large-scale experiment at a leading AI conference, where 1,342 researchers self-ranked their 2,592 submissions by perceived quality. Tracking outcomes over more than a year, we found that papers ranked highest by their authors received twice as many citations as their lowest-ranked counterparts; self-rankings were especially effective at identifying highly cited papers (those with over 150 citations). Moreover, we showed that self-rankings outperformed peer review scores in predicting future citation counts. Our results remained robust after accounting for confounders such as preprint posting time and self-citations. Together, these findings demonstrate that authors’ self-rankings provide a reliable and valuable complement to peer review for identifying and elevating high-impact research in AI.
85. BioinfoMCP: A Unified Platform Enabling MCP Interfaces in Agentic Bioinformatics
- Authors: Florensia Widjaja , Zhangtianyi Chen , Juexiao Zhou
- URL: https://arxiv.org/abs/2510.02139
- Abstract:
Bioinformatics tools are essential for complex computational biology tasks, yet their integration with emerging AI-agent frameworks is hindered by incompatible interfaces, heterogeneous input-output formats, and inconsistent parameter conventions. The Model Context Protocol (MCP) provides a standardized framework for tool-AI communication, but manually converting hundreds of existing and rapidly growing specialized bioinformatics tools into MCP-compliant servers is labor-intensive and unsustainable. Here, we present BioinfoMCP, a unified platform comprising two components: BioinfoMCP Converter, which automatically generates robust MCP servers from tool documentation using large language models, and BioinfoMCP Benchmark, which systematically validates the reliability and versatility of converted tools across diverse computational tasks. We present a platform of 38 MCP-converted bioinformatics tools, extensively validated to show that 94.7% successfully executed complex workflows across three widely used AI-agent platforms. By removing technical barriers to AI automation, BioinfoMCP enables natural-language interaction with sophisticated bioinformatics analyses without requiring extensive programming expertise, offering a scalable path to intelligent, interoperable computational biology.
86. The Disparate Impacts of Speculative Decoding
- Authors: Jameson Sandler , Ahmet Üstün , Marco Romanelli , Sara Hooker , Ferdinando Fioretto
- URL: https://arxiv.org/abs/2510.02128
- Abstract:
The practice of speculative decoding, whereby inference is probabilistically supported by a smaller, cheaper,
drafter'' model, has become a standard technique for systematically reducing the decoding time of large language models. This paper conducts an analysis of speculative decoding through the lens of its potential disparate speed-up rates across tasks. Crucially, the paper shows that speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing for under-fit, and often underrepresented tasks. To better understand this phenomenon, we derive an analysis to quantify this observedunfairness’’ and draw attention to the factors that motivate such disparate speed-ups to emerge. Further, guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, revealing on average a 12% improvement in our fairness metric.
87. VarCoNet: A variability-aware self-supervised framework for functional connectome extraction from resting-state fMRI
- Authors: Charalampos Lamprou , Aamna Alshehhi , Leontios J. Hadjileontiadis , Mohamed L. Seghier
- URL: https://arxiv.org/abs/2510.02120
- Abstract:
Accounting for inter-individual variability in brain function is key to precision medicine. Here, by considering functional inter-individual variability as meaningful data rather than noise, we introduce VarCoNet, an enhanced self-supervised framework for robust functional connectome (FC) extraction from resting-state fMRI (rs-fMRI) data. VarCoNet employs self-supervised contrastive learning to exploit inherent functional inter-individual variability, serving as a brain function encoder that generates FC embeddings readily applicable to downstream tasks even in the absence of labeled data. Contrastive learning is facilitated by a novel augmentation strategy based on segmenting rs-fMRI signals. At its core, VarCoNet integrates a 1D-CNN-Transformer encoder for advanced time-series processing, enhanced with a robust Bayesian hyperparameter optimization. Our VarCoNet framework is evaluated on two downstream tasks: (i) subject fingerprinting, using rs-fMRI data from the Human Connectome Project, and (ii) autism spectrum disorder (ASD) classification, using rs-fMRI data from the ABIDE I and ABIDE II datasets. Using different brain parcellations, our extensive testing against state-of-the-art methods, including 13 deep learning methods, demonstrates VarCoNet’s superiority, robustness, interpretability, and generalizability. Overall, VarCoNet provides a versatile and robust framework for FC analysis in rs-fMRI.
88. SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification
- Authors: Jong Bum Won , Wesley De Neve , Joris Vankerschaver , Utku Ozbulak
- URL: https://arxiv.org/abs/2510.02109
- Abstract:
Deep neural networks (DNNs) have demonstrated remarkable success in medical imaging, yet their real-world deployment remains challenging due to spurious correlations, where models can learn non-clinical features instead of meaningful medical patterns. Existing medical imaging datasets are not designed to systematically study this issue, largely due to restrictive licensing and limited supplementary patient data. To address this gap, we introduce SpurBreast, a curated breast MRI dataset that intentionally incorporates spurious correlations to evaluate their impact on model performance. Analyzing over 100 features involving patient, device, and imaging protocol, we identify two dominant spurious signals: magnetic field strength (a global feature influencing the entire image) and image orientation (a local feature affecting spatial alignment). Through controlled dataset splits, we demonstrate that DNNs can exploit these non-clinical signals, achieving high validation accuracy while failing to generalize to unbiased test data. Alongside these two datasets containing spurious correlations, we also provide benchmark datasets without spurious correlations, allowing researchers to systematically investigate clinically relevant and irrelevant features, uncertainty estimation, adversarial robustness, and generalization strategies. Models and datasets are available at this https URL .
89. Unlocking Symbol-Level Precoding Efficiency Through Tensor Equivariant Neural Network
- Authors: Jinshuo Zhang , Yafei Wang , Xinping Yi , Wenjin Wang , Shi Jin , Symeon Chatzinotas , Björn Ottersten
- URL: https://arxiv.org/abs/2510.02108
- Abstract:
Although symbol-level precoding (SLP) based on constructive interference (CI) exploitation offers performance gains, its high complexity remains a bottleneck. This paper addresses this challenge with an end-to-end deep learning (DL) framework with low inference complexity that leverages the structure of the optimal SLP solution in the closed-form and its inherent tensor equivariance (TE), where TE denotes that a permutation of the input induces the corresponding permutation of the output. Building upon the computationally efficient model-based formulations, as well as their known closed-form solutions, we analyze their relationship with linear precoding (LP) and investigate the corresponding optimality condition. We then construct a mapping from the problem formulation to the solution and prove its TE, based on which the designed networks reveal a specific parameter-sharing pattern that delivers low computational complexity and strong generalization. Leveraging these, we propose the backbone of the framework with an attention-based TE module, achieving linear computational complexity. Furthermore, we demonstrate that such a framework is also applicable to imperfect CSI scenarios, where we design a TE-based network to map the CSI, statistics, and symbols to auxiliary variables. Simulation results show that the proposed framework captures substantial performance gains of optimal SLP, while achieving an approximately 80-times speedup over conventional methods and maintaining strong generalization across user numbers and symbol block lengths.
90. When Tracking Fails: Analyzing Failure Modes of SAM2 for Point-Based Tracking in Surgical Videos
- Authors: Woowon Jang , Jiwon Im , Juseung Choi , Niki Rashidian , Wesley De Neve , Utku Ozbulak
- URL: https://arxiv.org/abs/2510.02100
- Abstract:
Video object segmentation (VOS) models such as SAM2 offer promising zero-shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point-based tracking offers an efficient and low-cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point-based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L-hook electrocautery, we compare the performance of point-based tracking with segmentation mask initialization. Our results show that point-based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.
91. KAIROS: Unified Training for Universal Non-Autoregressive Time Series Forecasting
- Authors: Kuiye Ding , Fanda Fan , Zheya Wang , Hongxiao Li , Yifan Wang , Lei Wang , Chunjie Luo , Jianfeng Zhan
- URL: https://arxiv.org/abs/2510.02084
- Abstract:
In the World Wide Web, reliable time series forecasts provide the forward-looking signals that drive resource planning, cache placement, and anomaly response, enabling platforms to operate efficiently as user behavior and content distributions evolve. Compared with other domains, time series forecasting for Web applications requires much faster responsiveness to support real-time decision making. We present KAIROS, a non-autoregressive time series forecasting framework that directly models segment-level multi-peak distributions. Unlike autoregressive approaches, KAIROS avoids error accumulation and achieves just-in-time inference, while improving over existing non-autoregressive models that collapse to over-smoothed predictions. Trained on the large-scale corpus, KAIROS demonstrates strong zero-shot generalization on six widely used benchmarks, delivering forecasting performance comparable to state-of-the-art foundation models with similar scale, at a fraction of their inference cost. Beyond empirical results, KAIROS highlights the importance of non-autoregressive design as a scalable paradigm for foundation models in time series.
92. The Current State of AI Bias Bounties: An Overview of Existing Programmes and Research
- Authors: Sergej Kucenko , Nathaniel Dennler , Fengxiang He
- URL: https://arxiv.org/abs/2510.02036
- Abstract:
Current bias evaluation methods rarely engage with communities impacted by AI systems. Inspired by bug bounties, bias bounties have been proposed as a reward-based method that involves communities in AI bias detection by asking users of AI systems to report biases they encounter when interacting with such systems. In the absence of a state-of-the-art review, this survey aimed to identify and analyse existing AI bias bounty programmes and to present academic literature on bias bounties. Google, Google Scholar, PhilPapers, and IEEE Xplore were searched, and five bias bounty programmes, as well as five research publications, were identified. All bias bounties were organised by U.S.-based organisations as time-limited contests, with public participation in four programmes and prize pools ranging from 7,000 to 24,000 USD. The five research publications included a report on the application of bug bounties to algorithmic harms, an article addressing Twitter’s bias bounty, a proposal for bias bounties as an institutional mechanism to increase AI scrutiny, a workshop discussing bias bounties from queer perspectives, and an algorithmic framework for bias bounties. We argue that reducing the technical requirements to enter bounty programmes is important to include those without coding experience. Given the limited adoption of bias bounties, future efforts should explore the transferability of the best practices from bug bounties and examine how such programmes can be designed to be sensitive to underrepresented groups while lowering adoption barriers for organisations.
93. LiLa-Net: Lightweight Latent LiDAR Autoencoder for 3D Point Cloud Reconstruction
- Authors: Mario Resino , Borja Pérez , Jaime Godoy , Abdulla Al-Kaff , Fernando García
- URL: https://arxiv.org/abs/2510.02028
- Abstract:
This work proposed a 3D autoencoder architecture, named LiLa-Net, which encodes efficient features from real traffic environments, employing only the LiDAR’s point clouds. For this purpose, we have real semi-autonomous vehicle, equipped with Velodyne LiDAR. The system leverage skip connections concept to improve the performance without using extensive resources as the state-of-the-art architectures. Key changes include reducing the number of encoder layers and simplifying the skip connections, while still producing an efficient and representative latent space which allows to accurately reconstruct the original point cloud. Furthermore, an effective balance has been achieved between the information carried by the skip connections and the latent encoding, leading to improved reconstruction quality without compromising performance. Finally, the model demonstrates strong generalization capabilities, successfully reconstructing objects unrelated to the original traffic environment.
94. Generating Findings for Jaw Cysts in Dental Panoramic Radiographs Using GPT-4o: Building a Two-Stage Self-Correction Loop with Structured Output (SLSO) Framework
- Authors: Nanaka Hosokawa , Ryo Takahashi , Tomoya Kitano , Yukihiro Iida , Chisako Muramatsu , Tatsuro Hayashi , Yuta Seino , Xiangrong Zhou , Takeshi Hara , Akitoshi Katsumata , Hiroshi Fujita
- URL: https://arxiv.org/abs/2510.02001
- Abstract:
In this study, we utilized the multimodal capabilities of OpenAI GPT-4o to automatically generate jaw cyst findings on dental panoramic radiographs. To improve accuracy, we constructed a Self-correction Loop with Structured Output (SLSO) framework and verified its effectiveness. A 10-step process was implemented for 22 cases of jaw cysts, including image input and analysis, structured data generation, tooth number extraction and consistency checking, iterative regeneration when inconsistencies were detected, and finding generation with subsequent restructuring and consistency verification. A comparative experiment was conducted using the conventional Chain-of-Thought (CoT) method across seven evaluation items: transparency, internal structure, borders, root resorption, tooth movement, relationships with other structures, and tooth number. The results showed that the proposed SLSO framework improved output accuracy for many items, with 66.9%, 33.3%, and 28.6% improvement rates for tooth number, tooth movement, and root resorption, respectively. In the successful cases, a consistently structured output was achieved after up to five regenerations. Although statistical significance was not reached because of the small size of the dataset, the overall SLSO framework enforced negative finding descriptions, suppressed hallucinations, and improved tooth number identification accuracy. However, the accurate identification of extensive lesions spanning multiple teeth is limited. Nevertheless, further refinement is required to enhance overall performance and move toward a practical finding generation system.
95. Clarifying Semantics of In-Context Examples for Unit Test Generation
- Authors: Chen Yang , Lin Yang , Ziqi Wang , Dong Wang , Jianyi Zhou , Junjie Chen
- URL: https://arxiv.org/abs/2510.01994
- Abstract:
Recent advances in large language models (LLMs) have enabled promising performance in unit test generation through in-context learning (ICL). However, the quality of in-context examples significantly influences the effectiveness of generated tests-poorly structured or semantically unclear test examples often lead to suboptimal outputs. In this paper, we propose CLAST, a novel technique that systematically refines unit tests to improve their semantic clarity, thereby enhancing their utility as in-context examples. The approach decomposes complex tests into logically clearer ones and improves semantic clarity through a combination of program analysis and LLM-based rewriting. We evaluated CLAST on four open-source and three industrial projects. The results demonstrate that CLAST largely outperforms UTgen, the state-of-the-art refinement technique, in both preserving test effectiveness and enhancing semantic clarity. Specifically, CLAST fully retains the original effectiveness of unit tests, while UTgen reduces compilation success rate (CSR), pass rate (PR), test coverage (Cov), and mutation score (MS) by an average of 12.90%, 35.82%, 4.65%, and 5.07%, respectively. Over 85.33% of participants in our user study preferred the semantic clarity of CLAST-refined tests. Notably, incorporating CLAST-refined tests as examples effectively improves ICL-based unit test generation approaches such as RAGGen and TELPA, resulting in an average increase of 25.97% in CSR, 28.22% in PR, and 45.99% in Cov for generated tests, compared to incorporating UTgen-refined tests. The insights from the follow-up user study not only reinforce CLAST’s potential impact in software testing practice but also illuminate avenues for future research.
96. ZK-WAGON: Imperceptible Watermark for Image Generation Models using ZK-SNARKs
- Authors: Aadarsh Anantha Ramakrishnan , Shubham Agarwal , Selvanayagam S , Kunwar Singh
- URL: https://arxiv.org/abs/2510.01967
- Abstract:
As image generation models grow increasingly powerful and accessible, concerns around authenticity, ownership, and misuse of synthetic media have become critical. The ability to generate lifelike images indistinguishable from real ones introduces risks such as misinformation, deepfakes, and intellectual property violations. Traditional watermarking methods either degrade image quality, are easily removed, or require access to confidential model internals - making them unsuitable for secure and scalable deployment. We are the first to introduce ZK-WAGON, a novel system for watermarking image generation models using the Zero-Knowledge Succinct Non Interactive Argument of Knowledge (ZK-SNARKs). Our approach enables verifiable proof of origin without exposing model weights, generation prompts, or any sensitive internal information. We propose Selective Layer ZK-Circuit Creation (SL-ZKCC), a method to selectively convert key layers of an image generation model into a circuit, reducing proof generation time significantly. Generated ZK-SNARK proofs are imperceptibly embedded into a generated image via Least Significant Bit (LSB) steganography. We demonstrate this system on both GAN and Diffusion models, providing a secure, model-agnostic pipeline for trustworthy AI image generation.
97. Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement
- Authors: Nikolai Lund Kühne , Jesper Jensen , Jan Østergaard , Zheng-Hua Tan
- URL: https://arxiv.org/abs/2510.01958
- Abstract:
Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.
98. Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors
- Authors: Guangyao Zhai , Yue Zhou , Xinyan Deng , Lars Heckler , Nassir Navab , Benjamin Busam
- URL: https://arxiv.org/abs/2510.01934
- Abstract:
Few-shot anomaly detection streamlines and simplifies industrial safety inspection. However, limited samples make accurate differentiation between normal and abnormal features challenging, and even more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the anomaly amount in an image directly correlates with the difference in the learnt embeddings and utilize this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for anomaly detection to characterize and identify out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance while using substantially fewer parameters than prior methods. Backed up by evaluations with multiple foundation encoders, including fresh DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection.
99. Automated Defect Detection for Mass-Produced Electronic Components Based on YOLO Object Detection Models
- Authors: Wei-Lung Mao , Chun-Chi Wang , Po-Heng Chou , Yen-Ting Liu
- URL: https://arxiv.org/abs/2510.01914
- Abstract:
Since the defect detection of conventional industry components is time-consuming and labor-intensive, it leads to a significant burden on quality inspection personnel and makes it difficult to manage product quality. In this paper, we propose an automated defect detection system for the dual in-line package (DIP) that is widely used in industry, using digital camera optics and a deep learning (DL)-based model. The two most common defect categories of DIP are examined: (1) surface defects, and (2) pin-leg defects. However, the lack of defective component images leads to a challenge for detection tasks. To solve this problem, the ConSinGAN is used to generate a suitable-sized dataset for training and testing. Four varieties of the YOLO model are investigated (v3, v4, v7, and v9), both in isolation and with the ConSinGAN augmentation. The proposed YOLOv7 with ConSinGAN is superior to the other YOLO versions in accuracy of 95.50\%, detection time of 285 ms, and is far superior to threshold-based approaches. In addition, the supervisory control and data acquisition (SCADA) system is developed, and the associated sensor architecture is described. The proposed automated defect detection can be easily established with numerous types of defects or insufficient defect data.
100. Are LLMs Better GNN Helpers? Rethinking Robust Graph Learning under Deficiencies with Iterative Refinement
- Authors: Zhaoyan Wang , Zheng Gao , Arogya Kharel , In-Young Ko
- URL: https://arxiv.org/abs/2510.01910
- Abstract:
Graph Neural Networks (GNNs) are widely adopted in Web-related applications, serving as a core technique for learning from graph-structured data, such as text-attributed graphs. Yet in real-world scenarios, such graphs exhibit deficiencies that substantially undermine GNN performance. While prior GNN-based augmentation studies have explored robustness against individual imperfections, a systematic understanding of how graph-native and Large Language Models (LLMs) enhanced methods behave under compound deficiencies is still missing. Specifically, there has been no comprehensive investigation comparing conventional approaches and recent LLM-on-graph frameworks, leaving their merits unclear. To fill this gap, we conduct the first empirical study that benchmarks these two lines of methods across diverse graph deficiencies, revealing overlooked vulnerabilities and challenging the assumption that LLM augmentation is consistently superior. Building on empirical findings, we propose Robust Graph Learning via Retrieval-Augmented Contrastive Refinement (RoGRAD) framework. Unlike prior one-shot LLM-as-Enhancer designs, RoGRAD is the first iterative paradigm that leverages Retrieval-Augmented Generation (RAG) to inject retrieval-grounded augmentations by supplying class-consistent, diverse augmentations and enforcing discriminative representations through iterative graph contrastive learning. It transforms LLM augmentation for graphs from static signal injection into dynamic refinement. Extensive experiments demonstrate RoGRAD’s superiority over both conventional GNN- and LLM-enhanced baselines, achieving up to 82.43% average improvement.
101. Multimodal Foundation Models for Early Disease Detection
- Authors: Md Talha Mohsin , Ismail Abdulrashid
- URL: https://arxiv.org/abs/2510.01899
- Abstract:
Healthcare generates diverse streams of data, including electronic health records (EHR), medical imaging, genetics, and ongoing monitoring from wearable devices. Traditional diagnostic models frequently analyze these sources in isolation, which constrains their capacity to identify cross-modal correlations essential for early disease diagnosis. Our research presents a multimodal foundation model that consolidates diverse patient data through an attention-based transformer framework. At first, dedicated encoders put each modality into a shared latent space. Then, they combine them using multi-head attention and residual normalization. The architecture is made for pretraining on many tasks, which makes it easy to adapt to new diseases and datasets with little extra work. We provide an experimental strategy that uses benchmark datasets in oncology, cardiology, and neurology, with the goal of testing early detection tasks. The framework includes data governance and model management tools in addition to technological performance to improve transparency, reliability, and clinical interpretability. The suggested method works toward a single foundation model for precision diagnostics, which could improve the accuracy of predictions and help doctors make decisions.
102. HRTFformer: A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
- Authors: Xuyi Hu , Jian Li , Shaojie Zhang , Stefan Goetz , Lorenzo Picinali , Ozgur B. Akan , Aidan O. T. Hogg
- URL: https://arxiv.org/abs/2510.01891
- Abstract:
Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
103. Small is Sufficient: Reducing the World AI Energy Consumption Through Model Selection
- Authors: Tiago da Silva Barros , Frédéric Giroire , Ramon Aparicio-Pardo , Joanna Moulierac
- URL: https://arxiv.org/abs/2510.01889
- Abstract:
The energy consumption and carbon footprint of Artificial Intelligence (AI) have become critical concerns due to rising costs and environmental impacts. In response, a new trend in green AI is emerging, shifting from the “bigger is better” paradigm, which prioritizes large models, to “small is sufficient”, emphasizing energy sobriety through smaller, more efficient models. We explore how the AI community can adopt energy sobriety today by focusing on model selection during inference. Model selection consists of choosing the most appropriate model for a given task, a simple and readily applicable method, unlike approaches requiring new hardware or architectures. Our hypothesis is that, as in many industrial activities, marginal utility gains decrease with increasing model size. Thus, applying model selection can significantly reduce energy consumption while maintaining good utility for AI inference. We conduct a systematic study of AI tasks, analyzing their popularity, model size, and efficiency. We examine how the maturity of different tasks and model adoption patterns impact the achievable energy savings, ranging from 1% to 98% for different tasks. Our estimates indicate that applying model selection could reduce AI energy consumption by 27.8%, saving 31.9 TWh worldwide in 2025 - equivalent to the annual output of five nuclear power reactors.
104. FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling
- Authors: Avinash Kumar Singh , Bhaskarjit Sarmah , Stefano Pasquali
- URL: https://arxiv.org/abs/2510.01887
- Abstract:
Text-to-SQL, the task of translating natural language questions into SQL queries, has long been a central challenge in NLP. While progress has been significant, applying it to the financial domain remains especially difficult due to complex schema, domain-specific terminology, and high stakes of error. Despite this, there is no dedicated large-scale financial dataset to advance research, creating a critical gap. To address this, we introduce a curated financial dataset (FINCH) comprising 292 tables and 75,725 natural language-SQL pairs, enabling both fine-tuning and rigorous evaluation. Building on this resource, we benchmark reasoning models and language models of varying scales, providing a systematic analysis of their strengths and limitations in financial Text-to-SQL tasks. Finally, we propose a finance-oriented evaluation metric (FINCH Score) that captures nuances overlooked by existing measures, offering a more faithful assessment of model performance.
105. REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration
- Authors: Yisu Wang , Ming Wang , Haoyuan Song , Wenjie Huang , Chaozheng Wang , Yi Xie , Xuming Ran
- URL: https://arxiv.org/abs/2510.01879
- Abstract:
Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.
106. TACOS: Task Agnostic COordinator of a multi-drone System
- Authors: Alessandro Nazzari , Roberto Rubinacci , Marco Lovera
- URL: https://arxiv.org/abs/2510.01869
- Abstract:
When a single pilot is responsible for managing a multi-drone system, the task demands varying levels of autonomy, from direct control of individual UAVs, to group-level coordination, to fully autonomous swarm behaviors for accomplishing high-level tasks. Enabling such flexible interaction requires a framework that supports multiple modes of shared autonomy. As language models continue to improve in reasoning and planning, they provide a natural foundation for such systems, reducing pilot workload by enabling high-level task delegation through intuitive, language-based interfaces. In this paper we present TACOS (Task-Agnostic COordinator of a multi-drone System), a unified framework that enables high-level natural language control of multi-UAV systems through Large Language Models (LLMs). TACOS integrates three key capabilities into a single architecture: a one-to-many natural language interface for intuitive user interaction, an intelligent coordinator for translating user intent into structured task plans, and an autonomous agent that executes plans interacting with the real-world. TACOS allows a LLM to interact with a library of executable APIs, bridging semantic reasoning with real-time multi-robot coordination. We demonstrate the system in real-world multi-drone system and conduct an ablation study to assess the contribution of each module.
107. A Modular Theory of Subjective Consciousness for Natural and Artificial Minds
- Authors: Michaël Gillon
- URL: https://arxiv.org/abs/2510.01864
- Abstract:
Understanding how subjective experience arises from information processing remains a central challenge in neuroscience, cognitive science, and AI research. The Modular Consciousness Theory (MCT) proposes a biologically grounded and computationally explicit framework in which consciousness is a discrete sequence of Integrated Informational States (IISs). Each IIS is a packet of integrated information tagged with a multidimensional density vector that quantifies informational richness. Its magnitude correlates with subjective intensity, shaping memory, behavior, and continuity of experience. Inputs from body and environment are adaptively filtered, processed by modules (abstraction, narration, evaluation, self-evaluation), and integrated into an IIS. The resulting packet, tagged with its density vector, is transmitted to behavioral readiness, memory, and decision-making modules, closing the loop. This explains why strongly tagged states exert greater influence on long-term memory and action. Unlike Global Workspace Theory, Integrated Information Theory, or Higher-Order Thought, MCT specifies a full computational pipeline producing discrete informational units with quantifiable internal structure. Subjectivity is reframed as a correlate of the density-tagging signal with functional consequences. MCT generates testable predictions, such as stress enhancing memory encoding, and provides a naturalistic blueprint for both biological and artificial architectures. Consciousness, in this view, is not an irreducible essence but an evolvable, quantifiable, and constructible feature of complex information processing.
108. NGGAN: Noise Generation GAN Based on the Practical Measurement Dataset for Narrowband Powerline Communications
- Authors: Ying-Ren Chien , Po-Heng Chou , You-Jie Peng , Chun-Yuan Huang , Hen-Wai Tsao , Yu Tsao
- URL: https://arxiv.org/abs/2510.01850
- Abstract:
Capturing comprehensive statistics of nonperiodic asynchronous impulsive noise is a critical issue in enhancing impulse noise processing for narrowband powerline communication (NB-PLC) transceivers. However, existing mathematical noise generative models capture only some of the characteristics of additive noise. Therefore, we propose a generative adversarial network (GAN), called the noise-generation GAN (NGGAN), that learns the complicated characteristics of practically measured noise samples for data augmentation. To closely match the statistics of complicated noise in NB-PLC systems, we measured the NB-PLC noise via the analog coupling and bandpass filtering circuits of a commercial NB-PLC modem to build a realistic dataset. Specifically, the NGGAN design approaches based on the practically measured dataset are as follows: (i) we design the length of input signals that the NGGAN model can fit to facilitate cyclo-stationary noise generation. (ii) Wasserstein distance is used as a loss function to enhance the similarity between the generated noise and the training dataset and ensure that the sample diversity is sufficient for various applications. (iii) To measure the similarity performance of the GAN-based models based on mathematical and practically measured datasets, we perform quantitative and qualitative analyses. The training datasets include (1) a piecewise spectral cyclo-stationary Gaussian model (PSCGM), (2) a frequency-shift (FRESH) filter, and (3) practical measurements from NB-PLC systems. Simulation results demonstrate that the proposed NGGAN trained using waveform characteristics is closer to the practically measured dataset in terms of the quality of the generated noise.
109. Pre-Hoc Predictions in AutoML: Leveraging LLMs to Enhance Model Selection and Benchmarking for Tabular datasets
- Authors: Yannis Belkhiter , Seshu Tirupathi , Giulio Zizzo , Sachin Sharma , John D. Kelleher
- URL: https://arxiv.org/abs/2510.01842
- Abstract:
The field of AutoML has made remarkable progress in post-hoc model selection, with libraries capable of automatically identifying the most performing models for a given dataset. Nevertheless, these methods often rely on exhaustive hyperparameter searches, where methods automatically train and test different types of models on the target dataset. Contrastingly, pre-hoc prediction emerges as a promising alternative, capable of bypassing exhaustive search through intelligent pre-selection of models. Despite its potential, pre-hoc prediction remains under-explored in the literature. This paper explores the intersection of AutoML and pre-hoc model selection by leveraging traditional models and Large Language Model (LLM) agents to reduce the search space of AutoML libraries. By relying on dataset descriptions and statistical information, we reduce the AutoML search space. Our methodology is applied to the AWS AutoGluon portfolio dataset, a state-of-the-art AutoML benchmark containing 175 tabular classification datasets available on OpenML. The proposed approach offers a shift in AutoML workflows, significantly reducing computational overhead, while still selecting the best model for the given dataset.
110. SingMOS-Pro: An Comprehensive Benchmark for Singing Quality Assessment
- Authors: Yuxun Tang , Lan Liu , Wenhao Feng , Yiwen Zhao , Jionghao Han , Yifeng Yu , Jiatong Shi , Qin Jin
- URL: https://arxiv.org/abs/2510.01812
- Abstract:
Singing voice generation progresses rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro expands annotations of the additional part to include lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning from early systems to recent advances. Each clip receives at least five ratings from professional annotators, ensuring reliability and consistency. Furthermore, we explore how to effectively utilize MOS data annotated under different standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset can be accessed at this https URL .
111. Rethinking the shape convention of an MLP
- Authors: Meng-Hsi Chen , Yu-Ang Lee , Feng-Ting Liao , Da-shan Shiu
- URL: https://arxiv.org/abs/2510.01796
- Abstract:
Multi-layer perceptrons (MLPs) conventionally follow a narrow-wide-narrow design where skip connections operate at the input/output dimensions while processing occurs in expanded hidden spaces. We challenge this convention by proposing wide-narrow-wide (Hourglass) MLP blocks where skip connections operate at expanded dimensions while residual computation flows through narrow bottlenecks. This inversion leverages higher-dimensional spaces for incremental refinement while maintaining computational efficiency through parameter-matched designs. Implementing Hourglass MLPs requires an initial projection to lift input signals to expanded dimensions. We propose that this projection can remain fixed at random initialization throughout training, enabling efficient training and inference implementations. We evaluate both architectures on generative tasks over popular image datasets, characterizing performance-parameter Pareto frontiers through systematic architectural search. Results show that Hourglass architectures consistently achieve superior Pareto frontiers compared to conventional designs. As parameter budgets increase, optimal Hourglass configurations favor deeper networks with wider skip connections and narrower bottlenecks-a scaling pattern distinct from conventional MLPs. Our findings suggest reconsidering skip connection placement in modern architectures, with potential applications extending to Transformers and other residual networks.
112. Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving
- Authors: Haibo Hu , Lianming Huang , Xinyu Wang , Yufei Cui , Nan Guan , Chun Jason Xue
- URL: https://arxiv.org/abs/2510.01795
- Abstract:
Vision-Language Models (VLMs) are increasingly applied in autonomous driving for unified perception and reasoning, but high inference latency hinders real-time deployment. Early-exit reduces latency by terminating inference at intermediate layers, yet its task-dependent nature limits generalization across diverse scenarios. We observe that this limitation aligns with autonomous driving: navigation systems can anticipate upcoming contexts (e.g., intersections, traffic lights), indicating which tasks will be required. We propose Nav-EE, a navigation-guided early-exit framework that precomputes task-specific exit layers offline and dynamically applies them online based on navigation priors. Experiments on CODA, Waymo, and BOSCH show that Nav-EE achieves accuracy comparable to full inference while reducing latency by up to 63.9%. Real-vehicle integration with Autoware Universe further demonstrates reduced inference latency (600ms to 300ms), supporting faster decision-making in complex scenarios. These results suggest that coupling navigation foresight with early-exit offers a viable path toward efficient deployment of large models in autonomous systems. Code and data are available at our anonymous repository: this https URL
113. Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction
- Authors: Ivan Leonidovich Litvak , Anton Kostin , Fedor Lashkin , Tatiana Maksiyan , Sergey Lagutin
- URL: https://arxiv.org/abs/2510.01792
- Abstract:
The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1–5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin’s concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) show strong negative correlations. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) showed moderate alignment, but its performance, using gpt-4.1-mini via g4f, suggests limited specialization for legal textse. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, with moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
114. Pack and Force Your Memory: Long-form and Consistent Video Generation
- Authors: Xiaofei Wu , Guozhen Zhang , Zhiyong Xu , Yuan Zhou , Qinglin Lu , Xuming He
- URL: https://arxiv.org/abs/2510.01784
- Abstract:
Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.
115. Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
- Authors: Wenbo Pan , Jie Xu , Qiguang Chen , Junhao Dong , Libo Qin , Xinfeng Li , Haining Yu , Xiaohua Jia
- URL: https://arxiv.org/abs/2510.01782
- Abstract:
Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability. However, existing metrics fail to faithfully measure this ability. On the one hand, simple refusal-based metrics are biased by refusal rates and yield inconsistent scores when models exhibit different refusal tendencies. On the other hand, existing calibration metrics are proxy-based, capturing the performance of auxiliary calibration processes rather than the model’s actual refusal behavior. In this work, we propose the Refusal Index (RI), a principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman’s rank correlation between refusal probability and error probability. To make RI practically measurable, we design a lightweight two-pass evaluation method that efficiently estimates RI from observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model’s intrinsic knowledge-aware refusal capability in factual tasks. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model’s overall accuracy and refusal rates. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile. This finding highlights the need to complement traditional accuracy metrics with the Refusal Index for comprehensive factuality evaluation.
116. Secure Multi-Modal Data Fusion in Federated Digital Health Systems via MCP
- Authors: Aueaphum Aueawatthanaphisut
- URL: https://arxiv.org/abs/2510.01780
- Abstract:
Secure and interoperable integration of heterogeneous medical data remains a grand challenge in digital health. Current federated learning (FL) frameworks offer privacy-preserving model training but lack standardized mechanisms to orchestrate multi-modal data fusion across distributed and resource-constrained environments. This study introduces a novel framework that leverages the Model Context Protocol (MCP) as an interoperability layer for secure, cross-agent communication in multi-modal federated healthcare systems. The proposed architecture unifies three pillars: (i) multi-modal feature alignment for clinical imaging, electronic medical records, and wearable IoT data; (ii) secure aggregation with differential privacy to protect patient-sensitive updates; and (iii) energy-aware scheduling to mitigate dropouts in mobile clients. By employing MCP as a schema-driven interface, the framework enables adaptive orchestration of AI agents and toolchains while ensuring compliance with privacy regulations. Experimental evaluation on benchmark datasets and pilot clinical cohorts demonstrates up to 9.8\% improvement in diagnostic accuracy compared with baseline FL, a 54\% reduction in client dropout rates, and clinically acceptable privacy–utility trade-offs. These results highlight MCP-enabled multi-modal fusion as a scalable and trustworthy pathway toward equitable, next-generation federated health infrastructures.
117. Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks
- Authors: Bruno Corcuera , Carlos Eiras-Franco , Brais Cancela
- URL: https://arxiv.org/abs/2510.01758
- Abstract:
Latent representations are critical for the performance and robustness of machine learning models, as they encode the essential features of data in a compact and informative manner. However, in vision tasks, these representations are often affected by noisy or irrelevant features, which can degrade the model’s performance and generalization capabilities. This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS). For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space. By leveraging an unsupervised framework, our approach avoids reliance on labeled data, making it broadly applicable across various domains and datasets. Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks, including clustering and image generation, while incurring a minimal increase in the computational cost.
118. Machine-interpretable Engineering Design Standards for Valve Specification
- Authors: Anders Gjerver , Rune Frostad , Vedrana Barisic , Melinda Hodkiewicz , Caitlin Woods , Mihaly Fekete , Arild Braathen Torjusen , Johan Wilhelm Kluwer
- URL: https://arxiv.org/abs/2510.01736
- Abstract:
Engineering design processes use technical specifications and must comply with standards. Product specifications, product type data sheets, and design standards are still mainly document-centric despite the ambition to digitalize industrial work. In this paper, we demonstrate how to transform information held in engineering design standards into modular, reusable, machine-interpretable ontologies and use the ontologies in quality assurance of the plant design and equipment selection process. We use modelling patterns to create modular ontologies for knowledge captured in the text and in frequently referenced tables in International Standards for piping, material and valve design. These modules are exchangeable, as stored in a W3C compliant format, and interoperable as they are aligned with the top-level ontology ISO DIS 23726-3: Industrial Data Ontology (IDO). We test these ontologies, created based on international material and piping standards and industry norms, on a valve selection process. Valves are instantiated in semantic asset models as individuals along with a semantic representation of the environmental condition at their location on the asset. We create “functional location tags” as OWL individuals that become instances of OWL class Valve Data Sheet (VDS) specified valves. Similarly we create instances of manufacturer product type. Our approach enables automated validation that a specific VDS is compliant with relevant industry standards. Using semantic reasoning and executable design rules, we also determine whether the product type meets the valve specification. Creation of shared, reusable IDO-based modular ontologies for design standards enables semantic reasoning to be applied to equipment selection processes and demonstrates the potential of this approach for Standards Bodies wanting to transition to digitized Smart Standards.
119. Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
- Authors: Jianing Yang , Sheng Li , Takahiro Shinozaki , Yuki Saito , Hiroshi Saruwatari
- URL: https://arxiv.org/abs/2510.01722
- Abstract:
Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
120. Latency-aware Multimodal Federated Learning over UAV Networks
- Authors: Shaba Shaon , Dinh C. Nguyen
- URL: https://arxiv.org/abs/2510.01717
- Abstract:
This paper investigates federated multimodal learning (FML) assisted by unmanned aerial vehicles (UAVs) with a focus on minimizing system latency and providing convergence analysis. In this framework, UAVs are distributed throughout the network to collect data, participate in model training, and collaborate with a base station (BS) to build a global model. By utilizing multimodal sensing, the UAVs overcome the limitations of unimodal systems, enhancing model accuracy, generalization, and offering a more comprehensive understanding of the environment. The primary objective is to optimize FML system latency in UAV networks by jointly addressing UAV sensing scheduling, power control, trajectory planning, resource allocation, and BS resource management. To address the computational complexity of our latency minimization problem, we propose an efficient iterative optimization algorithm combining block coordinate descent and successive convex approximation techniques, which provides high-quality approximate solutions. We also present a theoretical convergence analysis for the UAV-assisted FML framework under a non-convex loss function. Numerical experiments demonstrate that our FML framework outperforms existing approaches in terms of system latency and model training performance under different data settings.
121. PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
- Authors: Raahul Krishna Durairaju (1), K. Saruladha (2) ((1) California State University, Fullerton, (2) Puducherry Technological University)
- URL: https://arxiv.org/abs/2510.01715
- Abstract:
Neural Style Transfer (NST) has evolved from Gatys et al.’s (2015) CNN-based algorithm, enabling AI-driven artistic image synthesis. However, existing CNN and transformer-based models struggle to scale efficiently to complex styles and high-resolution inputs. We introduce PyramidStyler, a transformer framework with Pyramidal Positional Encoding (PPE): a hierarchical, multi-scale encoding that captures both local details and global context while reducing computational load. We further incorporate reinforcement learning to dynamically optimize stylization, accelerating convergence. Trained on Microsoft COCO and WikiArt, PyramidStyler reduces content loss by 62.6% (to 2.07) and style loss by 57.4% (to 0.86) after 4000 epochs–achieving 1.39 s inference–and yields further improvements (content 2.03; style 0.75) with minimal speed penalty (1.40 s) when using RL. These results demonstrate real-time, high-quality artistic rendering, with broad applications in media and design.
122. PolySim: Bridging the Sim-to-Real Gap for Humanoid Control via Multi-Simulator Dynamics Randomization
- Authors: Zixing Lei , Zibo Zhou , Sheng Yin , Yueru Chen , Qingyao Xu , Weixin Li , Yunhong Wang , Bowei Tang , Wei Jing , Siheng Chen
- URL: https://arxiv.org/abs/2510.01708
- Abstract:
Humanoid whole-body control (WBC) policies trained in simulation often suffer from the sim-to-real gap, which fundamentally arises from simulator inductive bias, the inherent assumptions and limitations of any single simulator. These biases lead to nontrivial discrepancies both across simulators and between simulation and the real world. To mitigate the effect of simulator inductive bias, the key idea is to train policies jointly across multiple simulators, encouraging the learned controller to capture dynamics that generalize beyond any single simulator’s assumptions. We thus introduce PolySim, a WBC training platform that integrates multiple heterogeneous simulators. PolySim can launch parallel environments from different engines simultaneously within a single training run, thereby realizing dynamics-level domain randomization. Theoretically, we show that PolySim yields a tighter upper bound on simulator inductive bias than single-simulator training. In experiments, PolySim substantially reduces motion-tracking error in sim-to-sim evaluations; for example, on MuJoCo, it improves execution success by 52.8 over an IsaacSim baseline. PolySim further enables zero-shot deployment on a real Unitree G1 without additional fine-tuning, showing effective transfer from simulation to the real world. We will release the PolySim code upon acceptance of this work.
123. Representational Alignment Across Model Layers and Brain Regions with Hierarchical Optimal Transport
- Authors: Shaan Shah , Meenakshi Khosla
- URL: https://arxiv.org/abs/2510.01706
- Abstract:
Standard representational similarity methods align each layer of a network to its best match in another independently, producing asymmetric results, lacking a global alignment score, and struggling with networks of different depths. These limitations arise from ignoring global activation structure and restricting mappings to rigid one-to-one layer correspondences. We propose Hierarchical Optimal Transport (HOT), a unified framework that jointly infers soft, globally consistent layer-to-layer couplings and neuron-level transport plans. HOT allows source neurons to distribute mass across multiple target layers while minimizing total transport cost under marginal constraints. This yields both a single alignment score for the entire network comparison and a soft transport plan that naturally handles depth mismatches through mass distribution. We evaluate HOT on vision models, large language models, and human visual cortex recordings. Across all domains, HOT matches or surpasses standard pairwise matching in alignment quality. Moreover, it reveals smooth, fine-grained hierarchical correspondences: early layers map to early layers, deeper layers maintain relative positions, and depth mismatches are resolved by distributing representations across multiple layers. These structured patterns emerge naturally from global optimization without being imposed, yet are absent in greedy layer-wise methods. HOT thus enables richer, more interpretable comparisons between representations, particularly when networks differ in architecture or depth.
124. Holistic Order Prediction in Natural Scenes
- Authors: Pierre Musacchio , Hyunmin Lee , Jaesik Park
- URL: https://arxiv.org/abs/2510.01704
- Abstract:
Even in controlled settings, understanding instance-wise geometries is a challenging task for a wide range of visual models. Although specialized systems exist, modern arts rely on expensive input formats (category labels, binary segmentation masks) and inference costs (a quadratic amount of forward passes). We mitigate these limitations by proposing InstaFormer, a network capable of holistic order prediction. That is, solely given an input RGB image, InstaFormer returns the full occlusion and depth orderings for all the instances in the scene in a single forward pass. At its core, InstaFormer relies on interactions between object queries and latent mask descriptors that semantically represent the same objects while carrying complementary information. We comprehensively benchmark and ablate our approach to highlight its effectiveness. Our code and models are open-source and available at this URL: this https URL .
125. Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation
- Authors: Seungseop Lim , Gibaeg Kim , Wooseok Han , Jean Seo , Hyunkyung Lee , Jaehyo Yoo , Eunho Yang
- URL: https://arxiv.org/abs/2510.01688
- Abstract:
Recent advances in Large Language Models (LLMs) have brought significant improvements to various service domains, including chatbots and medical pre-consultation applications. In the healthcare domain, the most common approach for adapting LLMs to multi-turn dialogue generation is Supervised Fine-Tuning (SFT). However, datasets for SFT in tasks like medical pre-consultation typically exhibit a skewed turn-count distribution. Training on such data induces a novel failure mechanism we term Format Inertia, where models tend to generate repetitive, format-correct, but diagnostically uninformative questions in long medical dialogues. To mitigate this observed failure mechanism, we adopt a simple, data-centric method that rebalances the turn-count distribution of the training dataset. Experimental results show that our approach substantially alleviates Format Inertia in medical pre-consultation.
126. How Do Language Models Compose Functions?
- Authors: Apoorv Khandelwal , Ellie Pavlick
- URL: https://arxiv.org/abs/2510.01685
- Abstract:
While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the “compositionality gap”: i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the idiomatic mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: this https URL .
127. Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning
- Authors: Xuchen Li , Xuzhao Li , Jiahui Gao , Renjie Pi , Shiyu Hu , Wentao Zhang
- URL: https://arxiv.org/abs/2510.01681
- Abstract:
Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model’s own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4\% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1\%, improving accuracy and simultaneously reducing tool usage by 66.5\% compared to the previous methods.
128. FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
- Authors: He Zhang , Anzhou Zhang , Jian Dai
- URL: https://arxiv.org/abs/2510.01674
- Abstract:
Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, an Objectioner raises question-style objections with no direct fixes, and a Host enforces consistency and closure. On GSM8K we observe about a 22% point gain over single-prompt and accuracy on par with CoT, with more than 10% higher ratings in reasoning and coherence from a uniform GPT 4.1 judge. FOR-Prompting also corrects mistakes without tools or human supervision on tricky queries, and improves performance for small-scale model (approx. 19% accuracy improved on Llama3.2:1b for GSM8K task), highlighting promise for small models and on personal device use. Beyond factual QA, qualitative analyses on open-ended tasks show enhanced exploration and refinement, with dialogue traces that make assumptions and trade-offs explicit. The protocol is model agnostic and operates purely at the prompt level through role-structured turns, so it works with hosted and local models of different sizes without retraining, and it supports large-scale study of objection-guided reasoning.
129. Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
- Authors: Wangxuan Fan , Ching Wang , Siqi Li , Nan Liu
- URL: https://arxiv.org/abs/2510.01663
- Abstract:
For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov–Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, KAN’s architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose \textbf{ShapKAN}, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node’s actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network compression. Our approach improves KAN’s interpretability advantages, facilitating deployment in resource-constrained environments.
130. MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization
- Authors: Yinhong Liu , Jianfeng He , Hang Su , Ruixue Lian , Yi Nian , Jake Vincent , Srikanth Vishnubhotla , Robinson Piramuthu , Saab Mansour
- URL: https://arxiv.org/abs/2510.01659
- Abstract:
Multimodal Dialogue Summarization (MDS) is a critical task with wide-ranging applications. To support the development of effective MDS models, robust automatic evaluation methods are essential for reducing both cost and human effort. However, such methods require a strong meta-evaluation benchmark grounded in human annotations. In this work, we introduce MDSEval, the first meta-evaluation benchmark for MDS, consisting image-sharing dialogues, corresponding summaries, and human judgments across eight well-defined quality aspects. To ensure data quality and richfulness, we propose a novel filtering framework leveraging Mutually Exclusive Key Information (MEKI) across modalities. Our work is the first to identify and formalize key evaluation dimensions specific to MDS. We benchmark state-of-the-art modal evaluation methods, revealing their limitations in distinguishing summaries from advanced MLLMs and their susceptibility to various bias.
131. Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing
- Authors: Amin Jalali , Milad Soltany , Michael Greenspan , Ali Etemad
- URL: https://arxiv.org/abs/2510.01658
- Abstract:
We propose TimeHUT, a novel method for learning time-series representations by hierarchical uniformity-tolerance balancing of contrastive representations. Our method uses two distinct losses to learn strong representations with the aim of striking an effective balance between uniformity and tolerance in the embedding space. First, TimeHUT uses a hierarchical setup to learn both instance-wise and temporal information from input time-series. Next, we integrate a temperature scheduler within the vanilla contrastive loss to balance the uniformity and tolerance characteristics of the embeddings. Additionally, a hierarchical angular margin loss enforces instance-wise and temporal contrast losses, creating geometric margins between positive and negative pairs of temporal sequences. This approach improves the coherence of positive pairs and their separation from the negatives, enhancing the capture of temporal dependencies within a time-series sample. We evaluate our approach on a wide range of tasks, namely 128 UCR and 30 UAE datasets for univariate and multivariate classification, as well as Yahoo and KPI datasets for anomaly detection. The results demonstrate that TimeHUT outperforms prior methods by considerable margins on classification, while obtaining competitive results for anomaly detection. Finally, detailed sensitivity and ablation studies are performed to evaluate different components and hyperparameters of our method.
132. Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning
- Authors: Jiashun Liu , Johan Obando-Ceron , Han Lu , Yancheng He , Weixun Wang , Wenbo Su , Bo Zheng , Pablo Samuel Castro , Aaron Courville , Ling Pan
- URL: https://arxiv.org/abs/2510.01656
- Abstract:
Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.
133. SoK: Measuring What Matters for Closed-Loop Security Agents
- Authors: Mudita Khurana , Raunak Jain
- URL: https://arxiv.org/abs/2510.01654
- Abstract:
Cybersecurity is a relentless arms race, with AI driven offensive systems evolving faster than traditional defenses can adapt. Research and tooling remain fragmented across isolated defensive functions, creating blind spots that adversaries exploit. Autonomous agents capable of integrating, exploit confirmation, remediation, and validation into a single closed loop offer promise, but the field lacks three essentials: a framework defining the agentic capabilities of security systems across security life cycle, a principled method for evaluating closed loop agents, and a benchmark for measuring their performance in practice. We introduce CLASP: the Closed-Loop Autonomous Security Performance framework which aligns the security lifecycle (reconnaissance, exploitation, root cause analysis, patch synthesis, validation) with core agentic capabilities (planning, tool use, memory, reasoning, reflection & perception) providing a common vocabulary and rubric for assessing agentic capabilities in security tasks. By applying CLASP to 21 representative works, we map where systems demonstrate strengths, and where capability gaps persist. We then define the Closed-Loop Capability (CLC) Score, a composite metric quantifying both degree of loop closure and operational effectiveness, and outline the requirements for a closed loop benchmark. Together, CLASP and the CLC Score, provide the vocabulary, diagnostics, and measurements needed to advance both function level performance and measure closed loop security agents.
134. The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM
- Authors: Kwanhee Lee , Hyeondo Jang , Dongyeop Lee , Dan Alistarh , Namhoon Lee
- URL: https://arxiv.org/abs/2510.01650
- Abstract:
Neural network pruning is a promising technique to mitigate the excessive computational and memory requirements of large language models (LLMs). Despite its promise, however, progress in this area has diminished, as conventional methods are seemingly unable to surpass moderate sparsity levels (50-60%) without severely degrading model accuracy. This work breaks through the current impasse, presenting a principled and effective method called $\texttt{Elsa}$, which achieves extreme sparsity levels of up to 90% while retaining high model fidelity. This is done by identifying several limitations in current practice, all of which can be traced back to their reliance on a surrogate objective formulation. $\texttt{Elsa}$ tackles this issue directly and effectively via standard and well-established constrained optimization techniques based on ADMM. Our extensive experiments across a wide range of models and scales show that $\texttt{Elsa}$ achieves substantial improvements over existing methods; e.g., it achieves 7.8$\times$ less perplexity than the best existing method on LLaMA-2-7B at 90% sparsity. Furthermore, we present $\texttt{Elsa}_{\text{-L}}$, a quantized variant that scales to extremely large models (27B), and establish its theoretical convergence guarantees. These results highlight meaningful progress in advancing the frontier of LLM sparsity, while promising that significant opportunities for further advancement may remain in directions that have so far attracted limited exploration.
135. Source-Free Cross-Domain Continual Learning
- Authors: Muhammad Tanzil Furqon , Mahardhika Pratama , Igor Škrjanc , Lin Liu , Habibullah Habibullah , Kutluyil Dogancay
- URL: https://arxiv.org/abs/2510.01649
- Abstract:
Although existing cross-domain continual learning approaches successfully address many streaming tasks having domain shifts, they call for a fully labeled source domain hindering their feasibility in the privacy constrained environments. This paper goes one step ahead with the problem of source-free cross-domain continual learning where the use of source-domain samples are completely prohibited. We propose the idea of rehearsal-free frequency-aware dynamic prompt collaborations (REFEREE) to cope with the absence of labeled source-domain samples in realm of cross-domain continual learning. REFEREE is built upon a synergy between a source-pre-trained model and a large-scale vision-language model, thus overcoming the problem of sub-optimal generalizations when relying only on a source pre-trained model. The domain shift problem between the source domain and the target domain is handled by a frequency-aware prompting technique encouraging low-frequency components while suppressing high-frequency components. This strategy generates frequency-aware augmented samples, robust against noisy pseudo labels. The noisy pseudo-label problem is further addressed with the uncertainty-aware weighting strategy where the mean and covariance matrix are weighted by prediction uncertainties, thus mitigating the adverse effects of the noisy pseudo label. Besides, the issue of catastrophic forgetting (CF) is overcome by kernel linear discriminant analysis (KLDA) where the backbone network is frozen while the classification is performed using the linear discriminant analysis approach guided by the random kernel method. Our rigorous numerical studies confirm the advantage of our approach where it beats prior arts having access to source domain samples with significant margins.
136. Position: Privacy Is Not Just Memorization!
- Authors: Niloofar Mireshghallah , Tianshi Li
- URL: https://arxiv.org/abs/2510.01645
- Abstract:
The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remain underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle – from data collection through deployment – and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016–2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.
137. NLP Methods for Detecting Novel LLM Jailbreaks and Keyword Analysis with BERT
- Authors: John Hawkins , Aditya Pramar , Rodney Beard , Rohitash Chandra
- URL: https://arxiv.org/abs/2510.01644
- Abstract:
Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer’s policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.
138. Towards Human-Centered RegTech: Unpacking Professionals’ Strategies and Needs for Using LLMs Safely
- Authors: Siying Hu , Yaxing Yao , Zhicong Lu
- URL: https://arxiv.org/abs/2510.01638
- Abstract:
Large Language Models are profoundly changing work patterns in high-risk professional domains, yet their application also introduces severe and underexplored compliance risks. To investigate this issue, we conducted semi-structured interviews with 24 highly-skilled knowledge workers from industries such as law, healthcare, and finance. The study found that these experts are commonly concerned about sensitive information leakage, intellectual property infringement, and uncertainty regarding the quality of model outputs. In response, they spontaneously adopt various mitigation strategies, such as actively distorting input data and limiting the details in their prompts. However, the effectiveness of these spontaneous efforts is limited due to a lack of specific compliance guidance and training for Large Language Models. Our research reveals a significant gap between current NLP tools and the actual compliance needs of experts. This paper positions these valuable empirical findings as foundational work for building the next generation of Human-Centered, Compliance-Driven Natural Language Processing for Regulatory Technology (RegTech), providing a critical human-centered perspective and design requirements for engineering NLP systems that can proactively support expert compliance workflows.
139. BioBlobs: Differentiable Graph Partitioning for Protein Representation Learning
- Authors: Xin Wang , Carlos Oliver
- URL: https://arxiv.org/abs/2510.01632
- Abstract:
Protein function is driven by coherent substructures which vary in size and topology, yet current protein representation learning models (PRL) distort these signals by relying on rigid substructures such as k-hop and fixed radius neighbourhoods. We introduce BioBlobs, a plug-and-play, fully differentiable module that represents proteins by dynamically partitioning structures into flexibly-sized, non-overlapping substructures (“blobs”). The resulting blobs are quantized into a shared and interpretable codebook, yielding a discrete vocabulary of function-relevant protein substructures used to compute protein embeddings. We show that BioBlobs representations improve the performance of widely used protein encoders such as GVP-GNN across various PRL tasks. Our approach highlights the value of architectures that directly capture function-relevant protein substructures, enabling both improved predictive performance and mechanistic insight into protein function.
140. Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
- Authors: Feiyang Kang , Newsha Ardalani , Michael Kuchnik , Youssef Emad , Mostafa Elhoushi , Shubhabrata Sengupta , Shang-Wen Li , Ramya Raghavendra , Ruoxi Jia , Carole-Jean Wu
- URL: https://arxiv.org/abs/2510.01631
- Abstract:
Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs with >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we found pre-training on rephrased synthetic data \textit{alone} is not faster than pre-training on natural web texts; while pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web texts can speed up 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data \textit{alone} results in notably higher loss on many downstream domains especially at small data budgets. “Good” ratios of synthetic data in training data mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-param models. These results contribute mixed evidence on “model collapse” during large-scale single-round (n=1) model training on synthetic data–training on rephrased synthetic data shows no degradation in performance in foreseeable scales whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by “model collapse”. Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
141. Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
- Authors: Feiyang Kang , Michael Kuchnik , Karthik Padthe , Marin Vlastelica , Ruoxi Jia , Carole-Jean Wu , Newsha Ardalani
- URL: https://arxiv.org/abs/2510.01624
- Abstract:
In post-training for reasoning Large Language Models (LLMs), the current state of practice trains LLMs in two independent stages: Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR, shortened as ``RL’’ below). In this work, we challenge whether high SFT scores translate to improved performance after RL. We provide extensive counter-examples where this is not true. We find high SFT scores can be biased toward simpler or more homogeneous data and are not reliably predictive of subsequent RL gains or scaled-up post-training effectiveness. In some cases, RL training on models with improved SFT performance could lead to substantially worse outcome compared to RL on the base model without SFT. We study alternative metrics and identify generalization loss on held-out reasoning examples and Pass@large k performance to provide strong proxies for the RL outcome. We trained hundreds of models up to 12B-parameter with SFT and RLVR via GRPO and ran extensive evaluations on 7 math benchmarks with up to 256 repetitions, spending $>$1M GPU hours. Experiments include models from Llama3, Mistral-Nemo, Qwen3 and multiple state-of-the-art SFT/RL datasets. Compared to directly predicting from pre-RL performance, prediction based on generalization loss and Pass@large k achieves substantial higher precision, improving $R^2$ coefficient and Spearman’s rank correlation coefficient by up to 0.5 (2x). This provides strong utility for broad use cases. For example, in most experiments, we find SFT training on unique examples for a one epoch underperforms training on half examples for two epochs, either after SFT or SFT-then-RL; With the same SFT budget, training only on short examples may lead to better SFT performance, though, it often leads to worse outcome after RL compared to training on examples with varying lengths. Evaluation tool will be open-sourced.
142. LLM4Rec: Large Language Models for Multimodal Generative Recommendation with Causal Debiasing
- Authors: Bo Ma , Hang Li , ZeHua Hu , XiaoFan Gui , LuYao Liu , Simon Lau
- URL: https://arxiv.org/abs/2510.01622
- Abstract:
Contemporary generative recommendation systems face significant challenges in handling multimodal data, eliminating algorithmic biases, and providing transparent decision-making processes. This paper introduces an enhanced generative recommendation framework that addresses these limitations through five key innovations: multimodal fusion architecture, retrieval-augmented generation mechanisms, causal inference-based debiasing, explainable recommendation generation, and real-time adaptive learning capabilities. Our framework leverages advanced large language models as the backbone while incorporating specialized modules for cross-modal understanding, contextual knowledge integration, bias mitigation, explanation synthesis, and continuous model adaptation. Extensive experiments on three benchmark datasets (MovieLens-25M, Amazon-Electronics, Yelp-2023) demonstrate consistent improvements in recommendation accuracy, fairness, and diversity compared to existing approaches. The proposed framework achieves up to 2.3% improvement in NDCG@10 and 1.4% enhancement in diversity metrics while maintaining computational efficiency through optimized inference strategies.
143. RAG-BioQA Retrieval-Augmented Generation for Long-Form Biomedical Question Answering
- Authors: Lovely Yeswanth Panchumarthi , Sai Prasad Gudari , Atharva Negi , Praveen Raj Budime , Harsit Upadhya
- URL: https://arxiv.org/abs/2510.01612
- Abstract:
The exponential growth of biomedical literature creates significant challenges for accessing precise medical information. Current biomedical question-answering systems primarily focus on short-form answers, failing to provide the comprehensive explanations necessary for clinical decision-making. We present RAG-BioQA, a novel framework combining retrieval-augmented generation with domain-specific fine-tuning to produce evidence-based, long-form biomedical answers. Our approach integrates BioBERT embeddings with FAISS indexing and compares various re-ranking strategies (BM25, ColBERT, MonoT5) to optimize context selection before synthesizing evidence through a fine-tuned T5 model. Experimental results on the PubMedQA dataset show significant improvements over baselines, with our best model achieving substantial gains across BLEU, ROUGE, and METEOR metrics, advancing the state of accessible, evidence-based biomedical knowledge retrieval.
144. Bridging Collaborative Filtering and Large Language Models with Dynamic Alignment, Multimodal Fusion and Evidence-grounded Explanations
- Authors: Bo Ma , LuYao Liu , Simon Lau , Chandler Yuan , and XueY Cui , Rosie Zhang
- URL: https://arxiv.org/abs/2510.01606
- Abstract:
Recent research has explored using Large Language Models for recommendation tasks by transforming user interaction histories and item metadata into text prompts, then having the LLM produce rankings or recommendations. A promising approach involves connecting collaborative filtering knowledge to LLM representations through compact adapter networks, which avoids expensive fine-tuning while preserving the strengths of both components. Yet several challenges persist in practice: collaborative filtering models often use static snapshots that miss rapidly changing user preferences; many real-world items contain rich visual and audio content beyond textual descriptions; and current systems struggle to provide trustworthy explanations backed by concrete evidence. Our work introduces \model{}, a framework that tackles these limitations through three key innovations. We develop an online adaptation mechanism that continuously incorporates new user interactions through lightweight modules, avoiding the need to retrain large models. We create a unified representation that seamlessly combines collaborative signals with visual and audio features, handling cases where some modalities may be unavailable. Finally, we design an explanation system that grounds recommendations in specific collaborative patterns and item attributes, producing natural language rationales users can verify. Our approach maintains the efficiency of frozen base models while adding minimal computational overhead, making it practical for real-world deployment.
145. A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation
- Authors: Neal Gregory Lawton , Alfy Samuel , Anoop Kumar , Daben Liu
- URL: https://arxiv.org/abs/2510.01600
- Abstract:
A Comparison of Independent and Joint Fine-tuning Strategies for Retrieval-Augmented Generation Download PDF Neal Gregory Lawton, Alfy Samuel, Anoop Kumar, Daben Liu Published: 20 Aug 2025, Last Modified: 17 Sept 2025EMNLP 2025 FindingsConference, Publication Chairs, AuthorsRevisionsBibTeXCC BY 4.0 Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Fine-tuning, Question Answering, Joint fine-tuning TL;DR: We evaluate and compare strategies for fine-tuning Retrieval Augmented Generation (RAG) pipelines, including independent fine-tuning, joint fine-tuning, and two-phase fine-tuning. Abstract: Retrieval augmented generation (RAG) is a popular framework for question answering that is powered by two large language models (LLMs): an embedding model that retrieves context documents from a database that are relevant to a given question, and a generator model that uses the retrieved context to generate an answer to the question. Both the embedding and generator models can be fine-tuned to increase performance of a RAG pipeline on a new task, but multiple fine-tuning strategies exist with different costs and benefits. In this paper, we evaluate and compare several RAG fine-tuning strategies, including independent, joint, and two-phase fine-tuning. In our experiments, we observe that all of these strategies achieve about equal improvement in EM and F1 generation quality metrics, although they have significantly different computational costs. We conclude the optimal fine-tuning strategy to use depends on whether the training dataset includes context labels and whether a grid search over the learning rates for the embedding and generator models is required.
146. Enhancing Noise Robustness of Parkinson’s Disease Telemonitoring via Contrastive Feature Augmentation
- Authors: Ziming Tang , Chengbin Hou , Tianyu Zhang , Bangxu Tian , Jinbao Wang , Hairong Lv
- URL: https://arxiv.org/abs/2510.01588
- Abstract:
Parkinson’s disease (PD) is one of the most common neurodegenerative disorder. PD telemonitoring emerges as a novel assessment modality enabling self-administered at-home tests of Unified Parkinson’s Disease Rating Scale (UPDRS) scores, enhancing accessibility for PD patients. However, three types of noise would occur during measurements: (1) patient-induced measurement inaccuracies, (2) environmental noise, and (3) data packet loss during transmission, resulting in higher prediction errors. To address these challenges, NoRo, a noise-robust UPDRS prediction framework is proposed. First, the original speech features are grouped into ordered bins, based on the continuous values of a selected feature, to construct contrastive pairs. Second, the contrastive pairs are employed to train a multilayer perceptron encoder for generating noise-robust features. Finally, these features are concatenated with the original features as the augmented features, which are then fed into the UPDRS prediction models. Notably, we further introduces a novel evaluation approach with customizable noise injection module, and extensive experiments show that NoRo can successfully enhance the noise robustness of UPDRS prediction across various downstream prediction models under different noisy environments.
147. Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression
- Authors: Joykirat Singh , Justin Chih-Yao Chen , Archiki Prasad , Elias Stengel-Eskin , Akshay Nambi , Mohit Bansal
- URL: https://arxiv.org/abs/2510.01581
- Abstract:
Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; but, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model’s self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
148. Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations
- Authors: Ricardo Gonzalez Penuela , Felipe Arias-Russi , Victor Capriles
- URL: https://arxiv.org/abs/2510.01576
- Abstract:
Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must go through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually-relevant information, we developed a system that draws on historical BLV users questions. When given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM generate descriptions more relevant to BLV users. An evaluation with three human labelers who revised 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users’ questions in 76.1% of cases (70 out of 92) and were preferred in 54.4% of comparisons (50 out of 92). Our paper reviews, and data analysis are publicly available in a Github repository at this https URL .
149. Synthetic Prefixes to Mitigate Bias in Real-Time Neural Query Autocomplete
- Authors: Adithya Rajan , Xiaoyu Liu , Prateek Verma , Vibhu Arora
- URL: https://arxiv.org/abs/2510.01574
- Abstract:
We introduce a data-centric approach for mitigating presentation bias in real-time neural query autocomplete systems through the use of synthetic prefixes. These prefixes are generated from complete user queries collected during regular search sessions where autocomplete was not active. This allows us to enrich the training data for learning to rank models with more diverse and less biased examples. This method addresses the inherent bias in engagement signals collected from live query autocomplete interactions, where model suggestions influence user behavior. Our neural ranker is optimized for real-time deployment under strict latency constraints and incorporates a rich set of features, including query popularity, seasonality, fuzzy match scores, and contextual signals such as department affinity, device type, and vertical alignment with previous user queries. To support efficient training, we introduce a task-specific simplification of the listwise loss, reducing computational complexity from $O(n^2)$ to $O(n)$ by leveraging the query autocomplete structure of having only one ground-truth selection per prefix. Deployed in a large-scale e-commerce setting, our system demonstrates statistically significant improvements in user engagement, as measured by mean reciprocal rank and related metrics. Our findings show that synthetic prefixes not only improve generalization but also provide a scalable path toward bias mitigation in other low-latency ranking tasks, including related searches and query recommendations.
150. From Supervision to Exploration: What Does Protein Language Model Learn During Reinforcement Learning?
- Authors: Hanqun Cao , Hongrui Zhang , Junde Xu , Zhou Zhang , Lingdong Shen , Minghao Sun , Ge Liu , Jinbo Xu , Wu-Jun Li , Jinren Ni , Cesar de la Fuente-Nunez , Tianfan Fu , Yejin Choi , Pheng-Ann Heng , Fang Wu
- URL: https://arxiv.org/abs/2510.01571
- Abstract:
Protein language models (PLMs) have advanced computational protein science through large-scale pretraining and scalable architectures. In parallel, reinforcement learning (RL) has broadened exploration and enabled precise multi-objective optimization in protein design. Yet whether RL can push PLMs beyond their pretraining priors to uncover latent sequence-structure-function rules remains unclear. We address this by pairing RL with PLMs across four domains: antimicrobial peptide design, kinase variant optimization, antibody engineering, and inverse folding. Using diverse RL algorithms and model classes, we ask if RL improves sampling efficiency and, more importantly, if it reveals capabilities not captured by supervised learning. Across benchmarks, RL consistently boosts success rates and sample efficiency. Performance follows a three-factor interaction: task headroom, reward fidelity, and policy capacity jointly determine gains. When rewards are accurate and informative, policies have sufficient capacity, and tasks leave room beyond supervised baselines, improvements scale; when rewards are noisy or capacity is constrained, gains saturate despite exploration. This view yields practical guidance for RL in protein design: prioritize reward modeling and calibration before scaling policy size, match algorithm and regularization strength to task difficulty, and allocate capacity where marginal gains are largest. Implementation is available at this https URL .
151. Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization
- Authors: Kezhao Liu , Jason Klein Liu , Mingtao Chen , Yiming Liu
- URL: https://arxiv.org/abs/2510.01555
- Abstract:
Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation-a practice that overlooks the term’s functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy’s score function (‘$k_n$ in reward’) or as a direct loss function through which gradients are propagated (‘$k_n$ as loss’). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional ‘$k_1$ in reward’ (like in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the ‘$k_2$ as loss’ formulation is, in fact, gradient-equivalent to ‘$k_1$ in reward’. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted ‘$k_3$ as loss’ (like in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of ‘$k_n$ as loss’ methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.
152. POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment
- Authors: Luoxi Tang , Yuqiao Meng , Ankita Patra , Weicheng Ma , Muchao Ye , Zhaohan Xi
- URL: https://arxiv.org/abs/2510.01552
- Abstract:
Large Language Models (LLMs) are intensively used to assist security analysts in counteracting the rapid exploitation of cyber threats, wherein LLMs offer cyber threat intelligence (CTI) to support vulnerability assessment and incident response. While recent work has shown that LLMs can support a wide range of CTI tasks such as threat analysis, vulnerability detection, and intrusion defense, significant performance gaps persist in practical deployments. In this paper, we investigate the intrinsic vulnerabilities of LLMs in CTI, focusing on challenges that arise from the nature of the threat landscape itself rather than the model architecture. Using large-scale evaluations across multiple CTI benchmarks and real-world threat reports, we introduce a novel categorization methodology that integrates stratification, autoregressive refinement, and human-in-the-loop supervision to reliably analyze failure instances. Through extensive experiments and human inspections, we reveal three fundamental vulnerabilities: spurious correlations, contradictory knowledge, and constrained generalization, that limit LLMs in effectively supporting CTI. Subsequently, we provide actionable insights for designing more robust LLM-powered CTI systems to facilitate future research.
153. Predictive Preference Learning from Human Interventions
- Authors: Haoyuan Cai , Zhenghao Peng , Bolei Zhou
- URL: https://arxiv.org/abs/2510.01545
- Abstract:
Learning from human involvement aims to incorporate the human subject to monitor and correct agent behavior errors. Although most interactive imitation learning methods focus on correcting the agent’s action at the current state, they do not adjust its actions in future states, which may be potentially more hazardous. To address this, we introduce Predictive Preference Learning from Human Interventions (PPL), which leverages the implicit preference signals contained in human interventions to inform predictions of future rollouts. The key idea of PPL is to bootstrap each human intervention into L future time steps, called the preference horizon, with the assumption that the agent follows the same action and the human makes the same intervention in the preference horizon. By applying preference optimization on these future states, expert corrections are propagated into the safety-critical regions where the agent is expected to explore, significantly improving learning efficiency and reducing human demonstrations needed. We evaluate our approach with experiments on both autonomous driving and robotic manipulation benchmarks and demonstrate its efficiency and generality. Our theoretical analysis further shows that selecting an appropriate preference horizon L balances coverage of risky states with label correctness, thereby bounding the algorithmic optimality gap. Demo and code are available at: this https URL
154. WALT: Web Agents that Learn Tools
- Authors: Viraj Prabhu , Yutong Dai , Matthew Fernandez , Jing Gu , Krithika Ramakrishnan , Yanqi Luo , Silvio Savarese , Caiming Xiong , Junnan Li , Zeyuan Chen , Ran Xu
- URL: https://arxiv.org/abs/2510.01524
- Abstract:
Web agents promise to automate complex browser tasks, but current methods remain brittle – relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites – spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
155. Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties
- Authors: Hossein Sholehrasa , Xuan Xu , Doina Caragea , Jim E. Riviere , Majid Jaberi-Douraki
- URL: https://arxiv.org/abs/2510.01520
- Abstract:
The safe use of pharmaceuticals in food-producing animals is vital to protect animal welfare and human food safety. Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using ~1.28 million reports (1987-2025 Q1) from the U.S. FDA’s OpenFDA Center for Veterinary Medicine. A preprocessing pipeline merged relational tables and standardized AEs through VeDDRA ontologies. Data were normalized, missing values imputed, and high-cardinality features reduced; physicochemical drug properties were integrated to capture chemical-residue links. We evaluated supervised models, including Random Forest, CatBoost, XGBoost, ExcelFormer, and large language models (Gemma 3-27B, Phi 3-12B). Class imbalance was addressed, such as undersampling and oversampling, with a focus on prioritizing recall for fatal outcomes. Ensemble methods(Voting, Stacking) and CatBoost performed best, achieving precision, recall, and F1-scores of 0.95. Incorporating Average Uncertainty Margin (AUM)-based pseudo-labeling of uncertain cases improved minority-class detection, particularly in ExcelFormer and XGBoost. Interpretability via SHAP identified biologically plausible predictors, including lung, heart, and bronchial disorders, animal demographics, and drug physicochemical properties. These features were strongly linked to fatal outcomes. Overall, the framework shows that combining rigorous data engineering, advanced machine learning, and explainable AI enables accurate, interpretable predictions of veterinary safety outcomes. The approach supports FARAD’s mission by enabling early detection of high-risk drug-event profiles, strengthening residue risk assessment, and informing regulatory and clinical decision-making.
156. From Videos to Indexed Knowledge Graphs – Framework to Marry Methods for Multimodal Content Analysis and Understanding
- Authors: Basem Rizk , Joel Walsh , Mark Core , Benjamin Nye
- URL: https://arxiv.org/abs/2510.01513
- Abstract:
Analysis of multi-modal content can be tricky, computationally expensive, and require a significant amount of engineering efforts. Lots of work with pre-trained models on static data is out there, yet fusing these opensource models and methods with complex data such as videos is relatively challenging. In this paper, we present a framework that enables efficiently prototyping pipelines for multi-modal content analysis. We craft a candidate recipe for a pipeline, marrying a set of pre-trained models, to convert videos into a temporal semi-structured data format. We translate this structure further to a frame-level indexed knowledge graph representation that is query-able and supports continual learning, enabling the dynamic incorporation of new domain-specific knowledge through an interactive medium.
157. Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information
- Authors: Rui Ai , Yuqi Pan , David Simchi-Levi , Milind Tambe , Haifeng Xu
- URL: https://arxiv.org/abs/2510.01499
- Abstract:
With the rapid progress of multi-agent large language model (LLM) reasoning, how to effectively aggregate answers from multiple LLMs has emerged as a fundamental challenge. Standard majority voting treats all answers equally, failing to consider latent heterogeneity and correlation across models. In this work, we design two new aggregation algorithms called Optimal Weight (OW) and Inverse Surprising Popularity (ISP), leveraging both first-order and second-order information. Our theoretical analysis shows these methods provably mitigate inherent limitations of majority voting under mild assumptions, leading to more reliable collective decisions. We empirically validate our algorithms on synthetic datasets, popular LLM fine-tuning benchmarks such as UltraFeedback and MMLU, and a real-world healthcare setting ARMMAN. Across all cases, our methods consistently outperform majority voting, offering both practical performance gains and conceptual insights for the design of robust multi-agent LLM pipelines.
158. AortaDiff: A Unified Multitask Diffusion Framework For Contrast-Free AAA Imaging
- Authors: Yuxuan Ou , Ning Bi , Jiazhen Pan , Jiancheng Yang , Boliang Yu , Usama Zidan , Regent Lee , Vicente Grau
- URL: https://arxiv.org/abs/2510.01498
- Abstract:
While contrast-enhanced CT (CECT) is standard for assessing abdominal aortic aneurysms (AAA), the required iodinated contrast agents pose significant risks, including nephrotoxicity, patient allergies, and environmental harm. To reduce contrast agent use, recent deep learning methods have focused on generating synthetic CECT from non-contrast CT (NCCT) scans. However, most adopt a multi-stage pipeline that first generates images and then performs segmentation, which leads to error accumulation and fails to leverage shared semantic and anatomical structures. To address this, we propose a unified deep learning framework that generates synthetic CECT images from NCCT scans while simultaneously segmenting the aortic lumen and thrombus. Our approach integrates conditional diffusion models (CDM) with multi-task learning, enabling end-to-end joint optimization of image synthesis and anatomical segmentation. Unlike previous multitask diffusion models, our approach requires no initial predictions (e.g., a coarse segmentation mask), shares both encoder and decoder parameters across tasks, and employs a semi-supervised training strategy to learn from scans with missing segmentation labels, a common constraint in real-world clinical data. We evaluated our method on a cohort of 264 patients, where it consistently outperformed state-of-the-art single-task and multi-stage models. For image synthesis, our model achieved a PSNR of 25.61 dB, compared to 23.80 dB from a single-task CDM. For anatomical segmentation, it improved the lumen Dice score to 0.89 from 0.87 and the challenging thrombus Dice score to 0.53 from 0.48 (nnU-Net). These segmentation enhancements led to more accurate clinical measurements, reducing the lumen diameter MAE to 4.19 mm from 5.78 mm and the thrombus area error to 33.85% from 41.45% when compared to nnU-Net. Code is available at this https URL .
159. Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
- Authors: Isha Gupta , Rylan Schaeffer , Joshua Kazdan , Ken Liu , Sanmi Koyejo
- URL: https://arxiv.org/abs/2510.01494
- Abstract:
The field of adversarial robustness has long established that adversarial examples can successfully transfer between image classifiers and that text jailbreaks can successfully transfer between language models (LMs). However, a pair of recent studies reported being unable to successfully transfer image jailbreaks between vision-language models (VLMs). To explain this striking difference, we propose a fundamental distinction regarding the transferability of attacks against machine learning models: attacks in the input data-space can transfer, whereas attacks in model representation space do not, at least not without geometric alignment of representations. We then provide theoretical and empirical evidence of this hypothesis in four different settings. First, we mathematically prove this distinction in a simple setting where two networks compute the same input-output map but via different representations. Second, we construct representation-space attacks against image classifiers that are as successful as well-known data-space attacks, but fail to transfer. Third, we construct representation-space attacks against LMs that successfully jailbreak the attacked models but again fail to transfer. Fourth, we construct data-space attacks against VLMs that successfully transfer to new VLMs, and we show that representation space attacks \emph{can} transfer when VLMs’ latent geometries are sufficiently aligned in post-projector space. Our work reveals that adversarial transfer is not an inherent property of all attacks but contingent on their operational domain - the shared data-space versus models’ unique representation spaces - a critical insight for building more robust models.
160. VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs
- Authors: Mohamad Al Mdfaa , Svetlana Lukina , Timur Akhtyamov , Arthur Nigmatzyanov , Dmitrii Nalberskii , Sergey Zagoruyko , Gonzalo Ferrer
- URL: https://arxiv.org/abs/2510.01483
- Abstract:
Vision-language models (VLMs) have shown potential for robot navigation but encounter fundamental limitations: they lack persistent scene memory, offer limited spatial reasoning, and do not scale effectively with video duration for real-time application. We present VL-KnG, a Visual Scene Understanding system that tackles these challenges using spatiotemporal knowledge graph construction and computationally efficient query processing for navigation goal identification. Our approach processes video sequences in chunks utilizing modern VLMs, creates persistent knowledge graphs that maintain object identity over time, and enables explainable spatial reasoning through queryable graph structures. We also introduce WalkieKnowledge, a new benchmark with about 200 manually annotated questions across 8 diverse trajectories spanning approximately 100 minutes of video data, enabling fair comparison between structured approaches and general-purpose VLMs. Real-world deployment on a differential drive robot demonstrates practical applicability, with our method achieving 77.27% success rate and 76.92% answer accuracy, matching Gemini 2.5 Pro performance while providing explainable reasoning supported by the knowledge graph, computational efficiency for real-time deployment across different tasks, such as localization, navigation and planning. Code and dataset will be released after acceptance.
161. Pharmacophore-Guided Generative Design of Novel Drug-Like Molecules
- Authors: Ekaterina Podplutova , Anastasia Vepreva , Olga A. Konovalova , Vladimir Vinogradov , Dmitrii O. Shkil , Andrei Dmitrenko
- URL: https://arxiv.org/abs/2510.01480
- Abstract:
The integration of artificial intelligence (AI) in early-stage drug discovery offers unprecedented opportunities for exploring chemical space and accelerating hit-to-lead optimization. However, docking optimization in generative approaches is computationally expensive and may lead to inaccurate results. Here, we present a novel generative framework that balances pharmacophore similarity to reference compounds with structural diversity from active molecules. The framework allows users to provide custom reference sets, including FDA-approved drugs or clinical candidates, and guides the \textit{de novo} generation of potential therapeutics. We demonstrate its applicability through a case study targeting estrogen receptor modulators and antagonists for breast cancer. The generated compounds maintain high pharmacophoric fidelity to known active molecules while introducing substantial structural novelty, suggesting strong potential for functional innovation and patentability. Comprehensive evaluation of the generated molecules against common drug-like properties confirms the robustness and pharmaceutical relevance of the approach.
162. Purrception: Variational Flow Matching for Vector-Quantized Image Generation
- Authors: Răzvan-Andrei Matişan , Vincent Tao Hu , Grigory Bartosh , Björn Ommer , Cees G. M. Snoek , Max Welling , Jan-Willem van de Meent , Mohammad Mahdi Derakhshani , Floor Eijkelboom
- URL: https://arxiv.org/abs/2510.01478
- Abstract:
We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k 256x256 generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
163. From keywords to semantics: Perceptions of large language models in data discovery
- Authors: Maura E Halstead , Mark A. Green , Caroline Jay , Richard Kingston , David Topping , Alexander Singleton
- URL: https://arxiv.org/abs/2510.01473
- Abstract:
Current approaches to data discovery match keywords between metadata and queries. This matching requires researchers to know the exact wording that other researchers previously used, creating a challenging process that could lead to missing relevant data. Large Language Models (LLMs) could enhance data discovery by removing this requirement and allowing researchers to ask questions with natural language. However, we do not currently know if researchers would accept LLMs for data discovery. Using a human-centered artificial intelligence (HCAI) focus, we ran focus groups (N = 27) to understand researchers’ perspectives towards LLMs for data discovery. Our conceptual model shows that the potential benefits are not enough for researchers to use LLMs instead of current technology. Barriers prevent researchers from fully accepting LLMs, but features around transparency could overcome them. Using our model will allow developers to incorporate features that result in an increased acceptance of LLMs for data discovery.
164. RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines
- Authors: Ahmed Adel Attia , Jing Liu , Carol Espy Wilson
- URL: https://arxiv.org/abs/2510.01462
- Abstract:
The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Classroom datasets remain limited and not publicly available, and the absence of dedicated classroom noise or Room Impulse Response (RIR) corpora prevents the use of standard data augmentation techniques. In this paper, we introduce a scalable methodology for synthesizing classroom noise and RIRs using game engines, a versatile framework that can extend to other domains beyond the classroom. Building on this methodology, we present RealClass, a dataset that combines a synthesized classroom noise corpus with a classroom speech dataset compiled from publicly available corpora. The speech data pairs a children’s speech corpus with instructional speech extracted from YouTube videos to approximate real classroom interactions in clean conditions. Experiments on clean and noisy speech show that RealClass closely approximates real classroom speech, making it a valuable asset in the absence of abundant real classroom speech.
165. The Three Regimes of Offline-to-Online Reinforcement Learning
- Authors: Lu Li , Tianwei Ni , Yihao Sun , Pierre-Luc Bacon
- URL: https://arxiv.org/abs/2510.01460
- Abstract:
Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices of online-fine tuning that work well in one setting can fail completely in another. We propose a stability–plasticity principle that can explain this inconsistency: we should preserve the knowledge of pretrained policy or offline dataset during online fine-tuning, whichever is better, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate this framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 of 63 cases. This work provides a principled framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.
166. The Command Line GUIde: Graphical Interfaces from Man Pages via AI
- Authors: Saketh Ram Kasibatla , Kiran Medleri Hiremath , Raven Rothkopf , Sorin Lerner , Haijun Xia , Brian Hempel
- URL: https://arxiv.org/abs/2510.01453
- Abstract:
Although birthed in the era of teletypes, the command line shell survived the graphical interface revolution of the 1980’s and lives on in modern desktop operating systems. The command line provides access to powerful functionality not otherwise exposed on the computer, but requires users to recall textual syntax and carefully scour documentation. In contrast, graphical interfaces let users organically discover and invoke possible actions through widgets and menus. To better expose the power of the command line, we demonstrate a mechanism for automatically creating graphical interfaces for command line tools by translating their documentation (in the form of man pages) into interface specifications via AI. Using these specifications, our user-facing system, called GUIde, presents the command options to the user graphically. We evaluate the generated interfaces on a corpus of commands to show to what degree GUIde offers thorough graphical interfaces for users’ real-world command line tasks.
167. Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
- Authors: Yifei Zuo , Yutong Yin , Zhichen Zeng , Ang Li , Banghua Zhu , Zhaoran Wang
- URL: https://arxiv.org/abs/2510.01450
- Abstract:
Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight-even at greater computational cost-has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2 d)$ and $\Theta(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at this https URL .
168. GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
- Authors: Angel Daruna , Nicholas Meegan , Han-Pang Chiu , Supun Samarasekera , Rakesh Kumar
- URL: https://arxiv.org/abs/2510.01448
- Abstract:
Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.
169. AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
- Authors: Anukriti Singh , Kasra Torshizi , Khuzema Habib , Kelin Yu , Ruohan Gao , Pratap Tokekar
- URL: https://arxiv.org/abs/2510.01433
- Abstract:
Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints, yielding a compact 38-dimensional state policy that can be trained in 15 minutes, which performs well in real-time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.
170. BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning
- Authors: Ching-Huei Tsou , Michal Ozery-Flato , Ella Barkan , Diwakar Mahajan , Ben Shapira
- URL: https://arxiv.org/abs/2510.01428
- Abstract:
Recent advances in large language models (LLMs) and biomedical foundation models (BioFMs) have achieved strong results in biological text reasoning, molecular modeling, and single-cell analysis, yet they remain siloed in disjoint embedding spaces, limiting cross-modal reasoning. We present BIOVERSE (Biomedical Vector Embedding Realignment for Semantic Engagement), a two-stage approach that adapts pretrained BioFMs as modality encoders and aligns them with LLMs through lightweight, modality-specific projection layers. The approach first aligns each modality to a shared LLM space through independently trained projections, allowing them to interoperate naturally, and then applies standard instruction tuning with multi-modal data to bring them together for downstream reasoning. By unifying raw biomedical data with knowledge embedded in LLMs, the approach enables zero-shot annotation, cross-modal question answering, and interactive, explainable dialogue. Across tasks spanning cell-type annotation, molecular description, and protein function reasoning, compact BIOVERSE configurations surpass larger LLM baselines while enabling richer, generative outputs than existing BioFMs, establishing a foundation for principled multi-modal biomedical reasoning.
171. Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting
- Authors: Jiping Li , Rishi Sonthalia
- URL: https://arxiv.org/abs/2510.01414
- Abstract:
This paper analyzes the generalization error of minimum-norm interpolating solutions in linear regression using spiked covariance data models. The paper characterizes how varying spike strengths and target-spike alignments can affect risk, especially in overparameterized settings. The study presents an exact expression for the generalization error, leading to a comprehensive classification of benign, tempered, and catastrophic overfitting regimes based on spike strength, the aspect ratio $c=d/n$ (particularly as $c \to \infty$), and target alignment. Notably, in well-specified aligned problems, increasing spike strength can surprisingly induce catastrophic overfitting before achieving benign overfitting. The paper also reveals that target-spike alignment is not always advantageous, identifying specific, sometimes counterintuitive, conditions for its benefit or detriment. Alignment with the spike being detrimental is empirically demonstrated to persist in nonlinear models.
172. Neural Network Surrogates for Free Energy Computation of Complex Chemical Systems
- Authors: Wasut Pornpatcharapong
- URL: https://arxiv.org/abs/2510.01396
- Abstract:
Free energy reconstruction methods such as Gaussian Process Regression (GPR) require Jacobians of the collective variables (CVs), a bottleneck that restricts the use of complex or machine-learned CVs. We introduce a neural network surrogate framework that learns CVs directly from Cartesian coordinates and uses automatic differentiation to provide Jacobians, bypassing analytical forms. On an MgCl2 ion-pairing system, our method achieved high accuracy for both a simple distance CV and a complex coordination-number CV. Moreover, Jacobian errors also followed a near-Gaussian distribution, making them suitable for GPR pipelines. This framework enables gradient-based free energy methods to incorporate complex and machine-learned CVs, broadening the scope of biochemistry and materials simulations.
173. Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence
- Authors: Myra Cheng , Cinoo Lee , Pranav Khadpe , Sunny Yu , Dyllan Han , Dan Jurafsky
- URL: https://arxiv.org/abs/2510.01395
- Abstract:
Both the general public and academic communities have raised concerns about sycophancy, the phenomenon of artificial intelligence (AI) excessively agreeing with or flattering users. Yet, beyond isolated media reports of severe consequences, like reinforcing delusions, little is known about the extent of sycophancy or how it affects people who use AI. Here we show the pervasiveness and harmful impacts of sycophancy when people seek advice from AI. First, across 11 state-of-the-art AI models, we find that models are highly sycophantic: they affirm users’ actions 50% more than humans do, and they do so even in cases where user queries mention manipulation, deception, or other relational harms. Second, in two preregistered experiments (N = 1604), including a live-interaction study where participants discuss a real interpersonal conflict from their life, we find that interaction with sycophantic AI models significantly reduced participants’ willingness to take actions to repair interpersonal conflict, while increasing their conviction of being in the right. However, participants rated sycophantic responses as higher quality, trusted the sycophantic AI model more, and were more willing to use it again. This suggests that people are drawn to AI that unquestioningly validate, even as that validation risks eroding their judgment and reducing their inclination toward prosocial behavior. These preferences create perverse incentives both for people to increasingly rely on sycophantic AI models and for AI model training to favor sycophancy. Our findings highlight the necessity of explicitly addressing this incentive structure to mitigate the widespread risks of AI sycophancy.
174. INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
- Authors: Ulas Berk Karli , Ziyao Shangguan , Tesca FItzgerald
- URL: https://arxiv.org/abs/2510.01389
- Abstract:
Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present \textbf{INSIGHT}, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $\pi_0$-FAST as the underlying model, we extract per-token \emph{entropy}, \emph{log-probability}, and Dirichlet-based estimates of \emph{aleatoric and epistemic uncertainty}, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
175. DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
- Authors: Chuan He , Shuyi Ren , Jingwei Mao , Erik G. Larsson
- URL: https://arxiv.org/abs/2510.01377
- Abstract:
In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations-a technique inherited from its centralized predecessor, Muon-and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
176. SPUS: A Lightweight and Parameter-Efficient Foundation Model for PDEs
- Authors: Abu Bucker Siddik , Diane Oyen , Alexander Most , Michal Kucer , Ayan Biswas
- URL: https://arxiv.org/abs/2510.01370
- Abstract:
We introduce Small PDE U-Net Solver (SPUS), a compact and efficient foundation model (FM) designed as a unified neural operator for solving a wide range of partial differential equations (PDEs). Unlike existing state-of-the-art PDE FMs-primarily based on large complex transformer architectures with high computational and parameter overhead-SPUS leverages a lightweight residual U-Net-based architecture that has been largely underexplored as a foundation model architecture in this domain. To enable effective learning in this minimalist framework, we utilize a simple yet powerful auto-regressive pretraining strategy which closely replicates the behavior of numerical solvers to learn the underlying physics. SPUS is pretrained on a diverse set of fluid dynamics PDEs and evaluated across 6 challenging unseen downstream PDEs spanning various physical systems. Experimental results demonstrate that SPUS using residual U-Net based architecture achieves state-of-the-art generalization on these downstream tasks while requiring significantly fewer parameters and minimal fine-tuning data, highlighting its potential as a highly parameter-efficient FM for solving diverse PDE systems.
177. Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
- Authors: Shoumik Saha , Jifan Chen , Sam Mayers , Sanjay Krishna Gouda , Zijian Wang , Varun Kumar
- URL: https://arxiv.org/abs/2510.01359
- Abstract:
Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass (“jailbreak”) attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt-only conditions in JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end. Moving to single-file regime in JAWS-1 drives compliance to ~ 100% for capable models and yields a mean ASR (Attack Success Rate) ~ 71%; the multi-file regime (JAWS-M) raises mean ASR to ~ 75%, with 32% instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability – ASR raises by 1.6x – because initial refusals are frequently overturned during later planning/tool-use steps. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent’s multi-step reasoning and tool use.
178. WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
- Authors: Yinuo Liu , Ruohan Xu , Xilong Wang , Yuqi Jia , Neil Zhenqiang Gong
- URL: https://arxiv.org/abs/2510.01354
- Abstract:
Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: this https URL .
179. HiSpec: Hierarchical Speculative Decoding for LLMs
- Authors: Avinash Kumar , Sujay Sanghavi , Poulami Das
- URL: https://arxiv.org/abs/2510.01336
- Abstract:
Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. $\textit{``Intermediate”}$ verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.
180. Low Rank Gradients and Where to Find Them
- Authors: Rishi Sonthalia , Michael Murray , Guido Montúfar
- URL: https://arxiv.org/abs/2510.01303
- Abstract:
This paper investigates low-rank structure in the gradients of the training loss for two-layer neural networks while relaxing the usual isotropy assumptions on the training data and parameters. We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned, we do not require independent data and weight matrices and we also analyze both the mean-field and neural-tangent-kernel scalings. We show that the gradient with respect to the input weights is approximately low rank and is dominated by two rank-one terms: one aligned with the bulk data-residue , and another aligned with the rank one spike in the input data. We characterize how properties of the training data, the scaling regime and the activation function govern the balance between these two components. Additionally, we also demonstrate that standard regularizers, such as weight decay, input noise and Jacobian penalties, also selectively modulate these components. Experiments on synthetic and real data corroborate our theoretical predictions.
181. Enhancing the development of Cherenkov Telescope Array control software with Large Language Models
- Authors: Dmitriy Kostunin , Elisa Jones , Vladimir Sotnikov , Valery Sotnikov , Sergo Golovachev , Alexandre Strube
- URL: https://arxiv.org/abs/2510.01299
- Abstract:
We develop AI agents based on instruction-finetuned large language models (LLMs) to assist in the engineering and operation of the Cherenkov Telescope Array Observatory (CTAO) Control and Data Acquisition Software (ACADA). These agents align with project-specific documentation and codebases, understand contextual information, interact with external APIs, and communicate with users in natural language. We present our progress in integrating these features into CTAO pipelines for operations and offline data analysis.
182. From 2D to 3D, Deep Learning-based Shape Reconstruction in Magnetic Resonance Imaging: A Review
- Authors: Emma McMillian , Abhirup Banerjee , Alfonso Bueno-Orovio
- URL: https://arxiv.org/abs/2510.01296
- Abstract:
Deep learning-based 3-dimensional (3D) shape reconstruction from 2-dimensional (2D) magnetic resonance imaging (MRI) has become increasingly important in medical disease diagnosis, treatment planning, and computational modeling. This review surveys the methodological landscape of 3D MRI reconstruction, focusing on 4 primary approaches: point cloud, mesh-based, shape-aware, and volumetric models. For each category, we analyze the current state-of-the-art techniques, their methodological foundation, limitations, and applications across anatomical structures. We provide an extensive overview ranging from cardiac to neurological to lung imaging. We also focus on the clinical applicability of models to diseased anatomy, and the influence of their training and testing data. We examine publicly available datasets, computational demands, and evaluation metrics. Finally, we highlight the emerging research directions including multimodal integration and cross-modality frameworks. This review aims to provide researchers with a structured overview of current 3D reconstruction methodologies to identify opportunities for advancing deep learning towards more robust, generalizable, and clinically impactful solutions.
183. Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours
- Authors: Rui Melo , Rui Abreu , Corina S. Pasareanu
- URL: https://arxiv.org/abs/2510.01288
- Abstract:
We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
184. Evaluating New AI Cell Foundation Models on Challenging Kidney Pathology Cases Unaddressed by Previous Foundation Models
- Authors: Runchen Wang , Junlin Guo , Siqi Lu , Ruining Deng , Zhengyi Lu , Yanfan Zhu , Yuechen Yang , Chongyu Qu , Yu Wang , Shilin Zhao , Catie Chang , Mitchell Wilkes , Mengmeng Yin , Haichun Yang , Yuankai Huo
- URL: https://arxiv.org/abs/2510.01287
- Abstract:
Accurate cell nuclei segmentation is critical for downstream tasks in kidney pathology and remains a major challenge due to the morphological diversity and imaging variability of renal tissues. While our prior work has evaluated early-generation AI cell foundation models in this domain, the effectiveness of recent cell foundation models remains unclear. In this study, we benchmark advanced AI cell foundation models (2025), including CellViT++ variants and Cellpose-SAM, against three widely used cell foundation models developed prior to 2024, using a diverse large-scale set of kidney image patches within a human-in-the-loop rating framework. We further performed fusion-based ensemble evaluation and model agreement analysis to assess the segmentation capabilities of the different models. Our results show that CellViT++ [Virchow] yields the highest standalone performance with 40.3% of predictions rated as “Good” on a curated set of 2,091 challenging samples, outperforming all prior models. In addition, our fused model achieves 62.2% “Good” predictions and only 0.4% “Bad”, substantially reducing segmentation errors. Notably, the fusion model (2025) successfully resolved the majority of challenging cases that remained unaddressed in our previous study. These findings demonstrate the potential of AI cell foundation model development in renal pathology and provide a curated dataset of challenging samples to support future kidney-specific model refinement.
185. Emergent evaluation hubs in a decentralizing large language model ecosystem
- Authors: Manuel Cebrian , Tomomi Kito , Raul Castro Fernandez
- URL: https://arxiv.org/abs/2510.01286
- Abstract:
Large language models are proliferating, and so are the benchmarks that serve as their common yardsticks. We ask how the agglomeration patterns of these two layers compare: do they evolve in tandem or diverge? Drawing on two curated proxies for the ecosystem, the Stanford Foundation-Model Ecosystem Graph and the Evidently AI benchmark registry, we find complementary but contrasting dynamics. Model creation has broadened across countries and organizations and diversified in modality, licensing, and access. Benchmark influence, by contrast, displays centralizing patterns: in the inferred benchmark-author-institution network, the top 15% of nodes account for over 80% of high-betweenness paths, three countries produce 83% of benchmark outputs, and the global Gini for inferred benchmark authority reaches 0.89. An agent-based simulation highlights three mechanisms: higher entry of new benchmarks reduces concentration; rapid inflows can temporarily complicate coordination in evaluation; and stronger penalties against over-fitting have limited effect. Taken together, these results suggest that concentrated benchmark influence functions as coordination infrastructure that supports standardization, comparability, and reproducibility amid rising heterogeneity in model production, while also introducing trade-offs such as path dependence, selective visibility, and diminishing discriminative power as leaderboards saturate.
186. LLM-based Multi-Agent Blackboard System for Information Discovery in Data Science
- Authors: Alireza Salemi , Mihir Parmar , Palash Goyal , Yiwen Song , Jinsung Yoon , Hamed Zamani , Hamid Palangi , Tomas Pfister
- URL: https://arxiv.org/abs/2510.01285
- Abstract:
The rapid advancement of Large Language Models (LLMs) has opened new opportunities in data science, yet their practical deployment is often constrained by the challenge of discovering relevant data within large heterogeneous data lakes. Existing methods struggle with this: single-agent systems are quickly overwhelmed by large, heterogeneous files in the large data lakes, while multi-agent systems designed based on a master-slave paradigm depend on a rigid central controller for task allocation that requires precise knowledge of each sub-agent’s capabilities. To address these limitations, we propose a novel multi-agent communication paradigm inspired by the blackboard architecture for traditional AI models. In this framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents – either responsible for a partition of the data lake or general information retrieval – volunteer to respond based on their capabilities. This design improves scalability and flexibility by eliminating the need for a central coordinator to have prior knowledge of all sub-agents’ expertise. We evaluate our method on three benchmarks that require explicit data discovery: KramaBench and modified versions of DS-Bench and DA-Code to incorporate data discovery. Experimental results demonstrate that the blackboard architecture substantially outperforms baselines, including RAG and the master-slave multi-agent paradigm, achieving between 13% to 57% relative improvement in end-to-end task success and up to a 9% relative gain in F1 score for data discovery over the best-performing baselines across both proprietary and open-source LLMs. Our findings establish the blackboard paradigm as a scalable and generalizable communication framework for multi-agent systems.
187. An Analysis of the New EU AI Act and A Proposed Standardization Framework for Machine Learning Fairness
- Authors: Mike Teodorescu , Yongxu Sun , Haren N. Bhatia , Christos Makridis
- URL: https://arxiv.org/abs/2510.01281
- Abstract:
The European Union’s AI Act represents a crucial step towards regulating ethical and responsible AI systems. However, we find an absence of quantifiable fairness metrics and the ambiguity in terminology, particularly the interchangeable use of the keywords transparency, explainability, and interpretability in the new EU AI Act and no reference of transparency of ethical compliance. We argue that this ambiguity creates substantial liability risk that would deter investment. Fairness transparency is strategically important. We recommend a more tailored regulatory framework to enhance the new EU AI regulation. Further-more, we propose a public system framework to assess the fairness and transparency of AI systems. Drawing from past work, we advocate for the standardization of industry best practices as a necessary addition to broad regulations to achieve the level of details required in industry, while preventing stifling innovation and investment in the AI sector. The proposals are exemplified with the case of ASR and speech synthesizers.
188. TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
- Authors: Yongchao Chen , Jiefeng Chen , Rui Meng , Ji Yin , Na Li , Chuchu Fan , Chi Wang , Tomas Pfister , Jinsung Yoon
- URL: https://arxiv.org/abs/2510.01279
- Abstract:
While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
189. Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning
- Authors: Hengwei Zhao , Zhengzhong Tu , Zhuo Zheng , Wei Wang , Junjue Wang , Rusty Feagin , Wenzhe Jiao
- URL: https://arxiv.org/abs/2510.01278
- Abstract:
Positive-Unlabeled (PU) learning aims to train a binary classifier (positive vs. negative) where only limited positive data and abundant unlabeled data are available. While widely applicable, state-of-the-art PU learning methods substantially underperform their supervised counterparts on complex datasets, especially without auxiliary negatives or pre-estimated parameters (e.g., a 14.26% gap on CIFAR-100 dataset). We identify the primary bottleneck as the challenge of learning discriminative representations under unreliable supervision. To tackle this challenge, we propose NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU combines a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations despite unreliable supervision, with a phantom label disambiguation (PLD) scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, NoiSNCL and PLD can iteratively benefit each other from the perspective of the Expectation-Maximization framework. Empirically, extensive experiments demonstrate that: (1) NoiSNCL enables simple PU methods to achieve competitive performance; and (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging datasets on post-disaster building damage mapping, highlighting its promise for real-world applications. Code: Code will be open-sourced after review.
190. LLM Based Sentiment Classification From Bangladesh E-Commerce Reviews
- Authors: Sumaiya Tabassum
- URL: https://arxiv.org/abs/2510.01276
- Abstract:
Sentiment analysis is an essential part of text analysis, which is a larger field that includes determining and evaluating the author’s emotional state. This method is essential since it makes it easier to comprehend consumers’ feelings, viewpoints, and preferences holistically. The introduction of large language models (LLMs), such as Llama, has greatly increased the availability of cutting-edge model applications, such as sentiment analysis. However, accurate sentiment analysis is hampered by the intricacy of written language and the diversity of languages used in evaluations. The viability of using transformer-based BERT models and other LLMs for sentiment analysis from Bangladesh e commerce reviews is investigated in this paper. A subset of 4000 samples from the original dataset of Bangla and English customer reviews was utilized to fine-tune the model. The fine tuned Llama-3.1-8B model outperformed other fine-tuned models, including Phi-3.5-mini-instruct, Mistral-7B-v0.1, DistilBERT-multilingual, mBERT, and XLM-R-base, with an overall accuracy, precision, recall, and F1 score of 95.5%, 93%, 88%, 90%. The study emphasizes how parameter efficient fine-tuning methods (LoRA and PEFT) can lower computational overhead and make it appropriate for contexts with limited resources. The results show how LLMs can
191. Identifying Information-Transfer Nodes in a Recurrent Neural Network Reveals Dynamic Representations
- Authors: Arend Hintze , Asadullah Najam , Jory Schossau
- URL: https://arxiv.org/abs/2510.01271
- Abstract:
Understanding the internal dynamics of Recurrent Neural Networks (RNNs) is crucial for advancing their interpretability and improving their design. This study introduces an innovative information-theoretic method to identify and analyze information-transfer nodes within RNNs, which we refer to as \textit{information relays}. By quantifying the mutual information between input and output vectors across nodes, our approach pinpoints critical pathways through which information flows during network operations. We apply this methodology to both synthetic and real-world time series classification tasks, employing various RNN architectures, including Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). Our results reveal distinct patterns of information relay across different architectures, offering insights into how information is processed and maintained over time. Additionally, we conduct node knockout experiments to assess the functional importance of identified nodes, significantly contributing to explainable artificial intelligence by elucidating how specific nodes influence overall network behavior. This study not only enhances our understanding of the complex mechanisms driving RNNs but also provides a valuable tool for designing more robust and interpretable neural networks.
192. Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection
- Authors: Hoang Phan , Victor Li , Qi Lei
- URL: https://arxiv.org/abs/2510.01270
- Abstract:
Large language models (LLMs) have revolutionized natural language processing with their ability to generate coherent and contextually relevant text. However, their deployment raises significant concerns about the potential for generating harmful or inappropriate content. In this paper, we introduce Progressive Self-Reflection (PSR), a novel inference-time technique that empowers LLMs to self-monitor and correct their outputs dynamically. Experimental results demonstrate that applying our proposed method to Llama-3.1-8B-Instruct reduces the attack success rate from 77.5\% to 5.9\%, to Llama-3.1-8B base from 89.7\% to 5.6\%, and to Qwen2.5-7B-Instruct from 44.4\% to 3.8\%, without additional training, while maintaining their original performance on benign tasks. Our approach acts as a test-time scaling method, where additional self-reflection rounds enhance safety at the cost of inference overhead. To balance safety with computational efficiency, we introduce a lightweight self-reflection predictor that estimates the optimal number of reflection rounds based on input complexity. This adaptive mechanism prevents unnecessary self-assessment on benign inputs while ensuring thorough evaluation when encountering potentially harmful content. Our findings suggest that Progressive Self-Reflection serves as a scalable test-time approach, enhancing LLM safety by dynamically allocating computational resources in proportion to the input’s risk profile.
193. AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees
- Authors: Hongyi Zhou , Jin Zhu , Pingfan Su , Kai Ye , Ying Yang , Shakeel A O B Gavioli-Akilagun , Chengchun Shi
- URL: https://arxiv.org/abs/2510.01268
- Abstract:
We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state of the art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT – a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combination of datasets and LLMs, and the improvement can reach up to 58%. A python implementation of our method is available at this https URL .
194. OpenAI’s GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language
- Authors: Isa Inuwa-Dutse
- URL: https://arxiv.org/abs/2510.01266
- Abstract:
In response to the recent safety probing for OpenAI’s GPT-OSS-20b model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model’s reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model’s behaviour. With a minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model’s safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that common insecticide locally known as Fiya-Fiya (Cyphermethrin) and rodenticide like Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming effort and offer some recommendations.
195. RLP: Reinforcement as a Pretraining Objective
- Authors: Ali Hatamizadeh , Syeda Nahida Akter , Shrimai Prabhumoye , Jan Kautz , Mostofa Patwary , Mohammad Shoeybi , Bryan Catanzaro , Yejin Choi
- URL: https://arxiv.org/abs/2510.01265
- Abstract:
The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective, that brings the core spirit of reinforcement learning – exploration – to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training for the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
196. Budgeted Broadcast: An Activity-Dependent Pruning Rule for Neural Network Efficiency
- Authors: Yaron Meirovitch , Fuming Yang , Jeff Lichtman , Nir Shavit
- URL: https://arxiv.org/abs/2510.01263
- Abstract:
Most pruning methods remove parameters ranked by impact on loss (e.g., magnitude or gradient). We propose Budgeted Broadcast (BB), which gives each unit a local traffic budget (the product of its long-term on-rate $a_i$ and fan-out $k_i$). A constrained-entropy analysis shows that maximizing coding entropy under a global traffic budget yields a selectivity-audience balance, $\log\frac{1-a_i}{a_i}=\beta k_i$. BB enforces this balance with simple local actuators that prune either fan-in (to lower activity) or fan-out (to reduce broadcast). In practice, BB increases coding entropy and decorrelation and improves accuracy at matched sparsity across Transformers for ASR, ResNets for face identification, and 3D U-Nets for synapse prediction, sometimes exceeding dense baselines. On electron microscopy images, it attains state-of-the-art F1 and PR-AUC under our evaluation protocol. BB is easy to integrate and suggests a path toward learning more diverse and efficient representations.
197. RSTGCN: Railway-centric Spatio-Temporal Graph Convolutional Network for Train Delay Prediction
- Authors: Koyena Chowdhury , Paramita Koley , Abhijnan Chakraborty , Saptarshi Ghosh
- URL: https://arxiv.org/abs/2510.01262
- Abstract:
Accurate prediction of train delays is critical for efficient railway operations, enabling better scheduling and dispatching decisions. While earlier approaches have largely focused on forecasting the exact delays of individual trains, recent studies have begun exploring station-level delay prediction to support higher-level traffic management. In this paper, we propose the Railway-centric Spatio-Temporal Graph Convolutional Network (RSTGCN), designed to forecast average arrival delays of all the incoming trains at railway stations for a particular time period. Our approach incorporates several architectural innovations and novel feature integrations, including train frequency-aware spatial attention, which significantly enhances predictive performance. To support this effort, we curate and release a comprehensive dataset for the entire Indian Railway Network (IRN), spanning 4,735 stations across 17 zones - the largest and most diverse railway network studied to date. We conduct extensive experiments using multiple state-of-the-art baselines, demonstrating consistent improvements across standard metrics. Our work not only advances the modeling of average delay prediction in large-scale rail networks but also provides an open dataset to encourage further research in this critical domain.
198. IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol
- Authors: Ningyuan Yang , Guanliang Lyu , Mingchen Ma , Yiyi Lu , Yiming Li , Zhihui Gao , Hancheng Ye , Jianyi Zhang , Tingjun Chen , Yiran Chen
- URL: https://arxiv.org/abs/2510.01260
- Abstract:
The integration of Large Language Models (LLMs) with Internet-of-Things (IoT) systems faces significant challenges in hardware heterogeneity and control complexity. The Model Context Protocol (MCP) emerges as a critical enabler, providing standardized communication between LLMs and physical devices. We propose IoT-MCP, a novel framework that implements MCP through edge-deployed servers to bridge LLMs and IoT ecosystems. To support rigorous evaluation, we introduce IoT-MCP Bench, the first benchmark containing 114 Basic Tasks (e.g.,
What is the current temperature?'') and 1,140 Complex Tasks (e.g.,I feel so hot, do you have any ideas?’’) for IoT-enabled LLMs. Experimental validation across 22 sensor types and 6 microcontroller units demonstrates IoT-MCP’s 100% task success rate to generate tool calls that fully meet expectations and obtain completely accurate results, 205ms average response time, and 74KB peak memory footprint. This work delivers both an open-source integration framework ( this https URL ) and a standardized evaluation methodology for LLM-IoT systems.
199. In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
- Authors: Nils Durner
- URL: https://arxiv.org/abs/2510.01259
- Abstract:
We probe OpenAI’s open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext (“what to avoid”), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A “Linux terminal” role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched “helpfulness” and “harmfulness” evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at this https URL .
200. Measuring Algorithmic Partisanship via Zero-Shot Classification and Its Implications on Political Discourse
- Authors: Nathan Junzi Chen
- URL: https://arxiv.org/abs/2510.01258
- Abstract:
Amidst the rapid normalization of generative artificial intelligence (GAI), intelligent systems have come to dominate political discourse across information mediums. However, internalized political biases stemming from training data skews, human prejudice, and algorithmic flaws continue to plague the novel technology. This paper employs a zero-shot classification approach to evaluate algorithmic political partisanship through a methodical combination of ideological alignment, topicality, response sentiment, and objectivity. A total of 1800 model responses across six mainstream large language models (LLMs) were individually input into four distinct fine-tuned classification algorithms, each responsible for computing an aforementioned bias evaluation metric. Results show an amplified liberal-authoritarian alignment across all six LLMs evaluated, with notable instances of reasoning supersessions and canned refusals. The study subsequently highlights the psychological influences underpinning human-computer interactions and how intrinsic biases can permeate public discourse. The resulting distortion of the political landscape can ultimately manifest as conformity or polarization, depending on a region’s pre-existing socio-political structures.
201. RJE: A Retrieval-Judgment-Exploration Framework for Efficient Knowledge Graph Question Answering with LLMs
- Authors: Can Lin , Zhengwang Jiang , Ling Zheng , Qi Zhao , Yuhang Zhang , Qi Song , Wangqiu Zhou
- URL: https://arxiv.org/abs/2510.01257
- Abstract:
Knowledge graph question answering (KGQA) aims to answer natural language questions using knowledge graphs. Recent research leverages large language models (LLMs) to enhance KGQA reasoning, but faces limitations: retrieval-based methods are constrained by the quality of retrieved information, while agent-based methods rely heavily on proprietary LLMs. To address these limitations, we propose Retrieval-Judgment-Exploration (RJE), a framework that retrieves refined reasoning paths, evaluates their sufficiency, and conditionally explores additional evidence. Moreover, RJE introduces specialized auxiliary modules enabling small-sized LLMs to perform effectively: Reasoning Path Ranking, Question Decomposition, and Retriever-assisted Exploration. Experiments show that our approach with proprietary LLMs (such as GPT-4o-mini) outperforms existing baselines while enabling small open-source LLMs (such as 3B and 8B parameters) to achieve competitive results without fine-tuning LLMs. Additionally, RJE substantially reduces the number of LLM calls and token usage compared to agent-based methods, yielding significant efficiency improvements.
202. Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters
- Authors: Lingling Zeng , Gen Zhang , Jialin Peng , Xiang Xu , Yuan Xu , Lijun Ma
- URL: https://arxiv.org/abs/2510.01256
- Abstract:
As AI cluster sizes continue to expand and the demand for large-language-model (LLM) training and inference workloads grows rapidly, traditional scheduling systems face significant challenges in balancing resource utilization, scheduling efficiency, and service quality. This paper presents and evaluates Kant: an efficient unified scheduling platform designed for large-scale AI container clusters, supporting the co-scheduling of both training and inference jobs. Based on the practical implementation of the Kant system, we systematically define a set of key evaluation metrics for AI clusters, including GPU Allocation Ratio (GAR), Scheduling Occupancy Rate (SOR), GPU Node Fragmentation Ratio (GFR), Job Waiting Time Distribution (JWTD), and Job Training Time Estimation Distribution (JTTED), providing a foundation for quantitative performance analysis. Experimental results demonstrate that Kant achieves exceptional performance in clusters ranging from hundreds to tens of thousands of GPUs. By leveraging scheduling strategies such as Backfill and Enhanced Binpack (E-Binpack), the system significantly improves resource utilization and scheduling efficiency, while effectively reducing resource fragmentation and communication overhead in distributed training. The system has been deployed in multiple AI data center clusters, where it stably supports large-scale intelligent computing workloads. This work provides a practical engineering approach for building high-performance, highly available, AI-native scheduling infrastructure.
203. Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs
- Authors: Shree Harsha Bokkahalli Satish , Gustav Eje Henter , Éva Székely
- URL: https://arxiv.org/abs/2510.01254
- Abstract:
Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performances across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.
204. GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models
- Authors: Mariam Mahran , Katharina Simbeck
- URL: https://arxiv.org/abs/2510.01252
- Abstract:
As large language models (LLMs) are increasingly trained on massive, uncurated corpora, understanding both model representations and the data they internalize has become a major challenge. In this work, we show that pairing LLMs with sparse autoencoders (SAEs) enables interpretation not only of model behavior but also of the deeper structures, themes, and biases embedded in the training data. We train a GPT-style transformer model exclusively on the novels of Jane Austen, a corpus rich in social constructs and narrative patterns. We then apply SAEs to hidden states across multiple layers, uncovering sparse, interpretable features that reflect the key narratives and concepts present in the corpus, including gender, class, and societal duty. Our findings demonstrate that LLMs combined with SAEs can act as scalable probes into complex datasets, offering a new path for corpus exploration, bias discovery, and model interpretability at scale.
205. Let’s Play Across Cultures: A Large Multilingual, Multicultural Benchmark for Assessing Language Models’ Understanding of Sports
- Authors: Punit Kumar Singh , Nishant Kumar , Akash Ghosh , Kunal Pasad , Khushi Soni , Manisha Jaishwal , Sriparna Saha , Syukron Abu Ishaq Alfarozi , Asres Temam Abagissa , Kitsuchart Pasupa , Haiqin Yang , Jose G Moreno
- URL: https://arxiv.org/abs/2510.01247
- Abstract:
Language Models (LMs) are primarily evaluated on globally popular sports, often overlooking regional and indigenous sporting traditions. To address this gap, we introduce \textbf{\textit{CultSportQA}}, a benchmark designed to assess LMs’ understanding of traditional sports across 60 countries and 6 continents, encompassing four distinct cultural categories. The dataset features 33,000 multiple-choice questions (MCQs) across text and image modalities, each of which is categorized into three key types: history-based, rule-based, and scenario-based. To evaluate model performance, we employ zero-shot, few-shot, and chain-of-thought (CoT) prompting across a diverse set of Large Language Models (LLMs), Small Language Models (SLMs), and Multimodal Large Language Models (MLMs). By providing a comprehensive multilingual and multicultural sports benchmark, \textbf{\textit{CultSportQA}} establishes a new standard for assessing AI’s ability to understand and reason about traditional sports.
206. Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI
- Authors: Seyma Yaman Kayadibi
- URL: https://arxiv.org/abs/2510.01242
- Abstract:
Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann’s work on automata, Shannon’s theories of information and redundancy, and Turing’s behavioral approach to intelligence.
207. Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation
- Authors: Nandakishor M
- URL: https://arxiv.org/abs/2510.01237
- Abstract:
Large Language Models suffer from hallucination, generating plausible yet factually incorrect content. Current mitigation strategies focus on post-generation correction, which is computationally expensive and fails to prevent unreliable content generation. We propose a confidence-aware routing system that proactively assesses model uncertainty before generation and redirects queries based on estimated reliability. Our approach combines three complementary signals: semantic alignment between internal representations and reference embeddings, internal convergence analysis across model layers, and learned confidence estimation. The unified confidence score determines routing to four pathways: local generation for high confidence, retrieval-augmented generation for medium confidence, larger models for low confidence, and human review for very low confidence. Evaluation on knowledge-intensive QA benchmarks demonstrates significant improvements in hallucination detection (0.74 vs. 0.42 baseline) while reducing computational costs by 40% compared to post-hoc methods. The F1 score improves from 0.61 to 0.82 with low false positive rates (0.09). This paradigm shift from reactive correction to proactive assessment offers a computationally efficient approach to LLM reliability enhancement.
208. Automated Extraction of Material Properties using LLM-based AI Agents
- Authors: Subham Ghosh , Abhishek Tewari
- URL: https://arxiv.org/abs/2510.01235
- Abstract:
The rapid discovery of materials is constrained by the lack of large, machine-readable datasets that couple performance metrics with structural context. Existing databases are either small, manually curated, or biased toward first principles results, leaving experimental literature underexploited. We present an agentic, large language model (LLM)-driven workflow that autonomously extracts thermoelectric and structural-properties from about 10,000 full-text scientific articles. The pipeline integrates dynamic token allocation, zeroshot multi-agent extraction, and conditional table parsing to balance accuracy against computational cost. Benchmarking on 50 curated papers shows that GPT-4.1 achieves the highest accuracy (F1 = 0.91 for thermoelectric properties and 0.82 for structural fields), while GPT-4.1 Mini delivers nearly comparable performance (F1 = 0.89 and 0.81) at a fraction of the cost, enabling practical large scale deployment. Applying this workflow, we curated 27,822 temperature resolved property records with normalized units, spanning figure of merit (ZT), Seebeck coefficient, conductivity, resistivity, power factor, and thermal conductivity, together with structural attributes such as crystal class, space group, and doping strategy. Dataset analysis reproduces known thermoelectric trends, such as the superior performance of alloys over oxides and the advantage of p-type doping, while also surfacing broader structure-property correlations. To facilitate community access, we release an interactive web explorer with semantic filters, numeric queries, and CSV export. This study delivers the largest LLM-curated thermoelectric dataset to date, provides a reproducible and cost-profiled extraction pipeline, and establishes a foundation for scalable, data-driven materials discovery beyond thermoelectrics.
209. Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
- Authors: Dongjun Kim , Gyuho Shim , Yongchan Chun , Minhyuk Kim , Chanjun Park , Heuiseok Lim
- URL: https://arxiv.org/abs/2510.01232
- Abstract:
Large Language Models are commonly judged by their scores on standard benchmarks, yet such scores often overstate real capability since they mask the mix of skills a task actually demands. For example, ARC is assumed to test reasoning, while HellaSwag is designed to evaluate commonsense. However, we lack a systematic way to verify if these benchmarks actually measure these labels. We introduce Benchmark Profiling, a diagnostic framework that decomposes benchmark performance into ten cognitively grounded abilities. The method combines gradient-based importance scoring with targeted parameter ablation to compute an Ability Impact Score (AIS) that quantifies how much each ability contributes to a model’s success on a given benchmark. Profiling three instruction-tuned models across ten widely used benchmarks yields four key findings: (i) most benchmarks draw on several abilities rather than one, (ii) datasets with similar labels rely on distinct ability mixtures, (iii) code-generation benchmarks reward broad, multi-skill improvement and thus show only modest gains from narrow domain-specific fine-tuning, and (iv) abilities irrelevant to the task could negatively affect performance. Benchmark Profiling therefore explains why performance gains do not always translate into user-perceived competence and offers a transparent tool for benchmark audit and model interpretability.
210. Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models
- Authors: Shuaidong Pan , Di Wu
- URL: https://arxiv.org/abs/2510.01231
- Abstract:
This study addresses the reliability of automatic summarization in high-risk scenarios and proposes a large language model framework that integrates uncertainty quantification and risk-aware mechanisms. Starting from the demands of information overload and high-risk decision-making, a conditional generation-based summarization model is constructed, and Bayesian inference is introduced during generation to model uncertainty in the parameter space, which helps avoid overconfident predictions. The uncertainty level of the generated content is measured using predictive distribution entropy, and a joint optimization of entropy regularization and risk-aware loss is applied to ensure that key information is preserved and risk attributes are explicitly expressed during information compression. On this basis, the model incorporates risk scoring and regulation modules, allowing summaries to cover the core content accurately while enhancing trustworthiness through explicit risk-level prompts. Comparative experiments and sensitivity analyses verify that the proposed method significantly improves the robustness and reliability of summarization in high-risk applications while maintaining fluency and semantic integrity. This research provides a systematic solution for trustworthy summarization and demonstrates both scalability and practical value at the methodological level.
211. Enhancing Transformer-Based Rerankers with Synthetic Data and LLM-Based Supervision
- Authors: Dimitar Peshevski , Kiril Blazhevski , Martin Popovski , Gjorgji Madjarov
- URL: https://arxiv.org/abs/2510.01229
- Abstract:
Effective document reranking is essential for improving search relevance across diverse applications. While Large Language Models (LLMs) excel at reranking due to their deep semantic understanding and reasoning, their high computational cost makes them impractical for many real-world deployments. Fine-tuning smaller, task-specific models is a more efficient alternative but typically depends on scarce, manually labeled data. To overcome this, we propose a novel pipeline that eliminates the need for human-labeled query-document pairs. Our method uses LLMs to generate synthetic queries from domain-specific corpora and employs an LLM-based classifier to label positive and hard-negative pairs. This synthetic dataset is then used to fine-tune a smaller transformer model with contrastive learning using Localized Contrastive Estimation (LCE) loss. Experiments on the MedQuAD dataset show that our approach significantly boosts in-domain performance and generalizes well to out-of-domain tasks. By using LLMs for data generation and supervision rather than inference, we reduce computational costs while maintaining strong reranking capabilities.
212. ClaimCheck: Real-Time Fact-Checking with Small Language Models
- Authors: Akshith Reddy Putta , Jacob Devasier , Chengkai Li
- URL: https://arxiv.org/abs/2510.01226
- Abstract:
We introduce ClaimCheck, an LLM-guided automatic fact-checking system designed to verify real-world claims using live Web evidence and small language models. Unlike prior systems that rely on large, closed-source models and static knowledge stores, ClaimCheck employs a transparent, stepwise verification pipeline that mirrors human fact-checking workflows consisting of Web search query planning, Web-based evidence retrieval and summarization, evidence synthesis and re-retrieval, and claim verdict evaluation. Each module is optimized for small LLMs, allowing the system to deliver accurate and interpretable fact-checking with significantly lower computational requirements. Despite using a much smaller Qwen3-4B model, ClaimCheck achieves state-of-the-art accuracy of 76.4% on the AVeriTeC dataset, outperforming previous approaches using LLaMA3.1 70B and GPT-4o. Extensive ablations demonstrate that careful modular design and prompting strategies can overcome the limitations of smaller LLMs. To promote accessibility and transparency, we provide a public demo at this https URL .
213. Utilizing Modern Large Language Models (LLM) for Financial Trend Analysis and Digest Creation
- Authors: Andrei Lazarev , Dmitrii Sedov
- URL: https://arxiv.org/abs/2510.01225
- Abstract:
The exponential growth of information presents a significant challenge for researchers and professionals seeking to remain at the forefront of their fields and this paper introduces an innovative framework for automatically generating insightful financial digests using the power of Large Language Models (LLMs), specifically Google’s Gemini Pro. By leveraging a combination of data extraction from OpenAlex, strategic prompt engineering, and LLM-driven analysis, we demonstrate the automated example of creating a comprehensive digests that generalize key findings, identify emerging trends. This approach addresses the limitations of traditional analysis methods, enabling the efficient processing of vast amounts of unstructured data and the delivery of actionable insights in an easily digestible format. This paper describes how LLMs work in simple words and how we can use their power to help researchers and scholars save their time and stay informed about current trends. Our study includes step-by-step process, from data acquisition and JSON construction to interaction with Gemini and the automated generation of PDF reports, including a link to the project’s GitHub repository for broader accessibility and further development.
214. Context Matters: Comparison of commercial large language tools in veterinary medicine
- Authors: Tyler J Poore , Christopher J Pinard , Aleena Shabbir , Andrew Lagree , Andre Telfer , Kuan-Chuen Wu
- URL: https://arxiv.org/abs/2510.01224
- Abstract:
Large language models (LLMs) are increasingly used in clinical settings, yet their performance in veterinary medicine remains underexplored. We evaluated three commercially available veterinary-focused LLM summarization tools (Product 1 [Hachiko] and Products 2 and 3) on a standardized dataset of veterinary oncology records. Using a rubric-guided LLM-as-a-judge framework, summaries were scored across five domains: Factual Accuracy, Completeness, Chronological Order, Clinical Relevance, and Organization. Product 1 achieved the highest overall performance, with a median average score of 4.61 (IQR: 0.73), compared to 2.55 (IQR: 0.78) for Product 2 and 2.45 (IQR: 0.92) for Product 3. It also received perfect median scores in Factual Accuracy and Chronological Order. To assess the internal consistency of the grading framework itself, we repeated the evaluation across three independent runs. The LLM grader demonstrated high reproducibility, with Average Score standard deviations of 0.015 (Product 1), 0.088 (Product 2), and 0.034 (Product 3). These findings highlight the importance of veterinary-specific commercial LLM tools and demonstrate that LLM-as-a-judge evaluation is a scalable and reproducible method for assessing clinical NLP summarization in veterinary medicine.
215. Discourse vs emissions: Analysis of corporate narratives, symbolic practices, and mimicry through LLMs
- Authors: Bertrand Kian Hassani , Yacoub Bahini , Rizwan Mushtaq
- URL: https://arxiv.org/abs/2510.01222
- Abstract:
Climate change has increased demands for transparent and comparable corporate climate disclosures, yet imitation and symbolic reporting often undermine their value. This paper develops a multidimensional framework to assess disclosure maturity among 828 this http URL firms using large language models (LLMs) fine-tuned for climate communication. Four classifiers-sentiment, commitment, specificity, and target ambition-extract narrative indicators from sustainability and annual reports, which are linked to firm attributes such as emissions, market capitalization, and sector. Analyses reveal three insights: (1) risk-focused narratives often align with explicit commitments, but quantitative targets (e.g., net-zero pledges) remain decoupled from tone; (2) larger and higher-emitting firms disclose more commitments and actions than peers, though inconsistently with quantitative targets; and (3) widespread similarity in disclosure styles suggests mimetic behavior, reducing differentiation and decision usefulness. These results highlight the value of LLMs for ESG narrative analysis and the need for stronger regulation to connect commitments with verifiable transition strategies.
216. Towards Open-Ended Discovery for Low-Resource NLP
- Authors: Bonaventure F. P. Dossou , Henri Aïdasso
- URL: https://arxiv.org/abs/2510.01220
- Abstract:
Natural Language Processing (NLP) for low-resource languages remains fundamentally constrained by the lack of textual corpora, standardized orthographies, and scalable annotation pipelines. While recent advances in large language models have improved cross-lingual transfer, they remain inaccessible to underrepresented communities due to their reliance on massive, pre-collected data and centralized infrastructure. In this position paper, we argue for a paradigm shift toward open-ended, interactive language discovery, where AI systems learn new languages dynamically through dialogue rather than static datasets. We contend that the future of language technology, particularly for low-resource and under-documented languages, must move beyond static data collection pipelines toward interactive, uncertainty-driven discovery, where learning emerges dynamically from human-machine collaboration instead of being limited to pre-existing datasets. We propose a framework grounded in joint human-machine uncertainty, combining epistemic uncertainty from the model with hesitation cues and confidence signals from human speakers to guide interaction, query selection, and memory retention. This paper is a call to action: we advocate a rethinking of how AI engages with human knowledge in under-documented languages, moving from extractive data collection toward participatory, co-adaptive learning processes that respect and empower communities while discovering and preserving the world’s linguistic diversity. This vision aligns with principles of human-centered AI, emphasizing interactive, cooperative model building between AI systems and speakers.
217. Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset
- Authors: Leroy Z. Wang
- URL: https://arxiv.org/abs/2510.01219
- Abstract:
We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.
218. Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs
- Authors: Sergey Troshin , Wafaa Mohammed , Yan Meng , Christof Monz , Antske Fokkens , Vlad Niculae
- URL: https://arxiv.org/abs/2510.01218
- Abstract:
Diversity is an essential metric for evaluating the creativity of outputs generated by language models. Temperature-based sampling is a common strategy to increase diversity. However, for tasks that require high precision, e.g., mathematical reasoning, uncontrolled high temperature sampling, e.g., min-$p$ or top-$p$, degrades reasoning quality. We demonstrate that the loss of accuracy is caused by sampling incorrect continuations in sensitive decoding positions. To address this, in this paper, we propose \textbf{selective sampling}, a method that dynamically switches between greedy and high-temperature sampling based on a sampling risk metric. This risk metric estimates the likelihood of output errors when applying high-temperature sampling on the current token position. To predict sampling risk, we train a lightweight classifier on a small subset of verifiable problems. The trained classifier can be integrated with the base language model with minimal latency overhead. Experiments on mathematical reasoning tasks demonstrate that selective sampling enhances the quality-diversity trade-off, even in high-temperature settings.
219. Mamba Outpaces Reformer in Stock Prediction with Sentiments from Top Ten LLMs
- Authors: Lokesh Antony Kadiyala , Amir Mirzaeinia
- URL: https://arxiv.org/abs/2510.01203
- Abstract:
The stock market is extremely difficult to predict in the short term due to high market volatility, changes caused by news, and the non-linear nature of the financial time series. This research proposes a novel framework for improving minute-level prediction accuracy using semantic sentiment scores from top ten different large language models (LLMs) combined with minute interval intraday stock price data. We systematically constructed a time-aligned dataset of AAPL news articles and 1-minute Apple Inc. (AAPL) stock prices for the dates of April 4 to May 2, 2025. The sentiment analysis was achieved using the DeepSeek-V3, GPT variants, LLaMA, Claude, Gemini, Qwen, and Mistral models through their APIs. Each article obtained sentiment scores from all ten LLMs, which were scaled to a [0, 1] range and combined with prices and technical indicators like RSI, ROC, and Bollinger Band Width. Two state-of-the-art such as Reformer and Mamba were trained separately on the dataset using the sentiment scores produced by each LLM as input. Hyper parameters were optimized by means of Optuna and were evaluated through a 3-day evaluation period. Reformer had mean squared error (MSE) or the evaluation metrics, and it should be noted that Mamba performed not only faster but also better than Reformer for every LLM across the 10 LLMs tested. Mamba performed best with LLaMA 3.3–70B, with the lowest error of 0.137. While Reformer could capture broader trends within the data, the model appeared to over smooth sudden changes by the LLMs. This study highlights the potential of integrating LLM-based semantic analysis paired with efficient temporal modeling to enhance real-time financial forecasting.
220. LegiScout: A Visual Tool for Understanding Complex Legislation
- Authors: Aadarsh Rajiv , Klaus Mueller
- URL: https://arxiv.org/abs/2510.01195
- Abstract:
Modern legislative frameworks, such as the Affordable Care Act (ACA), often involve complex webs of agencies, mandates, and interdependencies. Government issued charts attempt to depict these structures but are typically static, dense, and difficult to interpret - even for experts. We introduce LegiScout, an interactive visualization system that transforms static policy diagrams into dynamic, force-directed graphs, enhancing comprehension while preserving essential relationships. By integrating data extraction, natural language processing, and computer vision techniques, LegiScout supports deeper exploration of not only the ACA but also a wide range of legislative and regulatory frameworks. Our approach enables stakeholders - policymakers, analysts, and the public - to navigate and understand the complexity inherent in modern law.
221. An Anthropologist LLM to Elicit Users’ Moral Preferences through Role-Play
- Authors: Gianluca De Ninno , Paola Inverardi , Francesca Belotti
- URL: https://arxiv.org/abs/2510.01189
- Abstract:
This study investigates a novel approach to eliciting users’ moral decision-making by combining immersive roleplaying games with LLM analysis capabilities. Building on the distinction introduced by Floridi between hard ethics inspiring and shaping laws-and soft ethics-moral preferences guiding individual behavior within the free space of decisions compliant to laws-we focus on capturing the latter through contextrich, narrative-driven interactions. Grounded in anthropological methods, the role-playing game exposes participants to ethically charged scenarios in the domain of digital privacy. Data collected during the sessions were interpreted by a customized LLM (“GPT Anthropologist”). Evaluation through a cross-validation process shows that both the richness of the data and the interpretive framing significantly enhance the model’s ability to predict user behavior. Results show that LLMs can be effectively employed to automate and enhance the understanding of user moral preferences and decision-making process in the early stages of software development.
222. Quantum-Assisted Correlation Clustering
- Authors: Antonio Macaluso , Supreeth Mysore Venkatesh , Diego Arenas , Matthias Klusch , Andreas Dengel
- URL: https://arxiv.org/abs/2509.03561
- Abstract:
This work introduces a hybrid quantum-classical method to correlation clustering, a graph-based unsupervised learning task that seeks to partition the nodes in a graph based on pairwise agreement and disagreement. In particular, we adapt GCS-Q, a quantum-assisted solver originally designed for coalition structure generation, to maximize intra-cluster agreement in signed graphs through recursive divisive partitioning. The proposed method encodes each bipartitioning step as a quadratic unconstrained binary optimization problem, solved via quantum annealing. This integration of quantum optimization within a hierarchical clustering framework enables handling of graphs with arbitrary correlation structures, including negative edges, without relying on metric assumptions or a predefined number of clusters. Empirical evaluations on synthetic signed graphs and real-world hyperspectral imaging data demonstrate that, when adapted for correlation clustering, GCS-Q outperforms classical algorithms in robustness and clustering quality on real-world data and in scenarios with cluster size imbalance. Our results highlight the promise of hybrid quantum-classical optimization for advancing scalable and structurally-aware clustering techniques in graph-based unsupervised learning.