전체 AI 논문 - 2025-09-24

1. Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World

Authors: Saeed Almheiri , Rania Hossam , Mena Attia , Chenxi Wang , Preslav Nakov , Timothy Baldwin , Fajri Koto
URL: https://arxiv.org/abs/2509.19265
Abstract:

Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning and direct preference optimization. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10\% on average, within multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesia and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.

2. AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration

Authors: Chunhao Tian , Yutong Wang , Xuebo Liu , Zhexuan Wang , Liang Ding , Miao Zhang , Min Zhang
URL: https://arxiv.org/abs/2509.19236
Abstract:

Proper initialization is crucial for any system, particularly in multi-agent systems (MAS), where it plays a pivotal role in determining both the system’s efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi-round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state-of-the-art initialization methods and pre-defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at this https URL .

3. Code Driven Planning with Domain-Adaptive Critic

Authors: Zikang Tian , Shaohui Peng , Du Huang , Jiaming Guo , Ruizhi Chen , Rui Zhang , Xishan Zhang , Yuxuan Guo , Zidong Du , Qi Guo , Ling Li , Yewen Pu , Xing Hu , Yunji Chen
URL: https://arxiv.org/abs/2509.19077
Abstract:

Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediate environmental feedback, which incurs substantial query costs. However, this refinement is typically guided by short-term environmental feedback, limiting LLMs from developing plans aligned with long-term rewards. We propose Code Driven Planning with Domain-Adaptive Critic (CoPiC). Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs, which iteratively produce and refine candidate plans. A trained domain-adaptive critic then evaluates these candidates and selects the one most aligned with long-term rewards for execution. Using high-level planning programs as planner and domain-adaptive critic as estimator, CoPiC improves planning while significantly reducing query costs. Results in ALFWorld, NetHack, and StarCraft II Unit Building show that CoPiC outperforms advanced LLM-based baselines, AdaPlanner and Reflexion, achieving an average (1) 23.33% improvement in success rate and (2) 91.27% reduction in query costs.

4. Towards Causal Representation Learning with Observable Sources as Auxiliaries

Authors: Kwonho Kim , Heejeong Nam , Inwoo Hwang , Sanghack Lee
URL: https://arxiv.org/abs/2509.19058
Abstract:

Causal representation learning seeks to recover latent factors that generate observational data through a mixing function. Needing assumptions on latent structures or relationships to achieve identifiability in general, prior works often build upon conditional independence given known auxiliary variables. However, prior frameworks limit the scope of auxiliary variables to be external to the mixing function. Yet, in some cases, system-driving latent factors can be easily observed or extracted from data, possibly facilitating identification. In this paper, we introduce a framework of observable sources being auxiliaries, serving as effective conditioning variables. Our main results show that one can identify entire latent variables up to subspace-wise transformations and permutations using volume-preserving encoders. Moreover, when multiple known auxiliary variables are available, we offer a variable-selection scheme to choose those that maximize recoverability of the latent factors given knowledge of the latent causal graph. Finally, we demonstrate the effectiveness of our framework through experiments on synthetic graph and image data, thereby extending the boundaries of current approaches.

5. Landmarks, Monuments, and Beacons: Understanding Generative Calls to Action

Authors: Victoire Hervé , Henrik Warpefelt , Christoph Salge
URL: https://arxiv.org/abs/2509.19030
Abstract:

Algorithmic evaluation of procedurally generated content struggles to find metrics that align with human experience, particularly for composite artefacts. Automatic decomposition as a possible solution requires concepts that meet a range of properties. To this end, drawing on Games Studies and Game AI research, we introduce the nested concepts of \textit{Landmarks}, \textit{Monuments}, and \textit{Beacons}. These concepts are based on the artefact’s perceivability, evocativeness, and Call to Action, all from a player-centric perspective. These terms are generic to games and usable across genres. We argue that these entities can be found and evaluated with techniques currently used in both research and industry, opening a path towards a fully automated decomposition of PCG, and evaluation of the salient sub-components. Although the work presented here emphasises mixed-initiative PCG and compositional PCG, we believe it applies beyond those domains. With this approach, we intend to create a connection between humanities and technical game research and allow for better computational PCG evaluation

6. Remaining Time Prediction in Outbound Warehouse Processes: A Case Study (Short Paper)

Authors: Erik Penther , Michael Grohs , Jana-Rebecca Rehse
URL: https://arxiv.org/abs/2509.18986
Abstract:

Predictive process monitoring is a sub-domain of process mining which aims to forecast the future of ongoing process executions. One common prediction target is the remaining time, meaning the time that will elapse until a process execution is completed. In this paper, we compare four different remaining time prediction approaches in a real-life outbound warehouse process of a logistics company in the aviation business. For this process, the company provided us with a novel and original event log with 169,523 traces, which we can make publicly available. Unsurprisingly, we find that deep learning models achieve the highest accuracy, but shallow methods like conventional boosting techniques achieve competitive accuracy and require significantly fewer computational resources.

7. From latent factors to language: a user study on LLM-generated explanations for an inherently interpretable matrix-based recommender system

Authors: Maxime Manderlier , Fabian Lecron , Olivier Vu Thanh , Nicolas Gillis
URL: https://arxiv.org/abs/2509.18980
Abstract:

We investigate whether large language models (LLMs) can generate effective, user-facing explanations from a mathematically interpretable recommendation model. The model is based on constrained matrix factorization, where user types are explicitly represented and predicted item scores share the same scale as observed ratings, making the model’s internal representations and predicted scores directly interpretable. This structure is translated into natural language explanations using carefully designed LLM prompts. Many works in explainable AI rely on automatic evaluation metrics, which often fail to capture users’ actual needs and perceptions. In contrast, we adopt a user-centered approach: we conduct a study with 326 participants who assessed the quality of the explanations across five key dimensions-transparency, effectiveness, persuasion, trust, and satisfaction-as well as the recommendations this http URL evaluate how different explanation strategies are perceived, we generate multiple explanation types from the same underlying model, varying the input information provided to the LLM. Our analysis reveals that all explanation types are generally well received, with moderate statistical differences between strategies. User comments further underscore how participants react to each type of explanation, offering complementary insights beyond the quantitative results.

8. LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions

Authors: Xixun Lin , Yucheng Ning , Jingwen Zhang , Yan Dong , Yilong Liu , Yongxuan Wu , Xiaohua Qi , Nan Sun , Yanmin Shang , Pengfei Cao , Lixin Zou , Xu Chen , Chuan Zhou , Jia Wu , Shirui Pan , Bin Wang , Yanan Cao , Kai Chen , Songlin Hu , Li Guo
URL: https://arxiv.org/abs/2509.18970
Abstract:

Driven by the rapid advancements of Large Language Models (LLMs), LLM-based agents have emerged as powerful intelligent systems capable of human-like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real-world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM-based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM-based agents. To this end, we present the first comprehensive survey of hallucinations in LLM-based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in-depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM-based agents, ultimately contributing to the development of more robust and reliable agent systems.

9. Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning

Authors: Xiao Han , Zimo Zhao , Wanyu Wang , Maolin Wang , Zitao Liu , Yi Chang , Xiangyu Zhao
URL: https://arxiv.org/abs/2509.18942
Abstract:

Recent advancements in Large Language Models (LLMs) have emphasized the critical role of fine-tuning (FT) techniques in adapting LLMs to specific tasks, especially when retraining from scratch is computationally infeasible. Fine-tuning enables LLMs to leverage task- or domain-specific data, producing models that more effectively meet the requirements of targeted applications. However, con- ventional FT approaches often suffer from catastrophic forgetting and suboptimal data efficiency, limiting their real-world applicability. To address these challenges, this paper proposes DEAL, a novel framework that integrates Low-Rank Adapta- tion (LoRA) with a continuous fine-tuning strategy. By incorporating knowledge retention and adaptive parameter update modules, the framework mitigates the lim- itations of existing FT methods while maintaining efficiency in privacy-preserving settings. Experiments on 15 diverse datasets show that DEAL consistently outper- forms baseline methods, yielding substantial gains in task accuracy and resource efficiency. These findings demonstrate the potential of our approach to advance continual adaptation in LLMs by enhancing task performance while improving resource efficiency.

10. How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

Authors: Songsong Yu , Yuxin Chen , Hao Ju , Lianjie Jia , Fuxi Zhang , Shaofei Huang , Yuhan Wu , Rundi Cui , Binghao Ran , Zaibin Zhang , Zhedong Zheng , Zhipeng Zhang , Yifan Wang , Lin Song , Lijun Wang , Yanwei Li , Ying Shan , Huchuan Lu
URL: https://arxiv.org/abs/2509.18905
Abstract:

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at this https URL .

11. LongCat-Flash-Thinking Technical Report

Authors: Meituan LongCat Team , Anchun Gui , Bei Li , Bingyang Tao , Bole Zhou , Borun Chen , Chao Zhang , Chao Zhang , Chengcheng Han , Chenhui Yang , Chi Zhang , Chong Peng , Chuyu Zhang , Cong Chen , Fengcun Li , Gang Xu , Guoyuan Lin , Hao Jiang , Hao Liang , Haomin Fu , Haoxiang Ma , Hong Liu , Hongyan Hao , Hongyin Tang , Hongyu Zang , Hongzhi Ni , Hui Su , Jiahao Liu , Jiahuan Li , Jialin Liu , Jianfei Zhang , Jianhao Xu , Jianing Wang , Jiaqi Sun , Jiaqi Zhang , Jiarong Shi , Jiawei Yang , Jingang Wang , Jinrui Ding , Jun Kuang , Jun Xu , Ke He , Kefeng Zhang , Keheng Wang , Keqing He , Li Wei , Liang Shi , Lin Qiu , Lingbin Kong , Lingchuan Liu , Linsen Guo , Longfei An , Mai Xia , Meng Zhou , Mengshen Zhu , Peng Pei , Pengcheng Jia , Qi Gu , Qi Guo , Qiong Huang , Quan Chen , Quanchi Weng , Rongxiang Weng , Ruichen Shao , Rumei Li , Shanglin Lei , Shuai Du , Shuaikang Liu , Shuang Zhou , Shuhao Hu , Siyu Xu , Songshan Gong , Tao Liang , Tianhao Hu , Wei He , Wei Shi , Wei Wang , Wei Wu , Wei Zhuo , Weifeng Tang , Wenjie Shi , Wenlong Zhu , Xi Su , Xiangcheng Liu , Xiangyu Xi , Xiangzhou Huang , Xiao Liu , Xiaochen Jiang , Xiaowei Shi , Xiaowen Shi , Xiaoyu Li , Xin Chen , Xinyue Zhao , Xuan Huang , Xuemiao Zhang , Xuezhi Cao , Xunliang Cai , Yajie Zhang , Yang Chen , Yang Liu
URL: https://arxiv.org/abs/2509.18883
Abstract:

We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19, 653 to 6, 965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.

12. Memory in Large Language Models: Mechanisms, Evaluation and Evolution

Authors: Dianxing Zhang , Wendong Li , Kani Song , Jiaye Lu , Gang Li , Liuchun Yang , Sheng Li
URL: https://arxiv.org/abs/2509.18868
Abstract:

Under a unified operational definition, we define LLM memory as a persistent state written during pretraining, finetuning, or inference that can later be addressed and that stably influences outputs. We propose a four-part taxonomy (parametric, contextual, external, procedural/episodic) and a memory quadruple (location, persistence, write/access path, controllability). We link mechanism, evaluation, and governance via the chain write -> read -> inhibit/update. To avoid distorted comparisons across heterogeneous setups, we adopt a three-setting protocol (parametric only, offline retrieval, online retrieval) that decouples capability from information availability on the same data and timeline. On this basis we build a layered evaluation: parametric (closed-book recall, edit differential, memorization/privacy), contextual (position curves and the mid-sequence drop), external (answer correctness vs snippet attribution/faithfulness), and procedural/episodic (cross-session consistency and timeline replay, E MARS+). The framework integrates temporal governance and leakage auditing (freshness hits, outdated answers, refusal slices) and uncertainty reporting via inter-rater agreement plus paired tests with multiple-comparison correction. For updating and forgetting, we present DMM Gov: coordinating DAPT/TAPT, PEFT, model editing (ROME, MEND, MEMIT, SERAC), and RAG to form an auditable loop covering admission thresholds, rollout, monitoring, rollback, and change audits, with specs for timeliness, conflict handling, and long-horizon consistency. Finally, we give four testable propositions: minimum identifiability; a minimal evaluation card; causally constrained editing with verifiable forgetting; and when retrieval with small-window replay outperforms ultra-long-context reading. This yields a reproducible, comparable, and governable coordinate system for research and deployment.

13. Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling

Authors: Yingxin Li , Jianbo Zhao , Xueyu Ren , Jie Tang , Wangjie You , Xu Chen , Kan Zhou , Chao Feng , Jiao Ran , Yuan Meng , Zhi Wang
URL: https://arxiv.org/abs/2509.18864
Abstract:

User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of comprehensive benchmarks. To bridge this gap, we propose ProfileBench, an industrial benchmark derived from a real-world video platform, encompassing heterogeneous user data and a well-structured profiling taxonomy. However, the profiling task remains challenging due to the difficulty of collecting large-scale ground-truth labels, and the heterogeneous and noisy user information can compromise the reliability of LLMs. To approach label-free and reliable user profiling, we propose a Confidence-driven Profile reasoning framework Conf-Profile, featuring a two-stage paradigm. We first synthesize high-quality labels by leveraging advanced LLMs with confidence hints, followed by confidence-weighted voting for accuracy improvement and confidence calibration for a balanced distribution. The multiple profile results, rationales, and confidence scores are aggregated and distilled into a lightweight LLM. We further enhance the reasoning ability via confidence-guided unsupervised reinforcement learning, which exploits confidence for difficulty filtering, quasi-ground truth voting, and reward weighting. Experimental results demonstrate that Conf-Profile delivers substantial performance through the two-stage training, improving F1 by 13.97 on Qwen3-8B.

14. MAPO: Mixed Advantage Policy Optimization

Authors: Wenke Huang , Quan Zhang , Yiyang Fang , Jian Liang , Xuankun Rong , Huanjin Yao , Guancheng Wan , Ke Liang , Wenwen He , Mingjun Li , Leszek Rutkowski , Mang Ye , Bo Du , Dacheng Tao
URL: https://arxiv.org/abs/2509.18849
Abstract:

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.

15. Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning

Authors: Hong-Jie Dai , Zheng-Hao Li , An-Tai Lu , Bo-Tsz Shain , Ming-Ta Li , Tatheer Hussain Mir , Kuang-Te Wang , Min-I Su , Pei-Kang Liu , Ming-Ju Tsai
URL: https://arxiv.org/abs/2509.18846
Abstract:

Accurate International Classification of Diseases (ICD) coding is critical for clinical documentation, billing, and healthcare analytics, yet it remains a labour-intensive and error-prone task. Although large language models (LLMs) show promise in automating ICD coding, their challenges in base model selection, input contextualization, and training data redundancy limit their effectiveness. We propose a modular framework for ICD-10 Clinical Modification (ICD-10-CM) code prediction that addresses these challenges through principled model selection, redundancy-aware data sampling, and structured input design. The framework integrates an LLM-as-judge evaluation protocol with Plackett-Luce aggregation to assess and rank open-source LLMs based on their intrinsic comprehension of ICD-10-CM code definitions. We introduced embedding-based similarity measures, a redundancy-aware sampling strategy to remove semantically duplicated discharge summaries. We leverage structured discharge summaries from Taiwanese hospitals to evaluate contextual effects and examine section-wise content inclusion under universal and section-specific modelling paradigms. Experiments across two institutional datasets demonstrate that the selected base model after fine-tuning consistently outperforms baseline LLMs in internal and external evaluations. Incorporating more clinical sections consistently improves prediction performance. This study uses open-source LLMs to establish a practical and principled approach to ICD-10-CM code prediction. The proposed framework provides a scalable, institution-ready solution for real-world deployment of automated medical coding systems by combining informed model selection, efficient data refinement, and context-aware prompting.

16. Bounded PCTL Model Checking of Large Language Model Outputs

Authors: Dennis Gross , Helge Spieker , Arnaud Gotlieb
URL: https://arxiv.org/abs/2509.18836
Abstract:

In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $\alpha$-$k$-bounded text generation, narrowing the focus to the $\alpha$ maximal cumulative probability on the top-$k$ tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top-$k$ tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold $\alpha$ further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of $\alpha$-$k$-bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.

17. The AGNTCY Agent Directory Service: Architecture and Implementation

Authors: Luca Muscariello , Vijoy Pandey , Ramiz Polic
URL: https://arxiv.org/abs/2509.18787
Abstract:

The Agent Directory Service (ADS) is a distributed directory for the discovery of AI agent capabilities, metadata, and provenance. It leverages content-addressed storage, hierarchical taxonomies, and cryptographic signing to enable efficient, verifiable, and multi-dimensional discovery across heterogeneous Multi-Agent Systems (MAS). Built on the Open Agentic Schema Framework (OASF), ADS decouples capability indexing from content location through a two-level mapping realized over a Kademlia-based Distributed Hash Table (DHT). It reuses mature OCI / ORAS infrastructure for artifact distribution, integrates Sigstore for provenance, and supports schema-driven extensibility for emerging agent modalities (LLM prompt agents, MCP servers, A2A-enabled components). This paper formalizes the architectural model, describes storage and discovery layers, explains security and performance properties, and positions ADS within the broader landscape of emerging agent registry and interoperability initiatives.

18. Experience Scaling: Post-Deployment Evolution For Large Language Models

Authors: Xingkun Yin , Kaibin Huang , Dong In Kim , Hongyang Du
URL: https://arxiv.org/abs/2509.18771
Abstract:

Scaling model size, training data, and compute power have driven advances in large language models (LLMs), but these approaches are reaching saturation as human-generated text is exhausted and further gains diminish. We propose experience scaling, a framework for continuous post-deployment evolution for LLMs through autonomous interaction with the environment and collaborative sharing of accumulated experience. The framework captures raw interactions, distills them into compact, reusable knowledge, and periodically refines stored content to preserve relevance and efficiency. We validate the framework in simulated real-world scenarios involving generalization to previously unseen but related tasks, repetitive queries, and over-saturated knowledge stores. Across all settings, experience scaling improves accuracy, sustains performance over time, and maintains gains when applied to novel situations. These results demonstrate that structured post-deployment learning can extend LLM capabilities beyond the limits of static human-generated data, offering a scalable path for continued intelligence progress.

19. Autonomous Data Agents: A New Opportunity for Smart Data

Authors: Yanjie Fu , Dongjie Wang , Wangyang Ying , Xiangliang Zhang , Huan Liu , Jian Pei
URL: https://arxiv.org/abs/2509.18710
Abstract:

As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.

20. Advances in Large Language Models for Medicine

Authors: Zhiyu Kan , Wensheng Gan , Zhenlian Qi , Philip S. Yu
URL: https://arxiv.org/abs/2509.18690
Abstract:

Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthcare settings, related applications, as well as their strengths and limitations. Furthermore, it innovatively categorizes medical LLMs into three distinct types based on their training methodologies and classifies their evaluation approaches into two categories. Finally, the study proposes solutions to existing challenges and outlines future research directions based on identified issues in the field of medical LLMs. By systematically reviewing previous and advanced research findings, we aim to highlight the necessity of developing medical LLMs, provide a deeper understanding of their current state of development, and offer clear guidance for subsequent research.

21. Implementation of airborne ML models with semantics preservation

Authors: Nicolas Valot , Louis Fabre , Benjamin Lesage , Ammar Mechouche , Claire Pagetti
URL: https://arxiv.org/abs/2509.18681
Abstract:

Machine Learning (ML) may offer new capabilities in airborne systems. However, as any piece of airborne systems, ML-based systems will be required to guarantee their safe operation. Thus, their development will have to be demonstrated to be compliant with the adequate guidance. So far, the European Union Aviation Safety Agency (EASA) has published a concept paper and an EUROCAE/SAE group is preparing ED-324. Both approaches delineate high-level objectives to confirm the ML model achieves its intended function and maintains training performance in the target environment. The paper aims to clarify the difference between an ML model and its corresponding unambiguous description, referred to as the Machine Learning Model Description (MLMD). It then refines the essential notion of semantics preservation to ensure the accurate replication of the model. We apply our contributions to several industrial use cases to build and compare several target models.

22. TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation

Authors: Qiao Xiao , Hong Ting Tsang , Jiaxin Bai
URL: https://arxiv.org/abs/2509.18667
Abstract:

Graph-based Retrieval-augmented generation (RAG) has become a widely studied approach for improving the reasoning, accuracy, and factuality of Large Language Models. However, many existing graph-based RAG systems overlook the high cost associated with LLM token usage during graph construction, hindering large-scale adoption. To address this, we propose TERAG, a simple yet effective framework designed to build informative graphs at a significantly lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the retrieval phase, and we achieve at least 80% of the accuracy of widely used graph-based RAG methods while consuming only 3%-11% of the output tokens.

23. Adaptive Learning in Spatial Agent-Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents

Authors: Yara Mohajerani
URL: https://arxiv.org/abs/2509.18633
Abstract:

Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. We present a novel geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. Our framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment, introducing adaptive learning behaviours that allow firms to evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. We demonstrate the framework using riverine flood projections under RCP8.5 until 2100, showing that evolutionary adaptation enables firms to converge with baseline (no hazard) production levels after decades of disruption due to climate stress. Our results reveal systemic risks where even agents that are not directly exposed to floods face impacts through supply chain disruptions, with the end-of-century average price of goods 5.6% higher under RCP8.5 compared to the baseline. This open-source framework provides financial institutions and companies with tools to quantify both direct and cascading climate risks while evaluating cost-effective adaptation strategies.

24. Solving Math Word Problems Using Estimation Verification and Equation Generation

Authors: Mitchell Piehl , Dillon Wilson , Ananya Kalita , Jugal Kalita
URL: https://arxiv.org/abs/2509.18565
Abstract:

Large Language Models (LLMs) excel at various tasks, including problem-solving and question-answering. However, LLMs often find Math Word Problems (MWPs) challenging because solving them requires a range of reasoning and mathematical abilities with which LLMs seem to struggle. Recent efforts have helped LLMs solve more complex MWPs with improved prompts. This study proposes a novel method that initially prompts an LLM to create equations from a decomposition of the question, followed by using an external symbolic equation solver to produce an answer. To ensure the accuracy of the obtained answer, inspired by an established recommendation of math teachers, the LLM is instructed to solve the MWP a second time, but this time with the objective of estimating the correct answer instead of solving it exactly. The estimation is then compared to the generated answer to verify. If verification fails, an iterative rectification process is employed to ensure the correct answer is eventually found. This approach achieves new state-of-the-art results on datasets used by prior published research on numeric and algebraic MWPs, improving the previous best results by nearly two percent on average. In addition, the approach obtains satisfactory results on trigonometric MWPs, a task not previously attempted to the authors’ best knowledge. This study also introduces two new datasets, SVAMPClean and Trig300, to further advance the testing of LLMs’ reasoning abilities.

25. LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Authors: Tom Pawelek , Raj Patel , Charlotte Crowell , Noorbakhsh Amiri , Sudip Mittal , Shahram Rahimi , Andy Perkins
URL: https://arxiv.org/abs/2509.18557
Abstract:

Compared to traditional models, agentic AI represents a highly valuable target for potential attackers as they possess privileged access to data sources and API tools, which are traditionally not incorporated into classical agents. Unlike a typical software application residing in a Demilitarized Zone (DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI (only defining a final goal, leaving the path selection to LLM). This characteristic introduces substantial security risk to both operational security and information security. Most common existing defense mechanism rely on detection of malicious intent and preventing it from reaching the LLM agent, thus protecting against jailbreak attacks such as prompt injection. In this paper, we present an alternative approach, LLMZ+, which moves beyond traditional detection-based approaches by implementing prompt whitelisting. Through this method, only contextually appropriate and safe messages are permitted to interact with the agentic LLM. By leveraging the specificity of context, LLMZ+ guarantees that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries. Our approach streamlines the security framework, enhances its long-term resilience, and reduces the resources required for sustaining LLM information security. Our empirical evaluation demonstrates that LLMZ+ provides strong resilience against the most common jailbreak prompts. At the same time, legitimate business communications are not disrupted, and authorized traffic flows seamlessly between users and the agentic LLM. We measure the effectiveness of approach using false positive and false negative rates, both of which can be reduced to 0 in our experimental setting.

26. FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule Reasoning

Authors: Ziwen Chen , Zhong Wang
URL: https://arxiv.org/abs/2509.18527
Abstract:

The sport of fencing, like many other sports, faces challenges in refereeing: subjective calls, human errors, bias, and limited availability in practice environments. We present FERA (Fencing Referee Assistant), a prototype AI referee for foil fencing which integrates pose-based multi-label action recognition and rule-based reasoning. FERA extracts 2D joint positions from video, normalizes them, computes a 101-dimensional kinematic feature set, and applies a Transformer for multi-label move and blade classification. To determine priority and scoring, FERA applies a distilled language model with encoded right-of-way rules, producing both a decision and an explanation for each exchange. With limited hand-labeled data, a 5-fold cross-validation achieves an average macro-F1 score of 0.549, outperforming multiple baselines, including a Temporal Convolutional Network (TCN), BiLSTM, and a vanilla Transformer. While not ready for deployment, these results demonstrate a promising path towards automated referee assistance in foil fencing and new opportunities for AI applications, such as coaching in the field of fencing.

27. Memory-QA: Answering Recall Questions Based on Multimodal Memories

Authors: Hongda Jiang , Xinyuan Zhang , Siddhant Garg , Rishab Arora , Shiun-Zu Kuo , Jiayang Xu , Christopher Brossman , Yue Liu , Aaron Colak , Ahmed Aly , Anuj Kumar , Xin Luna Dong
URL: https://arxiv.org/abs/2509.18436
Abstract:

We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% on QA accuracy).

28. Instruction-Following Evaluation in Function Calling for Large Language Models

Authors: Nikolai Skripko
URL: https://arxiv.org/abs/2509.18420
Abstract:

Function calling is a core capability of large language models, essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench ( arXiv:2506.07982 ), and ACEBench ( arXiv:2501.12851 ) evaluate argument correctness but do not test adherence to format instructions embedded in parameter descriptions, such as enclosing values in double quotes or using ISO date formats. We introduce IFEval-FC, a benchmark inspired by IFEval ( arXiv:2311.07911 ) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example specifying that a value must not contain punctuation. It includes 750 test cases, each consisting of a function with an embedded format for one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, ensuring objectivity, reproducibility, and scalability. Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at this https URL .

29. ATLAS: Benchmarking and Adapting LLMs for Global Trade via Harmonized Tariff Code Classification

Authors: Pritish Yuvraj , Siva Devarakonda
URL: https://arxiv.org/abs/2509.18400
Abstract:

Accurate classification of products under the Harmonized Tariff Schedule (HTS) is a critical bottleneck in global trade, yet it has received little attention from the machine learning community. Misclassification can halt shipments entirely, with major postal operators suspending deliveries to the U.S. due to incomplete customs documentation. We introduce the first benchmark for HTS code classification, derived from the U.S. Customs Rulings Online Search System (CROSS). Evaluating leading LLMs, we find that our fine-tuned Atlas model (LLaMA-3.3-70B) achieves 40 percent fully correct 10-digit classifications and 57.5 percent correct 6-digit classifications, improvements of 15 points over GPT-5-Thinking and 27.5 points over Gemini-2.5-Pro-Thinking. Beyond accuracy, Atlas is roughly five times cheaper than GPT-5-Thinking and eight times cheaper than Gemini-2.5-Pro-Thinking, and can be self-hosted to guarantee data privacy in high-stakes trade and compliance workflows. While Atlas sets a strong baseline, the benchmark remains highly challenging, with only 40 percent 10-digit accuracy. By releasing both dataset and model, we aim to position HTS classification as a new community benchmark task and invite future work in retrieval, reasoning, and alignment.

30. Gödel Test: Can Large Language Models Solve Easy Conjectures?

Authors: Moran Feldman , Amin Karbasi
URL: https://arxiv.org/abs/2509.18383
Abstract:

Recent announcements from frontier AI model labs have highlighted strong results on high-school and undergraduate math competitions. Yet it remains unclear whether large language models can solve new, simple conjectures in more advanced areas of mathematics. We propose the Gödel Test: evaluating whether a model can produce correct proofs for very simple, previously unsolved conjectures. To this end, we study the performance of GPT-5 on five conjectures in combinatorial optimization. For each problem, we provided one or two source papers from which the conjecture arose, withheld our own conjecture, and then assessed the model’s reasoning in detail. On the three easier problems, GPT-5 produced nearly correct solutions; for Problem 2 it even derived a different approximation guarantee that, upon checking, refuted our conjecture while providing a valid solution. The model failed on Problem 4, which required combining results from two papers. On Problem 5, a harder case without a validated conjecture, GPT-5 proposed the same algorithm we had in mind but failed in the analysis, suggesting the proof is more challenging than expected. Although our sample is small, the results point to meaningful progress on routine reasoning, occasional flashes of originality, and clear limitations when cross-paper synthesis is required. GPT-5 may represent an early step toward frontier models eventually passing the Gödel Test.

31. Evaluating the Safety and Skill Reasoning of Large Reasoning Models Under Compute Constraints

Authors: Adarsha Balaji , Le Chen , Rajeev Thakur , Franck Cappello , Sandeep Madireddy
URL: https://arxiv.org/abs/2509.18382
Abstract:

Test-time compute scaling has demonstrated the ability to improve the performance of reasoning language models by generating longer chain-of-thought (CoT) sequences. However, this increase in performance comes with a significant increase in computational cost. In this work, we investigate two compute constraint strategies: (1) reasoning length constraint and (2) model quantization, as methods to reduce the compute demand of reasoning models and study their impact on their safety performance. Specifically, we explore two approaches to apply compute constraints to reasoning models: (1) fine-tuning reasoning models using a length controlled policy optimization (LCPO) based reinforcement learning method to satisfy a user-defined CoT reasoning length, and (2) applying quantization to maximize the generation of CoT sequences within a user-defined compute constraint. Furthermore, we study the trade-off between the computational efficiency and the safety of the model.

32. The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks

Authors: Yu Gu , Jingjing Fu , Xiaodong Liu , Jeya Maria Jose Valanarasu , Noel Codella , Reuben Tan , Qianchu Liu , Ying Jin , Sheng Zhang , Jinyu Wang , Rui Wang , Lei Song , Guanghui Qin , Naoto Usuyama , Cliff Wong , Cheng Hao , Hohin Lee , Praneeth Sanapathi , Sarah Hilado , Bian Jiang , Javier Alvarez-Valle , Mu Wei , Jianfeng Gao , Eric Horvitz , Matt Lungren , Hoifung Poon , Paul Vozila
URL: https://arxiv.org/abs/2509.18234
Abstract:

Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren’t glitches; they expose how today’s benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.

33. Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces

Authors: Zihan Dong , Xinyu Fan , Zixiang Tang , Yunqing Li
URL: https://arxiv.org/abs/2509.18230
Abstract:

Controlling desktop applications via software remains a fundamental yet under-served problem. Existing multi-modal large language models (MLLMs) ingest screenshots and task instructions to generate keystrokes and mouse events, but they suffer from prohibitive inference latency, poor sample efficiency on long-horizon sparse-reward tasks, and infeasible on-device deployment. We introduce a lightweight hierarchical reinforcement learning framework, ComputerAgent, that formulates OS control as a two-level option process (manager and subpolicy), employs a triple-modal state encoder (screenshot, task ID, numeric state) to handle visual and contextual diversity, integrates meta-actions with an early-stop mechanism to reduce wasted interactions, and uses a compact vision backbone plus small policy networks for on-device inference (15M parameters). On a suite of 135 real-world desktop tasks, ComputerAgent attains 92.1% success on simple tasks (<8 steps) and 58.8% on hard tasks (>=8 steps), matching or exceeding 200B-parameter MLLM baselines on simple scenarios while reducing model size by over four orders of magnitude and halving inference time. These results demonstrate that hierarchical RL offers a practical, scalable alternative to monolithic MLLM-based automation for computer control.

34. An N-Plus-1 GPT Agency for Critical Solution of Mechanical Engineering Analysis Problems

Authors: Anthony Patera , Rohan Abeyaratne
URL: https://arxiv.org/abs/2509.18229
Abstract:

Generative AI, and specifically GPT, can produce a remarkable solution to a mechanical engineering analysis problem - but also, on occasion, a flawed solution. For example, an elementary mechanics problem is solved flawlessly in one GPT instance and incorrectly in a subsequent GPT instance, with a success probability of only 85%. This unreliability renders “out-of-the-box” GPT unsuitable for deployment in education or engineering practice. We introduce an “N-Plus-1” GPT Agency for Initial (Low-Cost) Analysis of mechanical engineering Problem Statements. Agency first launches N instantiations of Agent Solve to yield N independent Proposed Problem Solution Realizations; Agency then invokes Agent Compare to summarize and compare the N Proposed Problem Solution Realizations and to provide a Recommended Problem Solution. We argue from Condorcet’s Jury Theorem that, for a Problem Statement characterized by per-Solve success probability greater than 1/2 (and N sufficiently large), the Predominant (Agent Compare) Proposed Problem Solution will, with high probability, correspond to a Correct Proposed Problem Solution. Furthermore, Agent Compare can also incorporate aspects of Secondary (Agent Compare) Proposed Problem Solutions, in particular when the latter represent alternative Problem Statement interpretations - different Mathematical Models - or alternative Mathematical Solution Procedures. Comparisons to Grok Heavy, a commercial multi-agent model, show similarities in design and performance, but also important differences in emphasis: our Agency focuses on transparency and pedagogical value.

35. From “What to Eat?” to Perfect Recipe: ChefMind’s Chain-of-Exploration for Ambiguous User Intent in Recipe Recommendation

Authors: Yu Fu , Linyue Cai , Ruoyu Wu , Yong Zhao
URL: https://arxiv.org/abs/2509.18226
Abstract:

Personalized recipe recommendation faces challenges in handling fuzzy user intent, ensuring semantic accuracy, and providing sufficient detail coverage. We propose ChefMind, a hybrid architecture combining Chain of Exploration (CoE), Knowledge Graph (KG), Retrieval-Augmented Generation (RAG), and a Large Language Model (LLM). CoE refines ambiguous queries into structured conditions, KG offers semantic reasoning and interpretability, RAG supplements contextual culinary details, and LLM integrates outputs into coherent recommendations. We evaluate ChefMind on the Xiachufang dataset and manually annotated queries, comparing it with LLM-only, KG-only, and RAG-only baselines. Results show that ChefMind achieves superior performance in accuracy, relevance, completeness, and clarity, with an average score of 8.7 versus 6.4-6.7 for ablation models. Moreover, it reduces unprocessed queries to 1.6%, demonstrating robustness in handling fuzzy demands.

36. Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models

Authors: Dingxin Lu , Shurui Wu , Xinyi Huang
URL: https://arxiv.org/abs/2509.18221
Abstract:

With the rising global burden of chronic diseases and the multimodal and heterogeneous clinical data (medical imaging, free-text recordings, wearable sensor streams, etc.), there is an urgent need for a unified multimodal AI framework that can proactively predict individual health risks. We propose VL-RiskFormer, a hierarchical stacked visual-language multimodal Transformer with a large language model (LLM) inference head embedded in its top layer. The system builds on the dual-stream architecture of existing visual-linguistic models (e.g., PaLM-E, LLaVA) with four key innovations: (i) pre-training with cross-modal comparison and fine-grained alignment of radiological images, fundus maps, and wearable device photos with corresponding clinical narratives using momentum update encoders and debiased InfoNCE losses; (ii) a time fusion block that integrates irregular visit sequences into the causal Transformer decoder through adaptive time interval position coding; (iii) a disease ontology map adapter that injects ICD-10 codes into visual and textual channels in layers and infers comorbid patterns with the help of a graph attention mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an average AUROC of 0.90 with an expected calibration error of 2.7 percent.

37. Similarity Field Theory: A Mathematical Framework for Intelligence

Authors: Kei-Sing Ng
URL: https://arxiv.org/abs/2509.18218
Abstract:

We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p = (X_p, S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_{\alpha}(K) = { E \in U \mid S(E,K) \ge \alpha }$, i.e., superlevel sets of the unary map $S_K(E) := S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability requires either an anchor coordinate or eventual confinement within a level set of $f$. These results ensure that the evolution of similarity fields is both constrained and interpretable, culminating in an exploration of how the framework allows us to interpret large language models and use them as experimental probes into societal cognition.

38. nDNA – the Semantic Helix of Artificial Cognition

Authors: Amitava Das
URL: https://arxiv.org/abs/2509.18216
Abstract:

As AI foundation models grow in capability, a deeper question emerges: What shapes their internal cognitive identity – beyond fluency and output? Benchmarks measure behavior, but the soul of a model resides in its latent geometry. In this work, we propose Neural DNA (nDNA) as a semantic-genotypic representation that captures this latent identity through the intrinsic geometry of belief. At its core, nDNA is synthesized from three principled and indispensable dimensions of latent geometry: spectral curvature, which reveals the curvature of conceptual flow across layers; thermodynamic length, which quantifies the semantic effort required to traverse representational transitions through layers; and belief vector field, which delineates the semantic torsion fields that guide a model’s belief directional orientations. Like biological DNA, it encodes ancestry, mutation, and semantic inheritance, found in finetuning and alignment scars, cultural imprints, and architectural drift. In naming it, we open a new field: Neural Genomics, where models are not just tools, but digital semantic organisms with traceable inner cognition. Modeling statement. We read AI foundation models as semantic fluid–dynamics: meaning is transported through layers like fluid in a shaped conduit; nDNA is the physics-grade readout of that flow – a geometry-first measure of how meaning is bent, paid for, and pushed – yielding a stable, coordinate-free neural DNA fingerprint tied to on-input behavior; with this fingerprint we cross into biology: tracing lineages across pretraining, fine-tuning, alignment, pruning, distillation, and merges; measuring inheritance between checkpoints; detecting drift as traits shift under new data or objectives; and, ultimately, studying the evolution of artificial cognition to compare models, diagnose risks, and govern change over time.

39. Change in Quantitative Bipolar Argumentation: Sufficient, Necessary, and Counterfactual Explanations

Authors: Timotheus Kampik , Kristijonas Čyras , José Ruiz Alarcón
URL: https://arxiv.org/abs/2509.18215
Abstract:

This paper presents a formal approach to explaining change of inference in Quantitative Bipolar Argumentation Frameworks (QBAFs). When drawing conclusions from a QBAF and updating the QBAF to then again draw conclusions (and so on), our approach traces changes – which we call strength inconsistencies – in the partial order over argument strengths that a semantics establishes on some arguments of interest, called topic arguments. We trace the causes of strength inconsistencies to specific arguments, which then serve as explanations. We identify sufficient, necessary, and counterfactual explanations for strength inconsistencies and show that strength inconsistency explanations exist if and only if an update leads to strength inconsistency. We define a heuristic-based approach to facilitate the search for strength inconsistency explanations, for which we also provide an implementation.

Authors: Rui Liu , Zikang Wang , Peng Gao , Yu Shen , Pratap Tokekar , Ming Lin
URL: https://arxiv.org/abs/2509.18198
Abstract:

Autonomous systems have advanced significantly, but challenges persist in accident-prone environments where robust decision-making is crucial. A single vehicle’s limited sensor range and obstructed views increase the likelihood of accidents. Multi-vehicle connected systems and multi-modal approaches, leveraging RGB images and LiDAR point clouds, have emerged as promising solutions. However, existing methods often assume the availability of all data modalities and connected vehicles during both training and testing, which is impractical due to potential sensor failures or missing connected vehicles. To address these challenges, we introduce a novel framework MMCD (Multi-Modal Collaborative Decision-making) for connected autonomy. Our framework fuses multi-modal observations from ego and collaborative vehicles to enhance decision-making under challenging conditions. To ensure robust performance when certain data modalities are unavailable during testing, we propose an approach based on cross-modal knowledge distillation with a teacher-student model structure. The teacher model is trained with multiple data modalities, while the student model is designed to operate effectively with reduced modalities. In experiments on $\textit{connected autonomous driving with ground vehicles}$ and $\textit{aerial-ground vehicles collaboration}$, our method improves driving safety by up to ${\it 20.7}\%$, surpassing the best-existing baseline in detecting potential accidents and making safe driving decisions. More information can be found on our website this https URL .

41. An Outcome-Based Educational Recommender System

Authors: Nursultan Askarbekuly , Timur Fayzrakhmanov , Sladjan Babarogić , Ivan Luković
URL: https://arxiv.org/abs/2509.18186
Abstract:

Most educational recommender systems are tuned and judged on click- or rating-based relevance, leaving their true pedagogical impact unclear. We introduce OBER-an Outcome-Based Educational Recommender that embeds learning outcomes and assessment items directly into the data schema, so any algorithm can be evaluated on the mastery it fosters. OBER uses a minimalist entity-relation model, a log-driven mastery formula, and a plug-in architecture. Integrated into an e-learning system in non-formal domain, it was evaluated trough a two-week randomized split test with over 5 700 learners across three methods: fixed expert trajectory, collaborative filtering (CF), and knowledge-based (KB) filtering. CF maximized retention, but the fixed path achieved the highest mastery. Because OBER derives business, relevance, and learning metrics from the same logs, it lets practitioners weigh relevance and engagement against outcome mastery with no extra testing overhead. The framework is method-agnostic and readily extensible to future adaptive or context-aware recommenders.

42. Synthesizing Attitudes, Predicting Actions (SAPA): Behavioral Theory-Guided LLMs for Ridesourcing Mode Choice Modeling

Authors: Mustafa Sameen , Xiaojian Zhang , Xilei Zhao
URL: https://arxiv.org/abs/2509.18181
Abstract:

Accurate modeling of ridesourcing mode choices is essential for designing and implementing effective traffic management policies for reducing congestion, improving mobility, and allocating resources more efficiently. Existing models for predicting ridesourcing mode choices often suffer from limited predictive accuracy due to their inability to capture key psychological factors, and are further challenged by severe class imbalance, as ridesourcing trips comprise only a small fraction of individuals’ daily travel. To address these limitations, this paper introduces the Synthesizing Attitudes, Predicting Actions (SAPA) framework, a hierarchical approach that uses Large Language Models (LLMs) to synthesize theory-grounded latent attitudes to predict ridesourcing choices. SAPA first uses an LLM to generate qualitative traveler personas from raw travel survey data and then trains a propensity-score model on demographic and behavioral features, enriched by those personas, to produce an individual-level score. Next, the LLM assigns quantitative scores to theory-driven latent variables (e.g., time and cost sensitivity), and a final classifier integrates the propensity score, latent-variable scores (with their interaction terms), and observable trip attributes to predict ridesourcing mode choice. Experiments on a large-scale, multi-year travel survey show that SAPA significantly outperforms state-of-the-art baselines, improving ridesourcing choice predictions by up to 75.9% in terms of PR-AUC on a held-out test set. This study provides a powerful tool for accurately predicting ridesourcing mode choices, and provides a methodology that is readily transferable to various applications.

43. Large Language Models and Operations Research: A Structured Survey

Authors: Yang Wang , Kai Li
URL: https://arxiv.org/abs/2509.18180
Abstract:

Operations research (OR) provides fundamental methodologies for complex system decision-making, with established applications in transportation, supply chain management, and production scheduling. Traditional approaches, which depend on expert-based modeling and manual parameter adjustment, often face challenges in handling large-scale, dynamic, and multi-constraint problems. Recently, large language models (LLMs) have shown potential to address these limitations through semantic understanding, structured generation, and reasoning control. LLMs can translate natural language descriptions into mathematical models or executable code, generate heuristics, evolve algorithms, and directly tackle optimization tasks. This paper surveys recent progress on the integration of LLMs into OR, organizing methods into three main directions: automatic modeling, auxiliary optimization, and direct solving. It further reviews evaluation benchmarks and domain-specific applications, and summarizes key open issues such as unstable semantic-to-structure mapping, fragmented research progress, limited generalization, and insufficient evaluation systems. Finally, the survey outlines possible research avenues for advancing the role of LLMs in OR.

44. Foam-Agent: An End-to-End Composable Multi-Agent Framework for Automating CFD Simulation in OpenFOAM

Authors: Ling Yue , Nithin Somasekharan , Tingwen Zhang , Yadi Cao , Shaowu Pan
URL: https://arxiv.org/abs/2509.18178
Abstract:

Computational Fluid Dynamics (CFD) is an essential simulation tool in engineering, yet its steep learning curve and complex manual setup create significant barriers. To address these challenges, we introduce Foam-Agent, a multi-agent framework that automates the entire end-to-end OpenFOAM workflow from a single natural language prompt. Our key innovations address critical gaps in existing systems: 1. An Comprehensive End-to-End Simulation Automation: Foam-Agent is the first system to manage the full simulation pipeline, including advanced pre-processing with a versatile Meshing Agent capable of handling external mesh files and generating new geometries via Gmsh, automatic generation of HPC submission scripts, and post-simulation visualization via ParaView. 2. Composable Service Architecture: Going beyond a monolithic agent, the framework uses Model Context Protocol (MCP) to expose its core functions as discrete, callable tools. This allows for flexible integration and use by other agentic systems, such as Claude-code, for more exploratory workflows. 3. High-Fidelity Configuration Generation: We achieve superior accuracy through a Hierarchical Multi-Index RAG for precise context retrieval and a dependency-aware generation process that ensures configuration consistency. Evaluated on a benchmark of 110 simulation tasks, Foam-Agent achieves an 88.2% success rate with Claude 3.5 Sonnet, significantly outperforming existing frameworks (55.5% for MetaOpenFOAM). Foam-Agent dramatically lowers the expertise barrier for CFD, demonstrating how specialized multi-agent systems can democratize complex scientific computing. The code is public at this https URL .

45. HSGM: Hierarchical Segment-Graph Memory for Scalable Long-Text Semantics

Authors: Dong Liu , Yanxuan Yu
URL: https://arxiv.org/abs/2509.18168
Abstract:

Semantic parsing of long documents remains challenging due to quadratic growth in pairwise composition and memory requirements. We introduce \textbf{Hierarchical Segment-Graph Memory (HSGM)}, a novel framework that decomposes an input of length $N$ into $M$ meaningful segments, constructs \emph{Local Semantic Graphs} on each segment, and extracts compact \emph{summary nodes} to form a \emph{Global Graph Memory}. HSGM supports \emph{incremental updates} – only newly arrived segments incur local graph construction and summary-node integration – while \emph{Hierarchical Query Processing} locates relevant segments via top-$K$ retrieval over summary nodes and then performs fine-grained reasoning within their local graphs. Theoretically, HSGM reduces worst-case complexity from $O(N^2)$ to $O!\left(N\,k + (N/k)^2\right)$, with segment size $k \ll N$, and we derive Frobenius-norm bounds on the approximation error introduced by node summarization and sparsification thresholds. Empirically, on three benchmarks – long-document AMR parsing, segment-level semantic role labeling (OntoNotes), and legal event extraction – HSGM achieves \emph{2–4$\times$ inference speedup}, \emph{$>60\%$ reduction} in peak memory, and \emph{$\ge 95\%$} of baseline accuracy. Our approach unlocks scalable, accurate semantic modeling for ultra-long texts, enabling real-time and resource-constrained NLP applications.

46. Position Paper: Integrating Explainability and Uncertainty Estimation in Medical AI

Authors: Xiuyi Fan
URL: https://arxiv.org/abs/2509.18132
Abstract:

Uncertainty is a fundamental challenge in medical practice, but current medical AI systems fail to explicitly quantify or communicate uncertainty in a way that aligns with clinical reasoning. Existing XAI works focus on interpreting model predictions but do not capture the confidence or reliability of these predictions. Conversely, uncertainty estimation (UE) techniques provide confidence measures but lack intuitive explanations. The disconnect between these two areas limits AI adoption in medicine. To address this gap, we propose Explainable Uncertainty Estimation (XUE) that integrates explainability with uncertainty quantification to enhance trust and usability in medical AI. We systematically map medical uncertainty to AI uncertainty concepts and identify key challenges in implementing XUE. We outline technical directions for advancing XUE, including multimodal uncertainty quantification, model-agnostic visualization techniques, and uncertainty-aware decision support systems. Lastly, we propose guiding principles to ensure effective XUE realisation. Our analysis highlights the need for AI systems that not only generate reliable predictions but also articulate confidence levels in a clinically meaningful way. This work contributes to the development of trustworthy medical AI by bridging explainability and uncertainty, paving the way for AI systems that are aligned with real-world clinical complexities.

47. SPADE: A Large Language Model Framework for Soil Moisture Pattern Recognition and Anomaly Detection in Precision Agriculture

Authors: Yeonju Lee , Rui Qi Chen , Joseph Oboamah , Po Nien Su , Wei-zhen Liang , Yeyin Shi , Lu Gan , Yongsheng Chen , Xin Qiao , Jing Li
URL: https://arxiv.org/abs/2509.18123
Abstract:

Accurate interpretation of soil moisture patterns is critical for irrigation scheduling and crop management, yet existing approaches for soil moisture time-series analysis either rely on threshold-based rules or data-hungry machine learning or deep learning models that are limited in adaptability and interpretability. In this study, we introduce SPADE (Soil moisture Pattern and Anomaly DEtection), an integrated framework that leverages large language models (LLMs) to jointly detect irrigation patterns and anomalies in soil moisture time-series data. SPADE utilizes ChatGPT-4.1 for its advanced reasoning and instruction-following capabilities, enabling zero-shot analysis without requiring task-specific annotation or fine-tuning. By converting time-series data into a textual representation and designing domain-informed prompt templates, SPADE identifies irrigation events, estimates net irrigation gains, detects, classifies anomalies, and produces structured, interpretable reports. Experiments were conducted on real-world soil moisture sensor data from commercial and experimental farms cultivating multiple crops across the United States. Results demonstrate that SPADE outperforms the existing method in anomaly detection, achieving higher recall and F1 scores and accurately classifying anomaly types. Furthermore, SPADE achieved high precision and recall in detecting irrigation events, indicating its strong capability to capture irrigation patterns accurately. SPADE’s reports provide interpretability and usability of soil moisture analytics. This study highlights the potential of LLMs as scalable, adaptable tools for precision agriculture, which is capable of integrating qualitative knowledge and data-driven reasoning to produce actionable insights for accurate soil moisture monitoring and improved irrigation scheduling from soil moisture time-series data.

48. A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services

Authors: Guanzhong Pan , Haibo Wang
URL: https://arxiv.org/abs/2509.18101
Abstract:

Large language models (LLMs) are becoming increasingly widespread. Organizations that want to use AI for productivity now face an important decision. They can subscribe to commercial LLM services or deploy models on their own infrastructure. Cloud services from providers such as OpenAI, Anthropic, and Google are attractive because they provide easy access to state-of-the-art models and are easy to scale. However, concerns about data privacy, the difficulty of switching service providers, and long-term operating costs have driven interest in local deployment of open-source models. This paper presents a cost-benefit analysis framework to help organizations determine when on-premise LLM deployment becomes economically viable compared to commercial subscription services. We consider the hardware requirements, operational expenses, and performance benchmarks of the latest open-source models, including Qwen, Llama, Mistral, and etc. Then we compare the total cost of deploying these models locally with the major cloud providers subscription fee. Our findings provide an estimated breakeven point based on usage levels and performance needs. These results give organizations a practical framework for planning their LLM strategies.

49. Audio-Based Pedestrian Detection in the Presence of Vehicular Noise

Authors: Yonghyun Kim , Chaeyeon Han , Akash Sarode , Noah Posner , Subhrajit Guhathakurta , Alexander Lerch
URL: https://arxiv.org/abs/2509.19295
Abstract:

Audio-based pedestrian detection is a challenging task and has, thus far, only been explored in noise-limited environments. We present a new dataset, results, and a detailed analysis of the state-of-the-art in audio-based pedestrian detection in the presence of vehicular noise. In our study, we conduct three analyses: (i) cross-dataset evaluation between noisy and noise-limited environments, (ii) an assessment of the impact of noisy data on model performance, highlighting the influence of acoustic context, and (iii) an evaluation of the model’s predictive robustness on out-of-domain sounds. The new dataset is a comprehensive 1321-hour roadside dataset. It incorporates traffic-rich soundscapes. Each recording includes 16kHz audio synchronized with frame-level pedestrian annotations and 1fps video thumbnails.

50. SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration

Authors: Yang Jin , Jun Lv , Han Xue , Wendi Chen , Chuan Wen , Cewu Lu
URL: https://arxiv.org/abs/2509.19292
Abstract:

Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement. Project website: this https URL

51. MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multilesion interactive segmentation of neurobromas in whole-body MRI

Authors: Georgii Kolokolnikov , Marie-Lena Schmalhofer , Sophie Götz , Lennart Well , Said Farschtschi , Victor-Felix Mautner , Inka Ristow , Rene Werner
URL: https://arxiv.org/abs/2509.19277
Abstract:

Background and Objectives: Neurofibromatosis type 1 is a genetic disorder characterized by the development of numerous neurofibromas (NFs) throughout the body. Whole-body MRI (WB-MRI) is the clinical standard for detection and longitudinal surveillance of NF tumor growth. Existing interactive segmentation methods fail to combine high lesion-wise precision with scalability to hundreds of lesions. This study proposes a novel interactive segmentation model tailored to this challenge. Methods: We introduce MOIS-SAM2, a multi-object interactive segmentation model that extends the state-of-the-art, transformer-based, promptable Segment Anything Model 2 (SAM2) with exemplar-based semantic propagation. MOIS-SAM2 was trained and evaluated on 119 WB-MRI scans from 84 NF1 patients acquired using T2-weighted fat-suppressed sequences. The dataset was split at the patient level into a training set and four test sets (one in-domain and three reflecting different domain shift scenarios, e.g., MRI field strength variation, low tumor burden, differences in clinical site and scanner vendor). Results: On the in-domain test set, MOIS-SAM2 achieved a scan-wise DSC of 0.60 against expert manual annotations, outperforming baseline 3D nnU-Net (DSC: 0.54) and SAM2 (DSC: 0.35). Performance of the proposed model was maintained under MRI field strength shift (DSC: 0.53) and scanner vendor variation (DSC: 0.50), and improved in low tumor burden cases (DSC: 0.61). Lesion detection F1 scores ranged from 0.62 to 0.78 across test sets. Preliminary inter-reader variability analysis showed model-to-expert agreement (DSC: 0.62-0.68), comparable to inter-expert agreement (DSC: 0.57-0.69). Conclusions: The proposed MOIS-SAM2 enables efficient and scalable interactive segmentation of NFs in WB-MRI with minimal user input and strong generalization, supporting integration into clinical workflows.

52. WolBanking77: Wolof Banking Speech Intent Classification Dataset

Authors: Abdou Karim Kandji , Frédéric Precioso , Cheikh Ba , Samba Ndiaye , Augustin Ndione
URL: https://arxiv.org/abs/2509.19271
Abstract:

Intent classification models have made a lot of progress in recent years. However, previous studies primarily focus on high-resource languages datasets, which results in a gap for low-resource languages and for regions with a high rate of illiterate people where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90\% of the population, with an illiteracy rate of 42\% for the country. Wolof is actually spoken by more than 10 million people in West African region. To tackle such limitations, we release a Wolof Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results are very promising on this current dataset. This paper also provides detailed analyses of the contents of the data. We report baseline f1-score and word error rate metrics respectively on NLP and ASR models trained on WolBanking77 dataset and also comparisons between models. We plan to share and conduct dataset maintenance, updates and to release open-source code.

53. SloPalSpeech: A 2,8000-Hour Slovak Speech Corpus from Parliamentary Data

Authors: Erik Božík , Marek Šuppa
URL: https://arxiv.org/abs/2509.19270
Abstract:

Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model’s WER dropped by up to 70\%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.

54. Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

Authors: Gabriel Maldonado , Narges Rashvand , Armin Danesh Pazho , Ghazal Alinezhad Noghre , Vinit Katariya , Hamed Tabkhi
URL: https://arxiv.org/abs/2509.19252
Abstract:

Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method’s superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion’s complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at this https URL .

55. Reinforcement Learning on Pre-Training Data

Authors: Siheng Li , Kejiao Li , Zenan Xu , Guanhua Huang , Evander Yang , Kun Li , Haoyuan Wu , Jiajia Wu , Zihao Zheng , Chenchen Zhang , Kun Shi , Kyrierl Deng , Qi Yi , Ruibin Xiong , Tingqiang Xu , Yuhao Jiang , Jianfeng Yan , Yuyuan Zeng , Guanghui Xu , Jinbao Xue , Zhijiang Xu , Zheng Fang , Shuai Li , Qibin Liu , Xiaoxue Li , Zhuoyu Li , Yangyu Tao , Fei Gao , Cheng Jiang , Bo Chao Wang , Kai Liu , Jianchen Zhu , Wai Lam , Wayyt Wang , Bo Zhou , Di Wang
URL: https://arxiv.org/abs/2509.19249
Abstract:

The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.

56. Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

Authors: Karen Rosero , Eunjung Yeo , David R. Mortensen , Cortney Van’t Slot , Rami R. Hallac , Carlos Busso
URL: https://arxiv.org/abs/2509.19231
Abstract:

We present ChiReSSD, a speech reconstruction framework that preserves children speaker’s identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.

57. MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

Authors: Tongshuai Wu , Chao Lu , Ze Song , Yunlong Lin , Sizhe Fan , Xuemei Chen
URL: https://arxiv.org/abs/2509.19227
Abstract:

With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing complex, asynchronous multi-temporal behavioral cues preceding accidents. To deal with these two challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term and long-term temporal scales. Meanwhile, the Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module in MsFIN, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.

58. Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction

Authors: Tariq Abdul-Quddoos , Xishuang Dong , Lijun Qian
URL: https://arxiv.org/abs/2509.19224
Abstract:

Attention-based models have become the leading approach in modeling medical language for Natural Language Processing (NLP) in clinical notes. These models outperform traditional techniques by effectively capturing contextual rep- resentations of language. In this research a comparative analysis is done amongst pre- trained attention based models namely Bert Base, BioBert, two variations of Bio+Clinical Bert, RoBerta, and Clinical Long- former on task related to Electronic Health Record (EHR) information extraction. The tasks from Track 1 of Harvard Medical School’s 2022 National Clinical NLP Challenges (n2c2) are considered for this comparison, with the Contextualized Medication Event Dataset (CMED) given for these task. CMED is a dataset of unstructured EHRs and annotated notes that contain task relevant information about the EHRs. The goal of the challenge is to develop effective solutions for extracting contextual information related to patient medication events from EHRs using data driven methods. Each pre-trained model is fine-tuned and applied on CMED to perform medication extraction, medical event detection, and multi-dimensional medication event context classification. Pro- cessing methods are also detailed for breaking down EHRs for compatibility with the applied models. Performance analysis has been carried out using a script based on constructing medical terms from the evaluation portion of CMED with metrics including recall, precision, and F1-Score. The results demonstrate that models pre-trained on clinical data are more effective in detecting medication and medication events, but Bert Base, pre- trained on general domain data showed to be the most effective for classifying the context of events related to medications.

59. FedFusion: Federated Learning with Diversity- and Cluster-Aware Encoders for Robust Adaptation under Label Scarcity

Authors: Ferdinand Kahenga , Antoine Bagula , Patrick Sello , Sajal K. Das
URL: https://arxiv.org/abs/2509.19220
Abstract:

Federated learning in practice must contend with heterogeneous feature spaces, severe non-IID data, and scarce labels across clients. We present FedFusion, a federated transfer-learning framework that unifies domain adaptation and frugal labelling with diversity-/cluster-aware encoders (DivEn, DivEn-mix, DivEn-c). Labelled teacher clients guide learner clients via confidence-filtered pseudo-labels and domain-adaptive transfer, while clients maintain personalised encoders tailored to local data. To preserve global coherence under heterogeneity, FedFusion employs similarity-weighted classifier coupling (with optional cluster-wise averaging), mitigating dominance by data-rich sites and improving minority-client performance. The frugal-labelling pipeline combines self-/semi-supervised pretext training with selective fine-tuning, reducing annotation demands without sharing raw data. Across tabular and imaging benchmarks under IID, non-IID, and label-scarce regimes, FedFusion consistently outperforms state-of-the-art baselines in accuracy, robustness, and fairness while maintaining comparable communication and computation budgets. These results show that harmonising personalisation, domain adaptation, and label efficiency is an effective recipe for robust federated learning under real-world constraints.

60. HyKid: An Open MRI Dataset with Expert-Annotated Multi-Structure and Choroid Plexus in Pediatric Hydrocephalus

Authors: Yunzhi Xu , Yushuang Ding , Hu Sun , Hongxi Zhang , Li Zhao
URL: https://arxiv.org/abs/2509.19218
Abstract:

Evaluation of hydrocephalus in children is challenging, and the related research is limited by a lack of publicly available, expert-annotated datasets, particularly those with segmentation of the choroid plexus. To address this, we present HyKid, an open-source dataset from 48 pediatric patients with hydrocephalus. 3D MRIs were provided with 1mm isotropic resolution, which was reconstructed from routine low-resolution images using a slice-to-volume algorithm. Manually corrected segmentations of brain tissues, including white matter, grey matter, lateral ventricle, external CSF, and the choroid plexus, were provided by an experienced neurologist. Additionally, structured data was extracted from clinical radiology reports using a Retrieval-Augmented Generation framework. The strong correlation between choroid plexus volume and total CSF volume provided a potential biomarker for hydrocephalus evaluation, achieving excellent performance in a predictive model (AUC = 0.87). The proposed HyKid dataset provided a high-quality benchmark for neuroimaging algorithms development, and it revealed the choroid plexus-related features in hydrocephalus assessments. Our datasets are publicly available at this https URL .

61. Steering Multimodal Large Language Models Decoding for Context-Aware Safety

Authors: Zheyuan Liu , Zhangchen Xu , Guangyao Dou , Xiangchi Yuan , Zhaoxuan Tan , Radha Poovendran , Meng Jiang
URL: https://arxiv.org/abs/2509.19212
Abstract:

Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.

62. YAC: Bridging Natural Language and Interactive Visual Exploration with Generative AI for Biomedical Data Discovery

Authors: Devin Lange , Shanghua Gao , Pengwei Sui , Austen Money , Priya Misner , Marinka Zitnik , Nils Gehlenborg
URL: https://arxiv.org/abs/2509.19182
Abstract:

Incorporating natural language input has the potential to improve the capabilities of biomedical data discovery interfaces. However, user interface elements and visualizations are still powerful tools for interacting with data, even in the new world of generative AI. In our prototype system, YAC, Yet Another Chatbot, we bridge the gap between natural language and interactive visualizations by generating structured declarative output with a multi-agent system and interpreting that output to render linked interactive visualizations and apply data filters. Furthermore, we include widgets, which allow users to adjust the values of that structured output through user interface elements. We reflect on the capabilities and design of this system with an analysis of its technical dimensions and illustrate the capabilities through four usage scenarios.

63. Soft Tokens, Hard Truths

Authors: Natasha Butt , Ariel Kwiatkowski , Ismail Labiad , Julia Kempe , Yann Ollivier
URL: https://arxiv.org/abs/2509.19170
Abstract:

The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use “soft” tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs match discrete-token CoTs for pass@1 and surpass them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the “soft” models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.

64. RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions

Authors: Yun Wang , Junjie Hu , Junhui Hou , Chenghao Zhang , Renwei Yang , Dapeng Oliver Wu
URL: https://arxiv.org/abs/2509.19165
Abstract:

Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Codes are available at \textcolor{blue}{ this https URL }.

65. Generative Propaganda

Authors: Madeleine I. G. Daepp , Alejandro Cuevas , Robert Osazuwa Ness , Vickie Yu-Ping Wang , Bharat Kumar Nayak , Dibyendu Mishra , Ti-Chung Cheng , Shaily Desai , Joyojeet Pal
URL: https://arxiv.org/abs/2509.19147
Abstract:

Generative propaganda is the use of generative artificial intelligence (AI) to shape public opinion. To characterize its use in real-world settings, we conducted interviews with defenders (e.g., factcheckers, journalists, officials) in Taiwan and creators (e.g., influencers, political consultants, advertisers) as well as defenders in India, centering two places characterized by high levels of online propaganda. The term “deepfakes”, we find, exerts outsized discursive power in shaping defenders’ expectations of misuse and, in turn, the interventions that are prioritized. To better characterize the space of generative propaganda, we develop a taxonomy that distinguishes between obvious versus hidden and promotional versus derogatory use. Deception was neither the main driver nor the main impact vector of AI’s use; instead, Indian creators sought to persuade rather than to deceive, often making AI’s use obvious in order to reduce legal and reputational risks, while Taiwan’s defenders saw deception as a subset of broader efforts to distort the prevalence of strategic narratives online. AI was useful and used, however, in producing efficiency gains in communicating across languages and modes, and in evading human and algorithmic detection. Security researchers should reconsider threat models to clearly differentiate deepfakes from promotional and obvious uses, to complement and bolster the social factors that constrain misuse by internal actors, and to counter efficiency gains globally.

66. Anecdoctoring: Automated Red-Teaming Across Language and Place

Authors: Alejandro Cuevas , Saloni Dash , Bharat Kumar Nayak , Dan Vann , Madeleine I. G. Daepp
URL: https://arxiv.org/abs/2509.19143
Abstract:

Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose “anecdoctoring”, a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.

67. On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

Authors: Sébastien Salva , Redha Taguelmimt
URL: https://arxiv.org/abs/2509.19136
Abstract:

The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising direction to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of the direct execution of NL test cases by LLM agents. This paper investigates this direction, focusing on the impact on NL test case unsoundness and on test case execution consistency. NL test cases are inherently unsound, as they may yield false failures due to ambiguous instructions or unpredictable agent behaviour. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with guardrail mechanisms and specialised agents that dynamically verify the correct execution of each test step. We introduce measures to evaluate the capabilities of LLMs in test execution and one measure to quantify execution consistency. We propose a definition of weak unsoundness to characterise contexts in which NL test case execution remains acceptable, with respect to the industrial quality levels Six Sigma. Our experimental evaluation with eight publicly available LLMs, ranging from 3B to 70B parameters, demonstrates both the potential and current limitations of current LLM agents for GUI testing. Our experiments show that Meta Llama 3.1 70B demonstrates acceptable capabilities in NL test case execution with high execution consistency (above the level 3-sigma). We provide prototype tools, test suites, and results.

68. GSTM-HMU: Generative Spatio-Temporal Modeling for Human Mobility Understanding

Authors: Wenying Luo , Zhiyuan Lin , Wenhao Xu , Minghao Liu , Zhi Li
URL: https://arxiv.org/abs/2509.19135
Abstract:

Human mobility traces, often recorded as sequences of check-ins, provide a unique window into both short-term visiting patterns and persistent lifestyle regularities. In this work we introduce GSTM-HMU, a generative spatio-temporal framework designed to advance mobility analysis by explicitly modeling the semantic and temporal complexity of human movement. The framework consists of four key innovations. First, a Spatio-Temporal Concept Encoder (STCE) integrates geographic location, POI category semantics, and periodic temporal rhythms into unified vector representations. Second, a Cognitive Trajectory Memory (CTM) adaptively filters historical visits, emphasizing recent and behaviorally salient events in order to capture user intent more effectively. Third, a Lifestyle Concept Bank (LCB) contributes structured human preference cues, such as activity types and lifestyle patterns, to enhance interpretability and personalization. Finally, task-oriented generative heads transform the learned representations into predictions for multiple downstream tasks. We conduct extensive experiments on four widely used real-world datasets, including Gowalla, WeePlace, Brightkite, and FourSquare, and evaluate performance on three benchmark tasks: next-location prediction, trajectory-user identification, and time estimation. The results demonstrate consistent and substantial improvements over strong baselines, confirming the effectiveness of GSTM-HMU in extracting semantic regularities from complex mobility data. Beyond raw performance gains, our findings also suggest that generative modeling provides a promising foundation for building more robust, interpretable, and generalizable systems for human mobility intelligence.

69. Analysis on distribution and clustering of weight

Authors: Chunming Ye , Wenquan Tian , Yalan Gao , Songzhou Li
URL: https://arxiv.org/abs/2509.19122
Abstract:

The study on architecture and parameter characteristics remains the hot topic in the research of large language models. In this paper we concern with the characteristics of weight which are used to analyze the correlations and differences between models. Two kinds of vectors-standard deviation vector and clustering vector-are proposed to describe features of models. In the first case, the weights are assumed to follow normal distribution. The standard deviation values of projection matrices are normalized to form Standard-Deviation Vector, representing the distribution characteristics of models. In the second case, the singular values from each weight projection matrix are extracted and grouped by K-Means algorithm. The grouped data with the same type matrix are combined as Clustering Vector to represent the correlation characteristics of models’ weights. The study reveals that these two vectors can effectively distinguish between different models and clearly show the similarities among models of the same family. Moreover, after conducting LoRA fine-tuning with different datasets and models, it is found that the distribution of weights represented by standard deviation vector is directly influenced by the dataset, but the correlations between different weights represented by clustering vector remain unaffected and maintain a high consistency with the pre-trained model.

70. FedFiTS: Fitness-Selected, Slotted Client Scheduling for Trustworthy Federated Learning in Healthcare AI

Authors: Ferdinand Kahenga , Antoine Bagula , Sajal K. Das , Patrick Sello
URL: https://arxiv.org/abs/2509.19120
Abstract:

Federated Learning (FL) has emerged as a powerful paradigm for privacy-preserving model training, yet deployments in sensitive domains such as healthcare face persistent challenges from non-IID data, client unreliability, and adversarial manipulation. This paper introduces FedFiTS, a trust and fairness-aware selective FL framework that advances the FedFaSt line by combining fitness-based client election with slotted aggregation. FedFiTS implements a three-phase participation strategy-free-for-all training, natural selection, and slotted team participation-augmented with dynamic client scoring, adaptive thresholding, and cohort-based scheduling to balance convergence efficiency with robustness. A theoretical convergence analysis establishes bounds for both convex and non-convex objectives under standard assumptions, while a communication-complexity analysis shows reductions relative to FedAvg and other baselines. Experiments on diverse datasets-medical imaging (X-ray pneumonia), vision benchmarks (MNIST, FMNIST), and tabular agricultural data (Crop Recommendation)-demonstrate that FedFiTS consistently outperforms FedAvg, FedRand, and FedPow in accuracy, time-to-target, and resilience to poisoning attacks. By integrating trust-aware aggregation with fairness-oriented client selection, FedFiTS advances scalable and secure FL, making it well suited for real-world healthcare and cross-domain deployments.

71. Towards Practical Multi-label Causal Discovery in High-Dimensional Event Sequences via One-Shot Graph Aggregation

Authors: Hugo Math , Rainer Lienhart
URL: https://arxiv.org/abs/2509.19112
Abstract:

Understanding causality in event sequences where outcome labels such as diseases or system failures arise from preceding events like symptoms or error codes is critical. Yet remains an unsolved challenge across domains like healthcare or vehicle diagnostics. We introduce CARGO, a scalable multi-label causal discovery method for sparse, high-dimensional event sequences comprising of thousands of unique event types. Using two pretrained causal Transformers as domain-specific foundation models for event sequences. CARGO infers in parallel, per sequence one-shot causal graphs and aggregates them using an adaptive frequency fusion to reconstruct the global Markov boundaries of labels. This two-stage approach enables efficient probabilistic reasoning at scale while bypassing the intractable cost of full-dataset conditional independence testing. Our results on a challenging real-world automotive fault prediction dataset with over 29,100 unique event types and 474 imbalanced labels demonstrate CARGO’s ability to perform structured reasoning.

72. FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

Authors: Hongli Xu , Lei Zhang , Xiaoyue Hu , Boyang Zhong , Kaixin Bai , Zoltán-Csaba Márton , Zhenshan Bing , Zhaopeng Chen , Alois Christian Knoll , Jianwei Zhang
URL: https://arxiv.org/abs/2509.19102
Abstract:

General-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object centric and action centric diffusion policy FuncDiffuser trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website this https URL .

73. Algorithms for Adversarially Robust Deep Learning

Authors: Alexander Robey
URL: https://arxiv.org/abs/2509.19100
Abstract:

Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.

74. Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

Authors: Alireza Salemi , Cheng Li , Mingyang Zhang , Qiaozhu Mei , Zhuowan Li , Spurthi Amba Hombaiah , Weize Kong , Tao Chen , Hamed Zamani , Michael Bendersky
URL: https://arxiv.org/abs/2509.19094
Abstract:

Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.

75. Training Flow Matching Models with Reliable Labels via Self-Purification

Authors: Hyeongju Kim , Yechan Yu , June Young Yi , Juheon Lee
URL: https://arxiv.org/abs/2509.19091
Abstract:

Training datasets are inherently imperfect, often containing mislabeled samples due to human annotation errors, limitations of tagging models, and other sources of noise. Such label contamination can significantly degrade the performance of a trained model. In this work, we introduce Self-Purifying Flow Matching (SPFM), a principled approach to filtering unreliable data within the flow-matching framework. SPFM identifies suspicious data using the model itself during the training process, bypassing the need for pretrained models or additional modules. Our experiments demonstrate that models trained with SPFM generate samples that accurately adhere to the specified conditioning, even when trained on noisy labels. Furthermore, we validate the robustness of SPFM on the TITW dataset, which consists of in-the-wild speech data, achieving performance that surpasses existing baselines.

76. Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Authors: Guoxin Wang , Jun Zhao , Xinyi Liu , Yanbo Liu , Xuyang Cao , Chao Li , Zhuoyun Liu , Qintian Sun , Fangru Zhou , Haoqiang Xing , Zhenhong Yang
URL: https://arxiv.org/abs/2509.19090
Abstract:

Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

77. A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

Authors: Tiany Peng , George Gui , Daniel J. Merlau , Grace Jiarui Fan , Malek Ben Sliman , Melanie Brucks , Eric J. Johnson , Vicki Morwitz , Abdullah Althenayyan , Silvia Bellezza , Dante Donati , Hortense Fong , Elizabeth Friedman , Ariana Guevara , Mohamed Hussein , Kinshuk Jerath , Bruce Kogut , Kristen Lane , Hannah Li , Patryk Perkowski , Oded Netzer , Olivier Toubia
URL: https://arxiv.org/abs/2509.19088
Abstract:

Do “digital twins” capture individual responses in surveys and experiments? We run 19 pre-registered studies on a national U.S. panel and their LLM-powered digital twins (constructed based on previously-collected extensive individual-level data) and compare twin and human answers across 164 outcomes. The correlation between twin and human answers is modest (approximately 0.2 on average) and twin responses are less variable than human responses. While constructing digital twins based on rich individual-level data improves our ability to capture heterogeneity across participants and predict relative differences between them, it does not substantially improve our ability to predict the exact answers given by specific participants or enhance predictions of population means. Twin performance varies by domain and is higher among more educated, higher-income, and ideologically moderate participants. These results suggest current digital twins can capture some degree of relative differences but are unreliable for individual-level predictions and sample mean and variance estimation, underscoring the need for careful validation before use. Our data and code are publicly available for researchers and practitioners interested in optimizing digital twin pipelines.

78. Graph Neural Networks with Similarity-Navigated Probabilistic Feature Copying

Authors: Asela Hevapathige
URL: https://arxiv.org/abs/2509.19084
Abstract:

Graph Neural Networks (GNNs) have demonstrated remarkable success across various graph-based tasks. However, they face some fundamental limitations: feature oversmoothing can cause node representations to become indistinguishable in deeper networks, they struggle to effectively manage heterogeneous relationships where connected nodes differ significantly, and they process entire feature vectors as indivisible units, which limits flexibility. We seek to address these limitations. We propose AxelGNN, a novel GNN architecture inspired by Axelrod’s cultural dissemination model that addresses these limitations through a unified framework. AxelGNN incorporates similarity-gated probabilistic interactions that adaptively promote convergence or divergence based on node similarity, implements trait-level copying mechanisms for fine-grained feature aggregation at the segment level, and maintains global polarization to preserve node distinctiveness across multiple representation clusters. The model’s bistable convergence dynamics naturally handle both homophilic and heterophilic graphs within a single architecture. Extensive experiments on node classification and influence estimation benchmarks demonstrate that AxelGNN consistently outperforms or matches state-of-the-art GNN methods across diverse graph structures with varying homophily-heterophily characteristics.

Authors: Zhennan Jiang , Kai Liu , Yuxin Qin , Shuai Tian , Yupeng Zheng , Mingcai Zhou , Chao Yu , Haoran Li , Dongbin Zhao
URL: https://arxiv.org/abs/2509.19080
Abstract:

Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine polices to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be combined to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines. More visualization results are available at this https URL .

80. Beyond Backpropagation: Exploring Innovative Algorithms for Energy-Efficient Deep Neural Network Training

Authors: Przemysław Spyra
URL: https://arxiv.org/abs/2509.19063
Abstract:

The rising computational and energy demands of deep neural networks (DNNs), driven largely by backpropagation (BP), challenge sustainable AI development. This paper rigorously investigates three BP-free training methods: the Forward-Forward (FF), Cascaded-Forward (CaFo), and Mono-Forward (MF) algorithms, tracing their progression from foundational concepts to a demonstrably superior solution. A robust comparative framework was established: each algorithm was implemented on its native architecture (MLPs for FF and MF, a CNN for CaFo) and benchmarked against an equivalent BP-trained model. Hyperparameters were optimized with Optuna, and consistent early stopping criteria were applied based on validation performance, ensuring all models were optimally tuned before comparison. Results show that MF not only competes with but consistently surpasses BP in classification accuracy on its native MLPs. Its superior generalization stems from converging to a more favorable minimum in the validation loss landscape, challenging the assumption that global optimization is required for state-of-the-art results. Measured at the hardware level using the NVIDIA Management Library (NVML) API, MF reduces energy consumption by up to 41% and shortens training time by up to 34%, translating to a measurably smaller carbon footprint as estimated by CodeCarbon. Beyond this primary result, we present a hardware-level analysis that explains the efficiency gains: exposing FF’s architectural inefficiencies, validating MF’s computationally lean design, and challenging the assumption that all BP-free methods are inherently more memory-efficient. By documenting the evolution from FF’s conceptual groundwork to MF’s synthesis of accuracy and sustainability, this work offers a clear, data-driven roadmap for future energy-efficient deep learning.

81. Reduced-Order Model-Guided Reinforcement Learning for Demonstration-Free Humanoid Locomotion

Authors: Shuai Liu , Meng Cheng Lau
URL: https://arxiv.org/abs/2509.19023
Abstract:

We introduce Reduced-Order Model-Guided Reinforcement Learning (ROM-GRL), a two-stage reinforcement learning framework for humanoid walking that requires no motion capture data or elaborate reward shaping. In the first stage, a compact 4-DOF (four-degree-of-freedom) reduced-order model (ROM) is trained via Proximal Policy Optimization. This generates energy-efficient gait templates. In the second stage, those dynamically consistent trajectories guide a full-body policy trained with Soft Actor–Critic augmented by an adversarial discriminator, ensuring the student’s five-dimensional gait feature distribution matches the ROM’s demonstrations. Experiments at 1 meter-per-second and 4 meter-per-second show that ROM-GRL produces stable, symmetric gaits with substantially lower tracking error than a pure-reward baseline. By distilling lightweight ROM guidance into high-dimensional policies, ROM-GRL bridges the gap between reward-only and imitation-based locomotion methods, enabling versatile, naturalistic humanoid behaviors without any human demonstrations.

82. Fully Learnable Neural Reward Machines

Authors: Hazem Dewidar , Elena Umili
URL: https://arxiv.org/abs/2509.19017
Abstract:

Non-Markovian Reinforcement Learning (RL) tasks present significant challenges, as agents must reason over entire trajectories of state-action pairs to make optimal decisions. A common strategy to address this is through symbolic formalisms, such as Linear Temporal Logic (LTL) or automata, which provide a structured way to express temporally extended objectives. However, these approaches often rely on restrictive assumptions – such as the availability of a predefined Symbol Grounding (SG) function mapping raw observations to high-level symbolic representations, or prior knowledge of the temporal task. In this work, we propose a fully learnable version of Neural Reward Machines (NRM), which can learn both the SG function and the automaton end-to-end, removing any reliance on prior knowledge. Our approach is therefore as easily applicable as classic deep RL (DRL) approaches, while being far more explainable, because of the finite and compact nature of automata. Furthermore, we show that by integrating Fully Learnable Reward Machines (FLNRM) with DRL, our method outperforms previous approaches based on Recurrent Neural Networks (RNNs).

83. Pure Vision Language Action (VLA) Models: A Comprehensive Survey

Authors: Dapeng Zhang , Jin Sun , Chenghui Hu , Xiaoyan Wu , Zhenlong Yuan , Rui Zhou , Fei Shen , Qingguo Zhou
URL: https://arxiv.org/abs/2509.19012
Abstract:

The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents for manipulation and decision-making in complex, dynamic environments. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. It presents a comprehensive analysis of VLA applications across different scenarios and classifies VLA approaches into several paradigms: autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods; while examining their motivations, core strategies, and implementations in detail. In addition, foundational datasets, benchmarks, and simulation platforms are introduced. Building on the current VLA landscape, the review further proposes perspectives on key challenges and future directions to advance research in VLA models and generalizable robotics. By synthesizing insights from over three hundred recent studies, this survey maps the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.

84. VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Authors: Hao Wang , Eiki Murata , Lingfang Zhang , Ayako Sato , So Fukuda , Ziqi Yin , Wentao Hu , Keisuke Nakao , Yusuke Nakamura , Sebastian Zwirner , Yi-Chia Chen , Hiroyuki Otomo , Hiroki Ouchi , Daisuke Kawahara
URL: https://arxiv.org/abs/2509.19002
Abstract:

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs’ geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent’s markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

85. Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations

Authors: Hanqing Liu , Jiahuan Long , Junqi Wu , Jiacheng Hou , Huili Tang , Tingsong Jiang , Weien Zhou , Wen Yao
URL: https://arxiv.org/abs/2509.18953
Abstract:

Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework that systematically evaluates the robustness of VLA models by transforming discrete physical variations into continuous optimization problems. However, comprehensively assessing VLA robustness presents two key challenges: (1) how to systematically characterize diverse physical variations encountered in real-world deployments while maintaining evaluation reproducibility, and (2) how to discover worst-case scenarios without prohibitive real-world data collection costs efficiently. To address the first challenge, we decompose real-world variations into three critical domains: object 3D transformations that affect spatial reasoning, illumination variations that challenge visual perception, and adversarial patches that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization framework that transforms discrete physical variations into parameter optimization, enabling systematic exploration of worst-case scenarios. Extensive experiments on state-of-the-art OpenVLA models across multiple benchmarks reveal alarming vulnerabilities: all variation types trigger failure rates exceeding 60%, with object transformations causing up to 97.8% failure in long-horizon tasks. Our findings expose critical gaps between controlled laboratory success and unpredictable deployment readiness, while the Eva-VLA framework provides a practical pathway for hardening VLA-based robotic manipulation models against real-world deployment challenges.

86. Towards Privacy-Aware Bayesian Networks: A Credal Approach

Authors: Niccolò Rocchi , Fabio Stella , Cassio de Campos
URL: https://arxiv.org/abs/2509.18949
Abstract:

Bayesian networks (BN) are probabilistic graphical models that enable efficient knowledge representation and inference. These have proven effective across diverse domains, including healthcare, bioinformatics and economics. The structure and parameters of a BN can be obtained by domain experts or directly learned from available data. However, as privacy concerns escalate, it becomes increasingly critical for publicly released models to safeguard sensitive information in training data. Typically, released models do not prioritize privacy by design. In particular, tracing attacks from adversaries can combine the released BN with auxiliary data to determine whether specific individuals belong to the data from which the BN was learned. State-of-the-art protection tecniques involve introducing noise into the learned parameters. While this offers robust protection against tracing attacks, it significantly impacts the model’s utility, in terms of both the significance and accuracy of the resulting inferences. Hence, high privacy may be attained at the cost of releasing a possibly ineffective model. This paper introduces credal networks (CN) as a novel solution for balancing the model’s privacy and utility. After adapting the notion of tracing attacks, we demonstrate that a CN enables the masking of the learned BN, thereby reducing the probability of successful attacks. As CNs are obfuscated but not noisy versions of BNs, they can achieve meaningful inferences while safeguarding privacy. Moreover, we identify key learning information that must be concealed to prevent attackers from recovering the underlying BN. Finally, we conduct a set of numerical experiments to analyze how privacy gains can be modulated by tuning the CN hyperparameters. Our results confirm that CNs provide a principled, practical, and effective approach towards the development of privacy-aware probabilistic graphical models.

87. No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

Authors: Matheus Vinícius Todescato , Joel Luís Carbonera
URL: https://arxiv.org/abs/2509.18938
Abstract:

While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.

88. Accurate and Efficient Prediction of Wi-Fi Link Quality Based on Machine Learning

Authors: Gabriele Formis , Gianluca Cena , Lukasz Wisniewski , Stefano Scanzio
URL: https://arxiv.org/abs/2509.18933
Abstract:

Wireless communications are characterized by their unpredictability, posing challenges for maintaining consistent communication quality. This paper presents a comprehensive analysis of various prediction models, with a focus on achieving accurate and efficient Wi-Fi link quality forecasts using machine learning techniques. Specifically, the paper evaluates the performance of data-driven models based on the linear combination of exponential moving averages, which are designed for low-complexity implementations and are then suitable for hardware platforms with limited processing resources. Accuracy of the proposed approaches was assessed using experimental data from a real-world Wi-Fi testbed, considering both channel-dependent and channel-independent training data. Remarkably, channel-independent models, which allow for generalized training by equipment manufacturers, demonstrated competitive performance. Overall, this study provides insights into the practical deployment of machine learning-based prediction models for enhancing Wi-Fi dependability in industrial environments.

89. Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

Authors: Alex Schutz , Victor-Alexandru Darvariu , Efimia Panagiotaki , Bruno Lacerda , Nick Hawes
URL: https://arxiv.org/abs/2509.18930
Abstract:

Neural Algorithmic Reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov Decision Process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.

90. LiDAR Point Cloud Image-based Generation Using Denoising Diffusion Probabilistic Models

Authors: Amirhesam Aghanouri , Cristina Olaverri-Monreal
URL: https://arxiv.org/abs/2509.18917
Abstract:

Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among sensors AVs use to create a comprehensive view of surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications impact the denoising process and the model’s temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model’s superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.

91. The AI Literacy Heptagon: A Structured Approach to AI Literacy in Higher Education

Authors: Veronika Hackl , Alexandra Mueller , Maximilian Sailer
URL: https://arxiv.org/abs/2509.18900
Abstract:

The integrative literature review addresses the conceptualization and implementation of AI Literacy (AIL) in Higher Education (HE) by examining recent research literature. Through an analysis of publications (2021-2024), we explore (1) how AIL is defined and conceptualized in current research, particularly in HE, and how it can be delineated from related concepts such as Data Literacy, Media Literacy, and Computational Literacy; (2) how various definitions can be synthesized into a comprehensive working definition, and (3) how scientific insights can be effectively translated into educational practice. Our analysis identifies seven central dimensions of AIL: technical, applicational, critical thinking, ethical, social, integrational, and legal. These are synthesized in the AI Literacy Heptagon, deepening conceptual understanding and supporting the structured development of AIL in HE. The study aims to bridge the gap between theoretical AIL conceptualizations and the practical implementation in academic curricula.

92. Diversity Boosts AI-Generated Text Detection

Authors: Advik Raj Basani , Pin-Yu Chen
URL: https://arxiv.org/abs/2509.18880
Abstract:

Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.

93. When Ads Become Profiles: Large-Scale Audit of Algorithmic Biases and LLM Profiling Risks

Authors: Baiyu Chen , Benjamin Tag , Hao Xue , Daniel Angus , Flora Salim
URL: https://arxiv.org/abs/2509.18874
Abstract:

Automated ad targeting on social media is opaque, creating risks of exploitation and invisibility to external scrutiny. Users may be steered toward harmful content while independent auditing of these processes remains blocked. Large Language Models (LLMs) raise a new concern: the potential to reverse-engineer sensitive user attributes from exposure alone. We introduce a multi-stage auditing framework to investigate these risks. First, a large-scale audit of over 435,000 ad impressions delivered to 891 Australian Facebook users reveals algorithmic biases, including disproportionate Gambling and Politics ads shown to socioeconomically vulnerable and politically aligned groups. Second, a multimodal LLM can reconstruct users’ demographic profiles from ad streams, outperforming census-based baselines and matching or exceeding human performance. Our results provide the first empirical evidence that ad streams constitute rich digital footprints for public AI inference, highlighting urgent privacy risks and the need for content-level auditing and governance.

94. NGRPO: Negative-enhanced Group Relative Policy Optimization

Authors: Gongrui Nan , Siye Chen , Jing Huang , Mengyu Lu , Dexun Wang , Chunmei Xie , Weiqi Xiong , Xianzhou Zeng , Qixuan Zhou , Yadong Li , Xingzhong Xu
URL: https://arxiv.org/abs/2509.18851
Abstract:

RLVR has enhanced the reasoning capabilities of Large Language Models (LLMs) across various tasks. However, GRPO, a representative RLVR algorithm, suffers from a critical limitation: when all responses within a group are either entirely correct or entirely incorrect, the model fails to learn from these homogeneous responses. This is particularly problematic for homogeneously incorrect groups, where GRPO’s advantage function yields a value of zero, leading to null gradients and the loss of valuable learning signals. To overcome this issue, we propose NGRPO (Negative-enhanced Group Relative Policy Optimization), an algorithm designed to convert homogeneous errors into robust learning signals. First, NGRPO introduces Advantage Calibration. This mechanism hypothesizes the existence of a virtual maximum-reward sample during advantage calculation, thereby altering the mean and variance of rewards within a group and ensuring that the advantages for homogeneously incorrect samples are no longer zero. Second, NGRPO employs Asymmetric Clipping, which relaxes the update magnitude for positive samples while imposing stricter constraints on that of negative samples. This serves to stabilize the exploration pressure introduced by the advantage calibration. Our experiments on Qwen2.5-Math-7B demonstrate that NGRPO significantly outperforms baselines such as PPO, GRPO, DAPO, and PSR-NSR on mathematical benchmarks including MATH500, AMC23, and AIME2025. These results validate NGRPO’s ability to learn from homogeneous errors, leading to stable and substantial improvements in mathematical reasoning. Our code is available at this https URL .

95. Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Authors: Junhao Su , Yuanliang Wan , Junwei Yang , Hengyu Shi , Tianyang Han , Junfeng Luo , Yurui Qiu
URL: https://arxiv.org/abs/2509.18847
Abstract:

Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to ‘think more’ instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.

96. Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

Authors: Pin-Yen Chiu , I-Sheng Fang , Jun-Cheng Chen
URL: https://arxiv.org/abs/2509.18831
Abstract:

Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory usage to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5$\times$ faster training than Concept Slider and 47$\times$ faster than Attribute Control, while reducing GPU memory usage by nearly 2$\times$ and 4$\times$, respectively.

97. A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising

Authors: Kuang Xiaodong , Li Bingxuan , Li Yuan , Rao Fan , Ma Gege , Xie Qingguo , Mok Greta S P , Liu Huafeng , Zhu Wentao
URL: https://arxiv.org/abs/2509.18801
Abstract:

Achieving high image quality for temporal frames in dynamic positron emission tomography (PET) is challenging due to the limited statistic especially for the short frames. Recent studies have shown that deep learning (DL) is useful in a wide range of medical image denoising tasks. In this paper, we propose a model-based neural network for dynamic PET image denoising. The inter-frame spatial correlation and intra-frame structural consistency in dynamic PET are used to establish the kernel space-based multidimensional sparse (KMDS) model. We then substitute the inherent forms of the parameter estimation with neural networks to enable adaptive parameters optimization, forming the end-to-end neural KMDS-Net. Extensive experimental results from simulated and real data demonstrate that the neural KMDS-Net exhibits strong denoising performance for dynamic PET, outperforming previous baseline methods. The proposed method may be used to effectively achieve high temporal and spatial resolution for dynamic PET. Our source code is available at this https URL .

98. Detection of security smells in IaC scripts through semantics-aware code and language processing

Authors: Aicha War , Adnan A. Rawass , Abdoul K. Kabore , Jordan Samhi , Jacques Klein , Tegawende F. Bissyande
URL: https://arxiv.org/abs/2509.18790
Abstract:

Infrastructure as Code (IaC) automates the provisioning and management of IT infrastructure through scripts and tools, streamlining software deployment. Prior studies have shown that IaC scripts often contain recurring security misconfigurations, and several detection and mitigation approaches have been proposed. Most of these rely on static analysis, using statistical code representations or Machine Learning (ML) classifiers to distinguish insecure configurations from safe code. In this work, we introduce a novel approach that enhances static analysis with semantic understanding by jointly leveraging natural language and code representations. Our method builds on two complementary ML models: CodeBERT, to capture semantics across code and text, and LongFormer, to represent long IaC scripts without losing contextual information. We evaluate our approach on misconfiguration datasets from two widely used IaC tools, Ansible and Puppet. To validate its effectiveness, we conduct two ablation studies (removing code text from the natural language input and truncating scripts to reduce context) and compare against four large language models (LLMs) and prior work. Results show that semantic enrichment substantially improves detection, raising precision and recall from 0.46 and 0.79 to 0.92 and 0.88 on Ansible, and from 0.55 and 0.97 to 0.87 and 0.75 on Puppet, respectively.

99. VGGT-DP: Generalizable Robot Control via Vision Foundation Models

Authors: Shijia Ge , Yinxin Zhang , Shuzhao Xie , Weixiang Zhang , Mingcai Zhou , Zhi Wang
URL: https://arxiv.org/abs/2509.18778
Abstract:

Visual imitation learning frameworks allow robots to learn manipulation skills from expert demonstrations. While existing approaches mainly focus on policy design, they often neglect the structure and capacity of visual encoders, limiting spatial understanding and generalization. Inspired by biological vision systems, which rely on both visual and proprioceptive cues for robust control, we propose VGGT-DP, a visuomotor policy framework that integrates geometric priors from a pretrained 3D perception model with proprioceptive feedback. We adopt the Visual Geometry Grounded Transformer (VGGT) as the visual encoder and introduce a proprioception-guided visual learning strategy to align perception with internal robot states, improving spatial grounding and closed-loop control. To reduce inference latency, we design a frame-wise token reuse mechanism that compacts multi-view tokens into an efficient spatial representation. We further apply random token pruning to enhance policy robustness and reduce overfitting. Experiments on challenging MetaWorld tasks show that VGGT-DP significantly outperforms strong baselines such as DP and DP3, particularly in precision-critical and long-horizon scenarios.

100. AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Authors: Chen Liang , Zhaoqi Huang , Haofen Wang , Fu Chai , Chunying Yu , Huanhuan Wei , Zhengjie Liu , Yanpeng Li , Hongjun Wang , Ruifeng Luo , Xianzhong Zhao
URL: https://arxiv.org/abs/2509.18776
Abstract:

Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from codes retrieval to specialized documents generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.

101. Financial Risk Relation Identification through Dual-view Adaptation

Authors: Wei-Ning Chiu , Yu-Hsiang Wang , Andy Hsiao , Yu-Shiang Huang , Chuan-Ju Wang
URL: https://arxiv.org/abs/2509.18775
Abstract:

A multitude of interconnected risk events – ranging from regulatory changes to geopolitical tensions – can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are, however, subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings – authoritative, standardized financial documents – as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparency, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings.

102. DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision

Authors: Azad Singh , Deepak Mishra
URL: https://arxiv.org/abs/2509.18765
Abstract:

Self-supervised learning (SSL) has emerged as a powerful paradigm for medical image representation learning, particularly in settings with limited labeled data. However, existing SSL methods often rely on complex architectures, anatomy-specific priors, or heavily tuned augmentations, which limit their scalability and generalizability. More critically, these models are prone to shortcut learning, especially in modalities like chest X-rays, where anatomical similarity is high and pathology is subtle. In this work, we introduce DiSSECT – Discrete Self-Supervision for Efficient Clinical Transferable Representations, a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. This constrains the model to learn repeatable, structure-aware features while suppressing view-specific or low-utility patterns, improving representation transfer across tasks and domains. DiSSECT achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning, and shows particularly high label efficiency in low-label regimes. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability compared to existing state-of-the-art approaches.

103. When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

Authors: Yingming Zheng , Hanqi Li , Kai Yu , Lu Chen
URL: https://arxiv.org/abs/2509.18762
Abstract:

Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.

104. Security smells in infrastructure as code: a taxonomy update beyond the seven sins

Authors: Aicha War , Serge L.B. Nikiema , Jordan Samhi , Jacques Klein , Tegawende F. Bissyande
URL: https://arxiv.org/abs/2509.18761
Abstract:

Infrastructure as Code (IaC) has become essential for modern software management, yet security flaws in IaC scripts can have severe consequences, as exemplified by the recurring exploits of Cloud Web Services. Prior work has recognized the need to build a precise taxonomy of security smells in IaC scripts as a first step towards developing approaches to improve IaC security. This first effort led to the unveiling of seven sins, limited by the focus on a single IaC tool as well as by the extensive, and potentially biased, manual effort that was required. We propose, in our work, to revisit this taxonomy: first, we extend the study of IaC security smells to a more diverse dataset with scripts associated with seven popular IaC tools, including Terraform, Ansible, Chef, Puppet, Pulumi, Saltstack, and Vagrant; second, we bring in some automation for the analysis by relying on an LLM. While we leverage LLMs for initial pattern processing, all taxonomic decisions underwent systematic human validation and reconciliation with established security standards. Our study yields a comprehensive taxonomy of 62 security smell categories, significantly expanding beyond the previously known seven. We demonstrate actionability by implementing new security checking rules within linters for seven popular IaC tools, often achieving 1.00 precision score. Our evolution study of security smells in GitHub projects reveals that these issues persist for extended periods, likely due to inadequate detection and mitigation tools. This work provides IaC practitioners with insights for addressing common security smells and systematically adopting DevSecOps practices to build safer infrastructure code.

105. Complexity of Activity Patterns in a Bio-Inspired Hopfield-Type Network in Different Topologies

Authors: Marco Cafiso , Paolo Paradisi
URL: https://arxiv.org/abs/2509.18758
Abstract:

Neural network models capable of storing memory have been extensively studied in computer science and computational neuroscience. The Hopfield network is a prototypical example of a model designed for associative, or content-addressable, memory and has been analyzed in many forms. Further, ideas and methods from complex network theory have been incorporated into artificial neural networks and learning, emphasizing their structural properties. Nevertheless, the temporal dynamics also play a vital role in biological neural networks, whose temporal structure is a crucial feature to examine. Biological neural networks display complex intermittency and, thus, can be studied through the lens of the temporal complexity (TC) theory. The TC approach look at the metastability of self-organized states, characterized by a power-law decay in the inter-event time distribution and in the total activity distribution or a scaling behavior in the corresponding event-driven diffusion processes. In this study, we present a temporal complexity (TC) analysis of a biologically-inspired Hopfield-type neural network model. We conducted a comparative assessment between scale-free and random network topologies, with particular emphasis on their global activation patterns. Our parametric analysis revealed comparable dynamical behaviors across both neural network architectures. Furthermore, our investigation into temporal complexity characteristics uncovered that seemingly distinct dynamical patterns exhibit similar temporal complexity behaviors. In particular, similar power-law decay in the activity distribution and similar complexity levels are observed in both topologies, but with a much reduced noise in the scale-free topology. Notably, most of the complex dynamical profiles were consistently observed in scale-free network configurations, thus confirming the crucial role of hubs in neural network dynamics.

106. MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning

Authors: Omar Rayyan , John Abanes , Mahmoud Hafez , Anthony Tzes , Fares Abu-Dakka
URL: https://arxiv.org/abs/2509.18757
Abstract:

Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation methods for data collection. However, their reliance solely on first-person view wrist-mounted cameras often creates limitations in capturing sufficient scene contexts. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that our MV-UMI framework improves performance in sub-tasks requiring broad scene understanding by approximately 47% across 3 tasks, confirming the effectiveness of our approach in expanding the range of feasible manipulation tasks that can be learned using handheld gripper systems, without compromising the cross-embodiment advantages inherent to such systems.

107. COLT: Enhancing Video Large Language Models with Continual Tool Usage

Authors: Yuyang Liu , Xinyuan Shi , Bang Yang , Peilin Zhou , Jiahua Dong , Long Chen , Ian Reid , Xiaondan Liang
URL: https://arxiv.org/abs/2509.18754
Abstract:

The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering ‘catastrophic forgetting’ of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.

108. A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications

Authors: Zhenyu Tao , Wei Xu , Xiaohu You
URL: https://arxiv.org/abs/2509.18714
Abstract:

The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.

109. MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

Authors: Yizhe Huang , Yang Liu , Ruiyu Zhao , Xiaolong Zhong , Xingming Yue , Ling Jiang
URL: https://arxiv.org/abs/2509.18713
Abstract:

Large Language Model-based agents(LLM-based agents) are increasingly deployed in customer service, yet they often forget across sessions, repeat errors, and lack mechanisms for continual self-improvement. This makes them unreliable in dynamic settings where stability and consistency are critical. To better evaluate these properties, we emphasize two indicators: task success rate as a measure of overall effectiveness, and consistency metrics such as Pass$^k$ to capture reliability across multiple trials. To address the limitations of existing approaches, we propose MemOrb, a lightweight and plug-and-play verbal reinforcement memory layer that distills multi-turn interactions into compact strategy reflections. These reflections are stored in a shared memory bank and retrieved to guide decision-making, without requiring any fine-tuning. Experiments show that MemOrb significantly improves both success rate and stability, achieving up to a 63 percentage-point gain in multi-turn success rate and delivering more consistent performance across repeated trials. Our results demonstrate that structured reflection is a powerful mechanism for enhancing long-term reliability of frozen LLM agents in customer service scenarios.

110. RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

Authors: Ke Li , Di Wang , Ting Wang , Fuyu Dong , Yiming Zhang , Luyao Zhang , Xiangyu Wang , Shaofeng Li , Quan Wang
URL: https://arxiv.org/abs/2509.18711
Abstract:

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts to leverage generic foundation models for open-vocabulary RSVG, they overly rely on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose \textbf{RSVG-ZeroOV}, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention\footnote[1]{In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.}maps that capture semantic correlations between text queries and visual regions. (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.

111. An overview of neural architectures for self-supervised audio representation learning from masked spectrograms

Authors: Sarthak Yadav , Sergios Theodoridis , Zheng-Hua Tan
URL: https://arxiv.org/abs/2509.18691
Abstract:

In recent years, self-supervised learning has amassed significant interest for training deep neural representations without labeled data. One such self-supervised learning approach is masked spectrogram modeling, where the objective is to learn semantically rich contextual representations by predicting removed or hidden portions of the input audio spectrogram. With the Transformer neural architecture at its core, masked spectrogram modeling has emerged as the prominent approach for learning general purpose audio representations, a.k.a. audio foundation models. Meanwhile, addressing the issues of the Transformer architecture, in particular the underlying Scaled Dot-product Attention operation, which scales quadratically with input sequence length, has led to renewed interest in recurrent sequence modeling approaches. Among them, Selective structured state space models (such as Mamba) and extended Long Short-Term Memory (xLSTM) are the two most promising approaches which have experienced widespread adoption. While the body of work on these two topics continues to grow, there is currently a lack of an adequate overview encompassing the intersection of these topics. In this paper, we present a comprehensive overview of the aforementioned research domains, covering masked spectrogram modeling and the previously mentioned neural sequence modeling architectures, Mamba and xLSTM. Further, we compare Transformers, Mamba and xLSTM based masked spectrogram models in a unified, reproducible framework on ten diverse downstream audio classification tasks, which will help interested readers to make informed decisions regarding suitability of the evaluated approaches to adjacent applications.

112. LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection

Authors: Lanhu Wu , Zilin Gao , Hao Fei , Mong-Li Lee , Wynne Hsu
URL: https://arxiv.org/abs/2509.18683
Abstract:

RGB-D salient object detection (SOD) aims to identify the most conspicuous objects in a scene with the incorporation of depth cues. Existing methods mainly rely on CNNs, limited by the local receptive fields, or Vision Transformers that suffer from the cost of quadratic complexity, posing a challenge in balancing performance and computational efficiency. Recently, state space models (SSM), Mamba, have shown great potential for modeling long-range dependency with linear complexity. However, directly applying SSM to RGB-D SOD may lead to deficient local semantics as well as the inadequate cross-modality fusion. To address these issues, we propose a Local Emphatic and Adaptive Fusion state space model (LEAF-Mamba) that contains two novel components: 1) a local emphatic state space module (LE-SSM) to capture multi-scale local dependencies for both modalities. 2) an SSM-based adaptive fusion module (AFM) for complementary cross-modality interaction and reliable cross-modality integration. Extensive experiments demonstrate that the LEAF-Mamba consistently outperforms 16 state-of-the-art RGB-D SOD methods in both efficacy and efficiency. Moreover, our method can achieve excellent performance on the RGB-T SOD task, proving a powerful generalization ability.

113. NaviSense: A Multimodal Assistive Mobile application for Object Retrieval by Persons with Visual Impairment

Authors: Ajay Narayanan Sridhar (1), Fuli Qiao (1), Nelson Daniel Troncoso Aldas (2), Yanpei Shi (3), Mehrdad Mahdavi (1), Laurent Itti (3), Vijaykrishnan Narayanan (1) ((1) The Pennsylvania State University, (2) Independent Researcher, (3) University of Southern California)
URL: https://arxiv.org/abs/2509.18672
Abstract:

People with visual impairments often face significant challenges in locating and retrieving objects in their surroundings. Existing assistive technologies present a trade-off: systems that offer precise guidance typically require pre-scanning or support only fixed object categories, while those with open-world object recognition lack spatial feedback for reaching the object. To address this gap, we introduce ‘NaviSense’, a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR to support open-world object detection with real-time audio-haptic guidance. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target without needing prior setup. Designed with insights from a formative study and evaluated with 12 blind and low-vision participants, NaviSense significantly reduced object retrieval time and was preferred over existing tools, demonstrating the value of integrating open-world perception with precise, accessible guidance.

114. SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

Authors: Yarden As , Chengrui Qu , Benjamin Unger , Dongho Kang , Max van der Hart , Laixi Shi , Stelian Coros , Adam Wierman , Andreas Krause
URL: https://arxiv.org/abs/2509.18648
Abstract:

Safety remains a major concern for deploying reinforcement learning (RL) in real-world applications. Simulators provide safe, scalable training environments, but the inevitable sim-to-real gap introduces additional safety concerns, as policies must satisfy constraints in real-world conditions that differ from simulation. To address this challenge, robust safe RL techniques offer principled methods, but are often incompatible with standard scalable training pipelines. In contrast, domain randomization, a simple and popular sim-to-real technique, stands out as a promising alternative, although it often results in unsafe behaviors in practice. We present SPiDR, short for Sim-to-real via Pessimistic Domain Randomization – a scalable algorithm with provable guarantees for safe sim-to-real transfer. SPiDR uses domain randomization to incorporate the uncertainty about the sim-to-real gap into the safety constraints, making it versatile and highly compatible with existing training pipelines. Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.

115. Do You Need Proprioceptive States in Visuomotor Policies?

Authors: Juntu Zhao , Wenbo Lu , Di Zhang , Yufeng Liu , Yushen Liang , Tianluo Zhang , Yifeng Cao , Junyuan Xie , Yingdong Hu , Shengjie Wang , Junliang Guo , Dequan Wang , Yang Gao
URL: https://arxiv.org/abs/2509.18644
Abstract:

Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. On the contrary, we propose the State-free Policy, removing the proprioceptive state input and predicting actions only conditioned on visual observations. The State-free Policy is built in the relative end-effector action space, and should ensure the full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0\% to 85\% in height generalization and from 6\% to 64\% in horizontal generalization. Furthermore, they also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.

116. Learning neuroimaging models from health system-scale data

Authors: Yiwei Lyu , Samir Harake , Asadur Chowdury , Soumyanil Banerjee , Rachel Gologorsky , Shixuan Liu , Anna-Katharina Meissner , Akshay Rao , Chenhui Zhao , Akhil Kondepudi , Cheng Jiang , Xinhai Hou , Rushikesh S. Joshi , Volker Neuschmelting , Ashok Srinivasan , Dawn Kleindorfer , Brian Athey , Vikas Gulani , Aditya Pandey , Honglak Lee , Todd Hollon
URL: https://arxiv.org/abs/2509.18638
Abstract:

Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout \cite{Chen2017-bt, Rula2024-qp-1}. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima’s role in advancing AI-driven healthcare.

117. Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

Authors: Shuo Cheng , Liqian Ma , Zhenyang Chen , Ajay Mandlekar , Caelan Garrett , Danfei Xu
URL: https://arxiv.org/abs/2509.18631
Abstract:

Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various simulation and real domain gaps. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss within the co-training framework, and extend this to an Unbalanced OT framework to handle the imbalance between abundant simulation data and limited real-world examples. We validate our method on challenging manipulation tasks, showing it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate and even generalize to scenarios seen only in simulation.

118. HyperAdapt: Simple High-Rank Adaptation

Authors: Abel Gurung , Joseph Campbell
URL: https://arxiv.org/abs/2509.18629
Abstract:

Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods like LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only $n+m$ trainable parameters for an $n \times m$ matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt’s updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.

119. BRAID: Input-Driven Nonlinear Dynamical Modeling of Neural-Behavioral Data

Authors: Parsa Vahidi , Omid G. Sani , Maryam M. Shanechi
URL: https://arxiv.org/abs/2509.18627
Abstract:

Neural populations exhibit complex recurrent structures that drive behavior, while continuously receiving and integrating external inputs from sensory stimuli, upstream regions, and neurostimulation. However, neural populations are often modeled as autonomous dynamical systems, with little consideration given to the influence of external inputs that shape the population activity and behavioral outcomes. Here, we introduce BRAID, a deep learning framework that models nonlinear neural dynamics underlying behavior while explicitly incorporating any measured external inputs. Our method disentangles intrinsic recurrent neural population dynamics from the effects of inputs by including a forecasting objective within input-driven recurrent neural networks. BRAID further prioritizes the learning of intrinsic dynamics that are related to a behavior of interest by using a multi-stage optimization scheme. We validate BRAID with nonlinear simulations, showing that it can accurately learn the intrinsic dynamics shared between neural and behavioral modalities. We then apply BRAID to motor cortical activity recorded during a motor task and demonstrate that our method more accurately fits the neural-behavioral data by incorporating measured sensory stimuli into the model and improves the forecasting of neural-behavioral data compared with various baseline methods, whether input-driven or not.

120. The Case for Negative Data: From Crash Reports to Counterfactuals for Reasonable Driving

Authors: Jay Patrikar , Apoorva Sharma , Sushant Veer , Boyi Li , Sebastian Scherer , Marco Pavone
URL: https://arxiv.org/abs/2509.18626
Abstract:

Learning-based autonomous driving systems are trained mostly on incident-free data, offering little guidance near safety-performance boundaries. Real crash reports contain precisely the contrastive evidence needed, but they are hard to use: narratives are unstructured, third-person, and poorly grounded to sensor views. We address these challenges by normalizing crash narratives to ego-centric language and converting both logs and crashes into a unified scene-action representation suitable for retrieval. At decision time, our system adjudicates proposed actions by retrieving relevant precedents from this unified index; an agentic counterfactual extension proposes plausible alternatives, retrieves for each, and reasons across outcomes before deciding. On a nuScenes benchmark, precedent retrieval substantially improves calibration, with recall on contextually preferred actions rising from 24% to 53%. The counterfactual variant preserves these gains while sharpening decisions near risk.

121. Flow marching for a generative PDE foundation model

Authors: Zituo Chen , Sili Deng
URL: https://arxiv.org/abs/2509.18611
Abstract:

Pretraining on large-scale collections of PDE-governed spatiotemporal trajectories has recently shown promise for building generalizable models of dynamical systems. Yet most existing PDE foundation models rely on deterministic Transformer architectures, which lack generative flexibility for many science and engineering applications. We propose Flow Marching, an algorithm that bridges neural operator learning with flow matching motivated by an analysis of error accumulation in physical dynamical systems, and we build a generative PDE foundation model on top of it. By jointly sampling the noise level and the physical time step between adjacent states, the model learns a unified velocity field that transports a noisy current state toward its clean successor, reducing long-term rollout drift while enabling uncertainty-aware ensemble generations. Alongside this core algorithm, we introduce a Physics-Pretrained Variational Autoencoder (P2VAE) to embed physical states into a compact latent space, and an efficient Flow Marching Transformer (FMT) that combines a diffusion-forcing scheme with latent temporal pyramids, achieving up to 15x greater computational efficiency than full-length video diffusion models and thereby enabling large-scale pretraining at substantially reduced cost. We curate a corpus of ~2.5M trajectories across 12 distinct PDE families and train suites of P2VAEs and FMTs at multiple scales. On downstream evaluation, we benchmark on unseen Kolmogorov turbulence with few-shot adaptation, demonstrate long-term rollout stability over deterministic counterparts, and present uncertainty-stratified ensemble results, highlighting the importance of generative PDE foundation models for real-world applications.

Authors: Ana Luiza Mineiro , Francisco Affonso , Marcelo Becker
URL: https://arxiv.org/abs/2509.18608
Abstract:

Reliable navigation in under-canopy agricultural environments remains a challenge due to GNSS unreliability, cluttered rows, and variable lighting. To address these limitations, we present an end-to-end learning-based navigation system that maps raw 3D LiDAR data directly to control commands using a deep reinforcement learning policy trained entirely in simulation. Our method includes a voxel-based downsampling strategy that reduces LiDAR input size by 95.83%, enabling efficient policy learning without relying on labeled datasets or manually designed control interfaces. The policy was validated in simulation, achieving a 100% success rate in straight-row plantations and showing a gradual decline in performance as row curvature increased, tested across varying sinusoidal frequencies and amplitudes.

123. FlexSED: Towards Open-Vocabulary Sound Event Detection

Authors: Jiarui Hai , Helin Wang , Weizhe Guo , Mounya Elhilali
URL: https://arxiv.org/abs/2509.18606
Abstract:

Despite recent progress in large-scale sound event detection (SED) systems capable of handling hundreds of sound classes, existing multi-class classification frameworks remain fundamentally limited. They cannot process free-text sound queries, which enable more flexible and user-friendly interaction, and they lack zero-shot capabilities and offer poor few-shot adaptability. Although text-query-based separation methods have been explored, they primarily focus on source separation and are ill-suited for SED tasks that require precise temporal localization and efficient detection across large and diverse sound vocabularies. In this paper, we propose FlexSED, an open-vocabulary sound event detection system. FlexSED builds on a pretrained audio SSL model and the CLAP text encoder, introducing an encoder-decoder composition and an adaptive fusion strategy to enable effective continuous training from pretrained weights. To ensure robust supervision, it also employs large language models (LLMs) to assist in event query selection during training, addressing challenges related to missing labels. As a result, FlexSED achieves superior performance compared to vanilla SED models on AudioSet-Strong, while demonstrating strong zero-shot and few-shot capabilities. We release the code and pretrained models to support future research and applications based on FlexSED.

124. SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering

Authors: Jiarui Hai , Mounya Elhilali
URL: https://arxiv.org/abs/2509.18603
Abstract:

Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative-based augmentation for SED, we propose SynSonic, a data augmentation method tailored for this task. SynSonic leverages text-to-audio diffusion models guided by an energy-envelope ControlNet to generate temporally coherent sound events. A joint score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.

125. OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

Authors: Zhuoxiao Chen , Hongyang Yu , Ying Xu , Yadan Luo , Long Duong , Yuan-Fang Li
URL: https://arxiv.org/abs/2509.18600
Abstract:

Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO {OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2–3 orders of magnitude less training data using a small base VLM on modest hardware.

Authors: Neel P. Bhatt , Yunhao Yang , Rohan Siva , Pranay Samineni , Daniel Milan , Zhangyang Wang , Ufuk Topcu
URL: https://arxiv.org/abs/2509.18592
Abstract:

Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: this https URL .

127. TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning

Authors: Yu Chen , Yifei Han , Long Zhang , Yue Du , Bin Li
URL: https://arxiv.org/abs/2509.18585
Abstract:

Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisted of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at this https URL .

128. LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA

Authors: Zeyi Kang (1), Liang He (2), Yanxin Zhang (3), Zuheng Ming (4), Kaixing Zhao (5) ((1) Northwestern Polytechnical University, (2) University Sorbonne Paris Nord)
URL: https://arxiv.org/abs/2509.18576
Abstract:

Multimodal semantic learning plays a critical role in embodied intelligence, especially when robots perceive their surroundings, understand human instructions, and make intelligent decisions. However, the field faces technical challenges such as effective fusion of heterogeneous data and computational efficiency in resource-constrained environments. To address these challenges, this study proposes the lightweight LCMF cascaded attention framework, introducing a multi-level cross-modal parameter sharing mechanism into the Mamba module. By integrating the advantages of Cross-Attention and Selective parameter-sharing State Space Models (SSMs), the framework achieves efficient fusion of heterogeneous modalities and semantic complementary alignment. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% in VQA tasks and achieves competitive mid-tier performance within the distribution cluster of Large Language Model Agents (LLM Agents) in EQA video tasks. Its lightweight design achieves a 4.35-fold reduction in FLOPs relative to the average of comparable baselines while using only 166.51M parameters (image-text) and 219M parameters (video-text), providing an efficient solution for Human-Robot Interaction (HRI) applications in resource-constrained scenarios with strong multimodal decision generalization capabilities.

129. The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking

Authors: Yaoyao Qian , Yifan Zeng , Yuchao Jiang , Chelsi Jain , Huazheng Wang
URL: https://arxiv.org/abs/2509.18575
Abstract:

Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the “Ranking Blind Spot”, a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: Decision Objective Hijacking, which alters the evaluation goal in pairwise ranking systems, and Decision Criteria Hijacking, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiment, We empirically show that the proposed attacks are effective in various LLMs and can be generalized to multiple ranking schemes. We apply these attack to realistic examples to show their effectiveness. We also found stronger LLMs are more vulnerable to these attacks. Our code is available at: this https URL

130. Interaction Topological Transformer for Multiscale Learning in Porous Materials

Authors: Dong Chen , Jian Liu , Chun-Long Chen , Guo-Wei Wei
URL: https://arxiv.org/abs/2509.18573
Abstract:

Porous materials exhibit vast structural diversity and support critical applications in gas storage, separations, and catalysis. However, predictive modeling remains challenging due to the multiscale nature of structure-property relationships, where performance is governed by both local chemical environments and global pore-network topology. These complexities, combined with sparse and unevenly distributed labeled data, hinder generalization across material families. We propose the Interaction Topological Transformer (ITT), a unified data-efficient framework that leverages novel interaction topology to capture materials information across multiple scales and multiple levels, including structural, elemental, atomic, and pairwise-elemental organization. ITT extracts scale-aware features that reflect both compositional and relational structure within complex porous frameworks, and integrates them through a built-in Transformer architecture that supports joint reasoning across scales. Trained using a two-stage strategy, i.e., self-supervised pretraining on 0.6 million unlabeled structures followed by supervised fine-tuning, ITT achieves state-of-the-art, accurate, and transferable predictions for adsorption, transport, and stability properties. This framework provides a principled and scalable path for learning-guided discovery in structurally and chemically diverse porous materials.

131. Explore the Reinforcement Learning for the LLM based ASR and TTS system

Authors: Changfeng Gao , Yabin Li , Keyu An , Zhifu Gao , Zhihao Du , Han Zhao , Xiangang Li
URL: https://arxiv.org/abs/2509.18569
Abstract:

In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.

132. CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection

Authors: Jiaxun Yang , Yifei Han , Long Zhang , Liu Yujie , Bin Li , Bo Gao , Yangfan He , Kejia Zhan
URL: https://arxiv.org/abs/2509.18562
Abstract:

Chinese Patronizing and Condescending Language (CPCL) is an implicitly discriminatory toxic speech targeting vulnerable groups on Chinese video platforms. The existing dataset lacks user comments, which are a direct reflection of video content. This undermines the model’s understanding of video content and results in the failure to detect some CPLC videos. To make up for this loss, this research reconstructs a new dataset PCLMMPLUS that includes 103k comment entries and expands the dataset size. We also propose the CPCLDetector model with alignment selection and knowledge-enhanced comment content modules. Extensive experiments show the proposed CPCLDetector outperforms the SOTA on PCLMM and achieves higher performance on PCLMMPLUS . CPLC videos are detected more accurately, supporting content governance and protecting vulnerable groups. Code and dataset are available at this https URL .

133. SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes

Authors: Dayun Choi , Jung-Woo Choi
URL: https://arxiv.org/abs/2509.18561
Abstract:

Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), which represent an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain to preserve full spatial information in multichannel signals. The input feature expressed in terms of spatial correlations is fused with a DoA clue represented as spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported in the previous band-split architectures. We also incorporate the iterative refinement strategy, chain-of-inference (CoI), in the TSE framework, which recursively fuses DoA with sound event activation estimated from the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.

134. Global Minimizers of Sigmoid Contrastive Loss

Authors: Kiril Bangachev , Guy Bresler , Iliyas Noman , Yury Polyanskiy
URL: https://arxiv.org/abs/2509.18552
Abstract:

The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}{\mathsf{rel}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.

135. Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

Authors: Qi Wang , Hanyang Peng , Yue Yu
URL: https://arxiv.org/abs/2509.18542
Abstract:

Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To circumvent the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Llama2-Chat and Code Llama). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a single lightweight stage of router training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.

136. CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs

Authors: Jin Young Kim , Ji Won Yoon
URL: https://arxiv.org/abs/2509.18536
Abstract:

Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller models remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose \textbf{C}ycle-\textbf{C}onsistency in \textbf{Q}uestion \textbf{A}nswering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. From the experimental results, it is verified that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at this https URL .

137. No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Authors: Seungyoun Shin , Dongha Ahn , Jiwoo Kim , Sungwook Jeon
URL: https://arxiv.org/abs/2509.18531
Abstract:

Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for \textit{prosody}, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an \textit{iterative Direct Preference Optimization (DPO)} scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On \textbf{KoCC-TTS}, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, \textit{human preference optimization} offers a practical and data-efficient path to natural and robust TTS. The demo page is available at \href{ this https URL }

138. Automatic coherence-driven inference on arguments

Authors: Steve Huntsman
URL: https://arxiv.org/abs/2509.18523
Abstract:

Inconsistencies are ubiquitous in law, administration, and jurisprudence. Though a cure is too much to hope for, we propose a technological remedy. Large language models (LLMs) can accurately extract propositions from arguments and compile them into natural data structures that enable coherence-driven inference (CDI) via combinatorial optimization. This neurosymbolic architecture naturally separates concerns and enables meaningful judgments about the coherence of arguments that can inform legislative and policy analysis and legal reasoning.

139. APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation

Authors: Yuzhen Zhou , Jiajun Li , Yusheng Su , Gowtham Ramesh , Zilin Zhu , Xiang Long , Chenyang Zhao , Jin Pan , Xiaodong Yu , Ze Wang , Kangrui Du , Jialian Wu , Ximeng Sun , Jiang Liu , Qiaolin Yu , Hao Chen , Zicheng Liu , Emad Barsoum
URL: https://arxiv.org/abs/2509.18521
Abstract:

Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community’s growing RL needs, numerous RL frameworks have been proposed. Most of these frameworks primarily rely on inference engines for rollout generation and training engines for policy updates. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by at most 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves at most 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems.

140. Coherence-driven inference for cybersecurity

Authors: Steve Huntsman
URL: https://arxiv.org/abs/2509.18520
Abstract:

Large language models (LLMs) can compile weighted graphs on natural language data to enable automatic coherence-driven inference (CDI) relevant to red and blue team operations in cybersecurity. This represents an early application of automatic CDI that holds near- to medium-term promise for decision-making in cybersecurity and eventually also for autonomous blue team operations.

141. A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

Authors: Mohamad Elzohbi , Richard Zhao
URL: https://arxiv.org/abs/2509.18514
Abstract:

This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.

142. Dynamical Modeling of Behaviorally Relevant Spatiotemporal Patterns in Neural Imaging Data

Authors: Mohammad Hosseini , Maryam M. Shanechi
URL: https://arxiv.org/abs/2509.18507
Abstract:

High-dimensional imaging of neural activity, such as widefield calcium and functional ultrasound imaging, provide a rich source of information for understanding the relationship between brain activity and behavior. Accurately modeling neural dynamics in these modalities is crucial for understanding this relationship but is hindered by the high-dimensionality, complex spatiotemporal dependencies, and prevalent behaviorally irrelevant dynamics in these modalities. Existing dynamical models often employ preprocessing steps to obtain low-dimensional representations from neural image modalities. However, this process can discard behaviorally relevant information and miss spatiotemporal structure. We propose SBIND, a novel data-driven deep learning framework to model spatiotemporal dependencies in neural images and disentangle their behaviorally relevant dynamics from other neural dynamics. We validate SBIND on widefield imaging datasets, and show its extension to functional ultrasound imaging, a recent modality whose dynamical modeling has largely remained unexplored. We find that our model effectively identifies both local and long-range spatial dependencies across the brain while also dissociating behaviorally relevant neural dynamics. Doing so, SBIND outperforms existing models in neural-behavioral prediction. Overall, SBIND provides a versatile tool for investigating the neural mechanisms underlying behavior using imaging modalities.

143. Hyperbolic Coarse-to-Fine Few-Shot Class-Incremental Learning

Authors: Jiaxin Dai , Xiang Xiang
URL: https://arxiv.org/abs/2509.18504
Abstract:

In the field of machine learning, hyperbolic space demonstrates superior representation capabilities for hierarchical data compared to conventional Euclidean space. This work focuses on the Coarse-To-Fine Few-Shot Class-Incremental Learning (C2FSCIL) task. Our study follows the Knowe approach, which contrastively learns coarse class labels and subsequently normalizes and freezes the classifier weights of learned fine classes in the embedding space. To better interpret the “coarse-to-fine” paradigm, we propose embedding the feature extractor into hyperbolic space. Specifically, we employ the Poincaré ball model of hyperbolic space, enabling the feature extractor to transform input images into feature vectors within the Poincaré ball instead of Euclidean space. We further introduce hyperbolic contrastive loss and hyperbolic fully-connected layers to facilitate model optimization and classification in hyperbolic space. Additionally, to enhance performance under few-shot conditions, we implement maximum entropy distribution in hyperbolic space to estimate the probability distribution of fine-class feature vectors. This allows generation of augmented features from the distribution to mitigate overfitting during training with limited samples. Experiments on C2FSCIL benchmarks show that our method effectively improves both coarse and fine class accuracies.

144. LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Authors: Zeyu Liu , Souvik Kundu , Lianghao Jiang , Anni Li , Srikanth Ronanki , Sravan Bodapati , Gourav Datta , Peter A. Beerel
URL: https://arxiv.org/abs/2509.18467
Abstract:

Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that, distilling Mistral-7B with only 1K-length sequences yields over 90\% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1\&2\&3 tasks (1K-8K context length) and BABILong benchmark (QA2\&QA3, 0K-16K context length), requiring less than 0.1\% pre-training tokens compared with pre-training models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.

145. Zero-Shot Visual Deepfake Detection: Can AI Predict and Prevent Fake Content Before It’s Created?

Authors: Ayan Sar , Sampurna Roy , Tanupriya Choudhury , Ajith Abraham
URL: https://arxiv.org/abs/2509.18461
Abstract:

Generative adversarial networks (GANs) and diffusion models have dramatically advanced deepfake technology, and its threats to digital security, media integrity, and public trust have increased rapidly. This research explored zero-shot deepfake detection, an emerging method even when the models have never seen a particular deepfake variation. In this work, we studied self-supervised learning, transformer-based zero-shot classifier, generative model fingerprinting, and meta-learning techniques that better adapt to the ever-evolving deepfake threat. In addition, we suggested AI-driven prevention strategies that mitigated the underlying generation pipeline of the deepfakes before they occurred. They consisted of adversarial perturbations for creating deepfake generators, digital watermarking for content authenticity verification, real-time AI monitoring for content creation pipelines, and blockchain-based content verification frameworks. Despite these advancements, zero-shot detection and prevention faced critical challenges such as adversarial attacks, scalability constraints, ethical dilemmas, and the absence of standardized evaluation benchmarks. These limitations were addressed by discussing future research directions on explainable AI for deepfake detection, multimodal fusion based on image, audio, and text analysis, quantum AI for enhanced security, and federated learning for privacy-preserving deepfake detection. This further highlighted the need for an integrated defense framework for digital authenticity that utilized zero-shot learning in combination with preventive deepfake mechanisms. Finally, we highlighted the important role of interdisciplinary collaboration between AI researchers, cybersecurity experts, and policymakers to create resilient defenses against the rising tide of deepfake attacks.

146. CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

Authors: Daniel Kaiser , Arnoldo Frigessi , Ali Ramezani-Kebrya , Benjamin Ricaud
URL: https://arxiv.org/abs/2509.18458
Abstract:

Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT’s core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.

147. PrioriTouch: Adapting to User Contact Preferences for Whole-Arm Physical Human-Robot Interaction

Authors: Rishabh Madan , Jiawei Lin , Mahika Goel , Angchen Xie , Xiaoyu Liang , Marcus Lee , Justin Guo , Pranav N. Thakkar , Rohan Banerjee , Jose Barreiros , Kate Tsui , Tom Silver , Tapomayukh Bhattacharjee
URL: https://arxiv.org/abs/2509.18447
Abstract:

Physical human-robot interaction (pHRI) requires robots to adapt to individual contact preferences, such as where and how much force is applied. Identifying preferences is difficult for a single contact; with whole-arm interaction involving multiple simultaneous contacts between the robot and human, the challenge is greater because different body parts can impose incompatible force requirements. In caregiving tasks, where contact is frequent and varied, such conflicts are unavoidable. With multiple preferences across multiple contacts, no single solution can satisfy all objectives–trade-offs are inherent, making prioritization essential. We present PrioriTouch, a framework for ranking and executing control objectives across multiple contacts. PrioriTouch can prioritize from a general collection of controllers, making it applicable not only to caregiving scenarios such as bed bathing and dressing but also to broader multi-contact settings. Our method combines a novel learning-to-rank approach with hierarchical operational space control, leveraging simulation-in-the-loop rollouts for data-efficient and safe exploration. We conduct a user study on physical assistance preferences, derive personalized comfort thresholds, and incorporate them into PrioriTouch. We evaluate PrioriTouch through extensive simulation and real-world experiments, demonstrating its ability to adapt to user contact preferences, maintain task performance, and enhance safety and comfort. Website: this https URL .

148. Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations

Authors: Oscar J. Ponce-Ponte , David Toro-Tobon , Luis F. Figueroa , Michael Gionfriddo , Megan Branda , Victor M. Montori , Saturnino Luz , Juan P. Brito
URL: https://arxiv.org/abs/2509.18439
Abstract:

Shared decision-making (SDM) is necessary to achieve patient-centred care. Currently no methodology exists to automatically measure SDM at scale. This study aimed to develop an automated approach to measure SDM by using language modelling and the conversational alignment (CA) score. A total of 157 video-recorded patient-doctor conversations from a randomized multi-centre trial evaluating SDM decision aids for anticoagulation in atrial fibrillations were transcribed and segmented into 42,559 sentences. Context-response pairs and negative sampling were employed to train deep learning (DL) models and fine-tuned BERT models via the next sentence prediction (NSP) task. Each top-performing model was used to calculate four types of CA scores. A random-effects analysis by clinician, adjusting for age, sex, race, and trial arm, assessed the association between CA scores and SDM outcomes: the Decisional Conflict Scale (DCS) and the Observing Patient Involvement in Decision-Making 12 (OPTION12) scores. p-values were corrected for multiple comparisons with the Benjamini-Hochberg method. Among 157 patients (34% female, mean age 70 SD 10.8), clinicians on average spoke more words than patients (1911 vs 773). The DL model without the stylebook strategy achieved a recall@1 of 0.227, while the fine-tuned BERTbase (110M) achieved the highest recall@1 with 0.640. The AbsMax (18.36 SE7.74 p=0.025) and Max CA (21.02 SE7.63 p=0.012) scores generated with the DL without stylebook were associated with OPTION12. The Max CA score generated with the fine-tuned BERTbase (110M) was associated with the DCS score (-27.61 SE12.63 p=0.037). BERT model sizes did not have an impact the association between CA scores and SDM. This study introduces an automated, scalable methodology to measure SDM in patient-doctor conversations through explainable CA scores, with potential to evaluate SDM strategies at scale.

149. Scattering Transformer: A Training-Free Transformer Architecture for Heart Murmur Detection

Authors: Rami Zewail
URL: https://arxiv.org/abs/2509.18424
Abstract:

In an attempt to address the need for skilled clinicians in heart sound interpretation, recent research efforts on automating cardiac auscultation have explored deep learning approaches. The majority of these approaches have been based on supervised learning that is always challenged in occasions where training data is limited. More recently, there has been a growing interest in potentials of pre-trained self-supervised audio foundation models for biomedical end tasks. Despite exhibiting promising results, these foundational models are typically computationally intensive. Within the context of automatic cardiac auscultation, this study explores a lightweight alternative to these general-purpose audio foundation models by introducing the Scattering Transformer, a novel, training-free transformer architecture for heart murmur detection. The proposed method leverages standard wavelet scattering networks by introducing contextual dependencies in a transformer-like architecture without any backpropagation. We evaluate our approach on the public CirCor DigiScope dataset, directly comparing it against leading general-purpose foundational models. The Scattering Transformer achieves a Weighted Accuracy(WAR) of 0.786 and an Unweighted Average Recall(UAR) of 0.697, demonstrating performance highly competitive with contemporary state of the art methods. This study establishes the Scattering Transformer as a viable and promising alternative in resource-constrained setups.

150. Context Lineage Assurance for Non-Human Identities in Critical Multi-Agent Systems

Authors: Sumana Malkapuram , Sameera Gangavarapu , Kailashnath Reddy Kavalakuntla , Ananya Gangavarapu
URL: https://arxiv.org/abs/2509.18415
Abstract:

The proliferation of autonomous software agents necessitates rigorous frameworks for establishing secure and verifiable agent-to-agent (A2A) interactions, particularly when such agents are instantiated as non-human identities(NHIs). We extend the A2A paradigm [1 , 2] by introducing a cryptographically grounded mechanism for lineage verification, wherein the provenance and evolution of NHIs are anchored in append-only Merkle tree structures modeled after Certificate Transparency (CT) logs. Unlike traditional A2A models that primarily secure point-to-point interactions, our approach enables both agents and external verifiers to cryptographically validate multi-hop provenance, thereby ensuring the integrity of the entire call chain. A federated proof server acts as an auditor across one or more Merkle logs, aggregating inclusion proofs and consistency checks into compact, signed attestations that external parties can verify without access to the full execution trace. In parallel, we augment the A2A agent card to incorporate explicit identity verification primitives, enabling both peer agents and human approvers to authenticate the legitimacy of NHI representations in a standardized manner. Together, these contributions establish a cohesive model that integrates identity attestation, lineage verification, and independent proof auditing, thereby advancing the security posture of inter-agent ecosystems and providing a foundation for robust governance of NHIs in regulated environments such as FedRAMP.

Authors: Navya Tiwari , Joseph Vazhaeparampil , Victoria Preston
URL: https://arxiv.org/abs/2509.18407
Abstract:

Uncontrolled intersections account for a significant fraction of roadway crashes due to ambiguous right-of-way rules, occlusions, and unpredictable driver behavior. While autonomous vehicle research has explored uncertainty-aware decision making, few systems exist to retrofit human-operated vehicles with assistive navigation support. We present a driver-assist framework for right-of-way reasoning at uncontrolled intersections, formulated as a Partially Observable Markov Decision Process (POMDP). Using a custom simulation testbed with stochastic traffic agents, pedestrians, occlusions, and adversarial scenarios, we evaluate four decision-making approaches: a deterministic finite state machine (FSM), and three probabilistic planners: QMDP, POMCP, and DESPOT. Results show that probabilistic planners outperform the rule-based baseline, achieving up to 97.5 percent collision-free navigation under partial observability, with POMCP prioritizing safety and DESPOT balancing efficiency and runtime feasibility. Our findings highlight the importance of uncertainty-aware planning for driver assistance and motivate future integration of sensor fusion and environment perception modules for real-time deployment in realistic traffic environments.

152. Check Field Detection Agent (CFD-Agent) using Multimodal Large Language and Vision Language Models

Authors: Sourav Halder , Jinjun Tong , Xinyu Wu
URL: https://arxiv.org/abs/2509.18405
Abstract:

Checks remain a foundational instrument in the financial ecosystem, facilitating substantial transaction volumes across institutions. However, their continued use also renders them a persistent target for fraud, underscoring the importance of robust check fraud detection mechanisms. At the core of such systems lies the accurate identification and localization of critical fields, such as the signature, magnetic ink character recognition (MICR) line, courtesy amount, legal amount, payee, and payer, which are essential for subsequent verification against reference checks belonging to the same customer. This field-level detection is traditionally dependent on object detection models trained on large, diverse, and meticulously labeled datasets, a resource that is scarce due to proprietary and privacy concerns. In this paper, we introduce a novel, training-free framework for automated check field detection, leveraging the power of a vision language model (VLM) in conjunction with a multimodal large language model (MLLM). Our approach enables zero-shot detection of check components, significantly lowering the barrier to deployment in real-world financial settings. Quantitative evaluation of our model on a hand-curated dataset of 110 checks spanning multiple formats and layouts demonstrates strong performance and generalization capability. Furthermore, this framework can serve as a bootstrap mechanism for generating high-quality labeled datasets, enabling the development of specialized real-time object detection models tailored to institutional needs.

153. An Artificial Intelligence Value at Risk Approach: Metrics and Models

Authors: Luis Enriquez Alvarez
URL: https://arxiv.org/abs/2509.18394
Abstract:

Artificial intelligence risks are multidimensional in nature, as the same risk scenarios may have legal, operational, and financial risk dimensions. With the emergence of new AI regulations, the state of the art of artificial intelligence risk management seems to be highly immature due to upcoming AI regulations. Despite the appearance of several methodologies and generic criteria, it is rare to find guidelines with real implementation value, considering that the most important issue is customizing artificial intelligence risk metrics and risk models for specific AI risk scenarios. Furthermore, the financial departments, legal departments and Government Risk Compliance teams seem to remain unaware of many technical aspects of AI systems, in which data scientists and AI engineers emerge as the most appropriate implementers. It is crucial to decompose the problem of artificial intelligence risk in several dimensions: data protection, fairness, accuracy, robustness, and information security. Consequently, the main task is developing adequate metrics and risk models that manage to reduce uncertainty for decision-making in order to take informed decisions concerning the risk management of AI systems. The purpose of this paper is to orientate AI stakeholders about the depths of AI risk management. Although it is not extremely technical, it requires a basic knowledge of risk management, quantifying uncertainty, the FAIR model, machine learning, large language models and AI context engineering. The examples presented pretend to be very basic and understandable, providing simple ideas that can be developed regarding specific AI customized environments. There are many issues to solve in AI risk management, and this paper will present a holistic overview of the inter-dependencies of AI risks, and how to model them together, within risk scenarios.

154. Graph Enhanced Trajectory Anomaly Detection

Authors: Jonathan Kabala Mbuya , Dieter Pfoser , Antonios Anastasopoulos
URL: https://arxiv.org/abs/2509.18386
Abstract:

Trajectory anomaly detection is essential for identifying unusual and unexpected movement patterns in applications ranging from intelligent transportation systems to urban safety and fraud prevention. Existing methods only consider limited aspects of the trajectory nature and its movement space by treating trajectories as sequences of sampled locations, with sampling determined by positioning technology, e.g., GPS, or by high-level abstractions such as staypoints. Trajectories are analyzed in Euclidean space, neglecting the constraints and connectivity information of the underlying movement network, e.g., road or transit networks. The proposed Graph Enhanced Trajectory Anomaly Detection (GETAD) framework tightly integrates road network topology, segment semantics, and historical travel patterns to model trajectory data. GETAD uses a Graph Attention Network to learn road-aware embeddings that capture both physical attributes and transition behavior, and augments these with graph-based positional encodings that reflect the spatial layout of the road network. A Transformer-based decoder models sequential movement, while a multiobjective loss function combining autoregressive prediction and supervised link prediction ensures realistic and structurally coherent representations. To improve the robustness of anomaly detection, we introduce Confidence Weighted Negative Log Likelihood (CW NLL), an anomaly scoring function that emphasizes high-confidence deviations. Experiments on real-world and synthetic datasets demonstrate that GETAD achieves consistent improvements over existing methods, particularly in detecting subtle anomalies in road-constrained environments. These results highlight the benefits of incorporating graph structure and contextual semantics into trajectory modeling, enabling more precise and context-aware anomaly detection.

155. Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

Authors: Riad Ahmed Anonto , Sardar Md. Saffat Zabin , M. Saifur Rahman
URL: https://arxiv.org/abs/2509.18369
Abstract:

Grounding vision–language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN–BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real–synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real–synthetic centroid gap by 41%.

156. Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. Data

Authors: Zhuoyu Yao , Yue Wang , Songyang Zhang , Yingshu Li , Zhipeng Cai , Zhi Tian
URL: https://arxiv.org/abs/2509.18367
Abstract:

Recent advances in distributed swarm learning (DSL) offer a promising paradigm for edge Internet of Things. Such advancements enhance data privacy, communication efficiency, energy saving, and model scalability. However, the presence of non-independent and identically distributed (non-i.i.d.) data pose a significant challenge for multi-access edge computing, degrading learning performance and diverging training behavior of vanilla DSL. Further, there still lacks theoretical guidance on how data heterogeneity affects model training accuracy, which requires thorough investigation. To fill the gap, this paper first study the data heterogeneity by measuring the impact of non-i.i.d. datasets under the DSL framework. This then motivates a new multi-worker selection design for DSL, termed M-DSL algorithm, which works effectively with distributed heterogeneous data. A new non-i.i.d. degree metric is introduced and defined in this work to formulate the statistical difference among local datasets, which builds a connection between the measure of data heterogeneity and the evaluation of DSL performance. In this way, our M-DSL guides effective selection of multiple works who make prominent contributions for global model updates. We also provide theoretical analysis on the convergence behavior of our M-DSL, followed by extensive experiments on different heterogeneous datasets and non-i.i.d. data settings. Numerical results verify performance improvement and network intelligence enhancement provided by our M-DSL beyond the benchmarks.

157. FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

Authors: Yuxuan Cai , Xiaozhuan Liang , Xinghua Wang , Jin Ma , Haijin Liang , Jinwen Luo , Xinyu Zuo , Lisheng Duan , Yuyang Yin , Xi Chen
URL: https://arxiv.org/abs/2509.18362
Abstract:

As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and seamlessly integrates with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.

158. Reading Between the Lines: Scalable User Feedback via Implicit Sentiment in Developer Prompts

Authors: Daye Nam , Malgorzata Salawa , Satish Chandra
URL: https://arxiv.org/abs/2509.18361
Abstract:

Evaluating developer satisfaction with conversational AI assistants at scale is critical but challenging. User studies provide rich insights, but are unscalable, while large-scale quantitative signals from logs or in-product ratings are often too shallow or sparse to be reliable. To address this gap, we propose and evaluate a new approach: using sentiment analysis of developer prompts to identify implicit signals of user satisfaction. With an analysis of industrial usage logs of 372 professional developers, we show that this approach can identify a signal in ~8% of all interactions, a rate more than 13 times higher than explicit user feedback, with reasonable accuracy even with an off-the-shelf sentiment analysis approach. This new practical approach to complement existing feedback channels would open up new directions for building a more comprehensive understanding of the developer experience at scale.

159. Chiplet-Based RISC-V SoC with Modular AI Acceleration

Authors: P. Ramkumar , S. S. Bharadwaj
URL: https://arxiv.org/abs/2509.18355
Abstract:

Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this complex balance mainly due to low manufacturing yields (below 16%) at advanced 360 mm^2 process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system level optimization. Our proposed design integrates 4 different key innovations in a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while maintaining sub-5ms real-time capability across all experimented workloads. These performance upgrades demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability and upgradeability, crucial for next-generation edge AI device applications.

160. A Single Image Is All You Need: Zero-Shot Anomaly Localization Without Training Data

Authors: Mehrdad Moradi , Shengzhe Chen , Hao Yan , Kamran Paynabar
URL: https://arxiv.org/abs/2509.18354
Abstract:

Anomaly detection in images is typically addressed by learning from collections of training data or relying on reference samples. In many real-world scenarios, however, such training data may be unavailable, and only the test image itself is provided. We address this zero-shot setting by proposing a single-image anomaly localization method that leverages the inductive bias of convolutional neural networks, inspired by Deep Image Prior (DIP). Our method is named Single Shot Decomposition Network (SSDnet). Our key assumption is that natural images often exhibit unified textures and patterns, and that anomalies manifest as localized deviations from these repetitive or stochastic patterns. To learn the deep image prior, we design a patch-based training framework where the input image is fed directly into the network for self-reconstruction, rather than mapping random noise to the image as done in DIP. To avoid the model simply learning an identity mapping, we apply masking, patch shuffling, and small Gaussian noise. In addition, we use a perceptual loss based on inner-product similarity to capture structure beyond pixel fidelity. Our approach needs no external training data, labels, or references, and remains robust in the presence of noise or missing pixels. SSDnet achieves 0.99 AUROC and 0.60 AUPRC on MVTec-AD and 0.98 AUROC and 0.67 AUPRC on the fabric dataset, outperforming state-of-the-art methods. The implementation code will be released at this https URL

161. Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

Authors: Saksham Khatwani , He Cheng , Majid Afshar , Dmitriy Dligach , Yanjun Gao
URL: https://arxiv.org/abs/2509.18316
Abstract:

Large language models (LLMs) show promise for diagnostic reasoning but often lack reliable, knowledge grounded inference. Knowledge graphs (KGs), such as the Unified Medical Language System (UMLS), offer structured biomedical knowledge that can support trustworthy reasoning. Prior approaches typically integrate KGs via retrieval augmented generation or fine tuning, inserting KG content into prompts rather than enabling structured reasoning. We explore an alternative paradigm: treating the LLM as a reward model of KG reasoning paths, where the model learns to judge whether a candidate path leads to correct diagnosis for a given patient input. This approach is inspired by recent work that leverages reward training to enhance model reasoning abilities, and grounded in computational theory, which suggests that verifying a solution is often easier than generating one from scratch. It also parallels physicians’ diagnostic assessment, where they judge which sequences of findings and intermediate conditions most plausibly support a diagnosis. We first systematically evaluate five task formulation for knowledge path judging and eight training paradigm. Second, we test whether the path judging abilities generalize to downstream diagnostic tasks, including diagnosis summarization and medical question answering. Experiments with three open source instruct-tuned LLMs reveal both promise and brittleness: while specific reward optimization and distillation lead to strong path-judging performance, the transferability to downstream tasks remain weak. Our finding provides the first systematic assessment of “reward model style” reasoning over clinical KGs, offering insights into how structured, reward-based supervision influences diagnostic reasoning in GenAI systems for healthcare.

162. Evaluating Large Language Models for Detecting Antisemitism

Authors: Jay Patel , Hrudayangam Mehta , Jeremy Blackburn
URL: https://arxiv.org/abs/2509.18293
Abstract:

Detecting hateful content is a challenging and important problem. Automated tools, like machine-learning models, can help, but they require continuous training to adapt to the ever-changing landscape of social media. In this work, we evaluate eight open-source LLMs’ capability to detect antisemitic content, specifically leveraging in-context definition as a policy guideline. We explore various prompting techniques and design a new CoT-like prompt, Guided-CoT. Guided-CoT handles the in-context policy well, increasing performance across all evaluated models, regardless of decoding configuration, model sizes, or reasoning capability. Notably, Llama 3.1 70B outperforms fine-tuned GPT-3.5. Additionally, we examine LLM errors and introduce metrics to quantify semantic divergence in model-generated rationales, revealing notable differences and paradoxical behaviors among LLMs. Our experiments highlight the differences observed across LLMs’ utility, explainability, and reliability.

163. PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies

Authors: Jesse Zhang , Marius Memmel , Kevin Kim , Dieter Fox , Jesse Thomason , Fabio Ramos , Erdem Bıyık , Abhishek Gupta , Anqi Li
URL: https://arxiv.org/abs/2509.18282
Abstract:

Robotic manipulation policies often fail to generalize because they must simultaneously learn where to attend, what actions to take, and how to execute them. We argue that high-level reasoning about where and what can be offloaded to vision-language models (VLMs), leaving policies to specialize in how to act. We present PEEK (Policy-agnostic Extraction of Essential Keypoints), which fine-tunes VLMs to predict a unified point-based intermediate representation: 1. end-effector paths specifying what actions to take, and 2. task-relevant masks indicating where to focus. These annotations are directly overlaid onto robot observations, making the representation policy-agnostic and transferable across architectures. To enable scalable training, we introduce an automatic annotation pipeline, generating labeled data across 20+ robot datasets spanning 9 embodiments. In real-world evaluations, PEEK consistently boosts zero-shot generalization, including a 41.4x real-world improvement for a 3D policy trained only in simulation, and 2-3.5x gains for both large VLAs and small manipulation policies. By letting VLMs absorb semantic and visual complexity, PEEK equips manipulation policies with the minimal cues they need–where, what, and how. Website at this https URL .

164. Perceptions of AI Across Sectors: A Comparative Review of Public Attitudes

Authors: Filip Bialy , Mark Elliot , Robert Meckin
URL: https://arxiv.org/abs/2509.18233
Abstract:

This paper offers a domain-mediated comparative review of 251 studies on public attitudes toward AI, published between 2011 and 2025. Drawing on a systematic literature review, we analyse how different factors including perceived benefits and concerns (or risks) shape public acceptance of - or resistance to - artificial intelligence across domains and use-cases, including healthcare, education, security, public administration, generative AI, and autonomous vehicles. The analysis highlights recurring patterns in individual, contextual, and technical factors influencing perception, while also tracing variations in institutional trust, perceived fairness, and ethical concerns. We show that the public perception in AI is shaped not only by technical design or performance but also by sector-specific considerations as well as imaginaries, cultural narratives, and historical legacies. This comparative approach offers a foundation for developing more tailored and context-sensitive strategies for responsible AI governance.

165. Enhanced Interpretable Knowledge Tracing for Students Performance Prediction with Human understandable Feature Space

Authors: Sein Minn , Roger Nkambou
URL: https://arxiv.org/abs/2509.18231
Abstract:

Knowledge Tracing (KT) plays a central role in assessing students skill mastery and predicting their future performance. While deep learning based KT models achieve superior predictive accuracy compared to traditional methods, their complexity and opacity hinder their ability to provide psychologically meaningful explanations. This disconnect between model parameters and cognitive theory poses challenges for understanding and enhancing the learning process, limiting their trustworthiness in educational applications. To address these challenges, we enhance interpretable KT models by exploring human-understandable features derived from students interaction data. By incorporating additional features, particularly those reflecting students learning abilities, our enhanced approach improves predictive accuracy while maintaining alignment with cognitive theory. Our contributions aim to balance predictive power with interpretability, advancing the utility of adaptive learning systems.

166. Automatic Classification of Magnetic Chirality of Solar Filaments from H-Alpha Observations

Authors: Alexis Chalmers , Azim Ahmadzadeh
URL: https://arxiv.org/abs/2509.18214
Abstract:

In this study, we classify the magnetic chirality of solar filaments from H-Alpha observations using state-of-the-art image classification models. We establish the first reproducible baseline for solar filament chirality classification on the MAGFiLO dataset. The MAGFiLO dataset contains over 10,000 manually-annotated filaments from GONG H-Alpha observations, making it the largest dataset for filament detection and classification to date. Prior studies relied on much smaller datasets, which limited their generalizability and comparability. We fine-tuned several pre-trained, image classification architectures, including ResNet, WideResNet, ResNeXt, and ConvNeXt, and also applied data augmentation and per-class loss weights to optimize the models. Our best model, ConvNeXtBase, achieves a per-class accuracy of 0.69 for left chirality filaments and $0.73$ for right chirality filaments.

167. Variational Task Vector Composition

Authors: Boyuan Zhang , Yingjun Du , Xiantong Zhen , Ling Shao
URL: https://arxiv.org/abs/2509.18208
Abstract:

Task vectors capture how a model changes during fine-tuning by recording the difference between pre-trained and task-specific weights. The composition of task vectors, a key operator in task arithmetic, enables models to integrate knowledge from multiple tasks without incurring additional inference costs. In this paper, we propose variational task vector composition, where composition coefficients are taken as latent variables and estimated in a Bayesian inference framework. Unlike previous methods that operate at the task level, our framework focuses on sample-specific composition. Motivated by the observation of structural redundancy in task vectors, we introduce a Spike-and-Slab prior that promotes sparsity and preserves only the most informative components. To further address the high variance and sampling inefficiency in sparse, high-dimensional spaces, we develop a gated sampling mechanism that constructs a controllable posterior by filtering the composition coefficients based on both uncertainty and importance. This yields a more stable and interpretable variational framework by deterministically selecting reliable task components, reducing sampling variance while improving transparency and generalization. Experimental results demonstrate that our method consistently outperforms existing approaches across all datasets by selectively leveraging the most reliable and informative components in task vectors. These findings highlight the practical value of our approach, establishing a new standard for efficient and effective task vector composition.

Authors: Yu Ti Huang
URL: https://arxiv.org/abs/2509.18200
Abstract:

Conversational agents must translate egocentric utterances (e.g., “on my right”) into allocentric orientations (N/E/S/W). This challenge is particularly critical in indoor or complex facilities where GPS signals are weak and detailed maps are unavailable. While chain-of-thought (CoT) prompting has advanced reasoning in language and vision tasks, its application to multimodal spatial orientation remains underexplored. We introduce Conversational Orientation Reasoning (COR), a new benchmark designed for Traditional Chinese conversational navigation projected from real-world environments, addressing egocentric-to-allocentric reasoning in non-English and ASR-transcribed scenarios. We propose a multimodal chain-of-thought (MCoT) framework, which integrates ASR-transcribed speech with landmark coordinates through a structured three-step reasoning process: (1) extracting spatial relations, (2) mapping coordinates to absolute directions, and (3) inferring user orientation. A curriculum learning strategy progressively builds these capabilities on Taiwan-LLM-13B-v2.0-Chat, a mid-sized model representative of resource-constrained settings. Experiments show that MCoT achieves 100% orientation accuracy on clean transcripts and 98.1% with ASR transcripts, substantially outperforming unimodal and non-structured baselines. Moreover, MCoT demonstrates robustness under noisy conversational conditions, including ASR recognition errors and multilingual code-switching. The model also maintains high accuracy in cross-domain evaluation and resilience to linguistic variation, domain shift, and referential ambiguity. These findings highlight the potential of structured MCoT spatial reasoning as a path toward interpretable and resource-efficient embodied navigation.

169. MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech

Authors: Jialong Mai , Jinxin Ji , Xiaofen Xing , Chen Yang , Weidong Chen , Jingyuan Xing , Xiangmin Xu
URL: https://arxiv.org/abs/2509.18196
Abstract:

Mainstream Automatic Speech Recognition (ASR) systems excel at transcribing lexical content, but largely fail to recognize nonverbal vocalizations (NVs) embedded in speech, such as sighs, laughs, and coughs. This capability is important for a comprehensive understanding of human communication, as NVs convey crucial emotional and intentional cues. Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. Unlike most existing corpora that rely on model-based detection, MNV-17’s performative nature ensures high-fidelity, clearly articulated NV instances. To the best of our knowledge, MNV-17 provides the most extensive set of nonverbal vocalization categories, comprising 17 distinct and well-balanced classes of common NVs. We benchmarked MNV-17 on four mainstream ASR architectures, evaluating their joint performance on semantic transcription and NV classification. The dataset and the pretrained model checkpoints will be made publicly available to facilitate future research in expressive ASR.

170. TinyEcoWeedNet: Edge Efficient Real-Time Aerial Agricultural Weed Detection

Authors: Omar H. Khater , Abdul Jabbar Siddiqui , Aiman El-Maleh , M. Shamim Hossain
URL: https://arxiv.org/abs/2509.18193
Abstract:

Deploying deep learning models in agriculture is difficult because edge devices have limited resources, but this work presents a compressed version of EcoWeedNet using structured channel pruning, quantization-aware training (QAT), and acceleration with NVIDIA’s TensorRT on the Jetson Orin Nano. Despite the challenges of pruning complex architectures with residual shortcuts, attention mechanisms, concatenations, and CSP blocks, the model size was reduced by up to 68.5% and computations by 3.2 GFLOPs, while inference speed reached 184 FPS at FP16, 28.7% faster than the baseline. On the CottonWeedDet12 dataset, the pruned EcoWeedNet with a 39.5% pruning ratio outperformed YOLO11n and YOLO12n (with only 20% pruning), achieving 83.7% precision, 77.5% recall, and 85.9% mAP50, proving it to be both efficient and effective for precision agriculture.

171. HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

Authors: Junseong Shin , Seungwoo Chung , Yunjeong Yang , Tae Hyun Kim
URL: https://arxiv.org/abs/2509.18190
Abstract:

Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.

172. Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Authors: Daxiang Dong , Mingming Zheng , Dong Xu , Bairong Zhuang , Wenyu Zhang , Chunhua Luo , Haoran Wang , Zijian Zhao , Jie Li , Yuxuan Li , Hanjun Zhong , Mengyue Liu , Jieting Chen , Shupeng Li , Lun Tian , Yaping Feng , Xin Li , Donggang Jiang , Yong Chen , Yehua Xu , Duohao Qin , Chen Feng , Dan Wang , Henghua Zhang , Jingjing Ha , Jinhui He , Yanfeng Zhai , Chengxin Zheng , Jiayi Mao , Jiacheng Chen , Ruchang Yao , Ziye Yuan , Jianmin Wu , Guangjun Xie , Dou Shen
URL: https://arxiv.org/abs/2509.18189
Abstract:

We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong general performance. Qianfan-VL achieves comparable results to leading open-source models on general benchmarks, with state-of-the-art performance on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar. The domain enhancement strategy delivers significant advantages in OCR and document understanding, validated on both public benchmarks (OCRBench 873, DocVQA 94.75%) and in-house evaluations. Notably, Qianfan-VL-8B and 70B variants incorporate long chain-of-thought capabilities, demonstrating superior performance on mathematical reasoning (MathVista 78.6%) and logical inference tasks. All models are trained entirely on Baidu’s Kunlun P800 chips, validating the capability of large-scale AI infrastructure to train SOTA-level multimodal models with over 90% scaling efficiency on 5000 chips for a single task. This work establishes an effective methodology for developing domain-enhanced multimodal models suitable for diverse enterprise deployment scenarios.

173. V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling

Authors: Muhammad Naveed , Nazia Perwaiz , Sidra Sultana , Mohaira Ahmad , Muhammad Moazam Fraz
URL: https://arxiv.org/abs/2509.18187
Abstract:

Road traffic accidents remain a major public health challenge, particularly in countries with heterogeneous road conditions, mixed traffic flow, and variable driving discipline, such as Pakistan. Reliable detection of unsafe driving behaviours is a prerequisite for improving road safety, enabling advanced driver assistance systems (ADAS), and supporting data driven decisions in insurance and fleet management. Most of existing datasets originate from the developed countries with limited representation of the behavioural diversity observed in emerging economies and the driver’s face recording voilates the privacy preservation. We present V-SenseDrive, the first privacy-preserving multimodal driver behaviour dataset collected entirely within the Pakistani driving environment. V-SenseDrive combines smartphone based inertial and GPS sensor data with synchronized road facing video to record three target driving behaviours (normal, aggressive, and risky) on multiple types of roads, including urban arterials, secondary roads, and motorways. Data was gathered using a custom Android application designed to capture high frequency accelerometer, gyroscope, and GPS streams alongside continuous video, with all sources precisely time aligned to enable multimodal analysis. The focus of this work is on the data acquisition process, covering participant selection, driving scenarios, environmental considerations, and sensor video synchronization techniques. The dataset is structured into raw, processed, and semantic layers, ensuring adaptability for future research in driver behaviour classification, traffic safety analysis, and ADAS development. By representing real world driving in Pakistan, V-SenseDrive fills a critical gap in the global landscape of driver behaviour datasets and lays the groundwork for context aware intelligent transportation solutions.

174. Visionerves: Automatic and Reproducible Hybrid AI for Peripheral Nervous System Recognition Applied to Endometriosis Cases

Authors: Giammarco La Barbera , Enzo Bonnot , Thomas Isla , Juan Pablo de la Plata , Joy-Rose Dunoyer de Segonzac , Jennifer Attali , Cécile Lozach , Alexandre Bellucci , Louis Marcellin , Laure Fournier , Sabine Sarnacki , Pietro Gori , Isabelle Bloch
URL: https://arxiv.org/abs/2509.18185
Abstract:

Endometriosis often leads to chronic pelvic pain and possible nerve involvement, yet imaging the peripheral nerves remains a challenge. We introduce Visionerves, a novel hybrid AI framework for peripheral nervous system recognition from multi-gradient DWI and morphological MRI data. Unlike conventional tractography, Visionerves encodes anatomical knowledge through fuzzy spatial relationships, removing the need for selection of manual ROIs. The pipeline comprises two phases: (A) automatic segmentation of anatomical structures using a deep learning model, and (B) tractography and nerve recognition by symbolic spatial reasoning. Applied to the lumbosacral plexus in 10 women with (confirmed or suspected) endometriosis, Visionerves demonstrated substantial improvements over standard tractography, with Dice score improvements of up to 25% and spatial errors reduced to less than 5 mm. This automatic and reproducible approach enables detailed nerve analysis and paves the way for non-invasive diagnosis of endometriosis-related neuropathy, as well as other conditions with nerve involvement.

175. VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation

Authors: Jinyue Bian , Zhaoxing Zhang , Zhengyu Liang , Shiwei Zheng , Shengtao Zhang , Rong Shen , Chen Yang , Anzhou Hou
URL: https://arxiv.org/abs/2509.18183
Abstract:

The Visual-Language-Action (VLA) models can follow text instructions according to visual observations of the surrounding environment. This ability to map multimodal inputs to actions is derived from the training of the VLA model on extensive standard demonstrations. These visual observations captured by third-personal global and in-wrist local cameras are inevitably varied in number and perspective across different environments, resulting in significant differences in the visual features. This perspective heterogeneity constrains the generality of VLA models. In light of this, we first propose the lightweight module VLA-LPAF to foster the perspective adaptivity of VLA models using only 2D data. VLA-LPAF is finetuned using images from a single view and fuses other multiview observations in the latent space, which effectively and efficiently bridge the gap caused by perspective inconsistency. We instantiate our VLA-LPAF framework with the VLA model RoboFlamingo to construct RoboFlamingo-LPAF. Experiments show that RoboFlamingo-LPAF averagely achieves around 8% task success rate improvement on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark. We also demonstrate the developed viewadaptive characteristics of the proposed RoboFlamingo-LPAF through real-world tasks.

176. The Describe-Then-Generate Bottleneck: How VLM Descriptions Alter Image Generation Outcomes

Authors: Sai Varun Kodathala , Rakesh Vunnam
URL: https://arxiv.org/abs/2509.18179
Abstract:

With the increasing integration of multimodal AI systems in creative workflows, understanding information loss in vision-language-vision pipelines has become important for evaluating system limitations. However, the degradation that occurs when visual content passes through textual intermediation remains poorly quantified. In this work, we provide empirical analysis of the describe-then-generate bottleneck, where natural language serves as an intermediate representation for visual information. We generated 150 image pairs through the describe-then-generate pipeline and applied existing metrics (LPIPS, SSIM, and color distance) to measure information preservation across perceptual, structural, and chromatic dimensions. Our evaluation reveals that 99.3% of samples exhibit substantial perceptual degradation and 91.5% demonstrate significant structural information loss, providing empirical evidence that the describe-then-generate bottleneck represents a measurable and consistent limitation in contemporary multimodal systems.

177. A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts

Authors: George Corrêa de Araújo , Helena de Almeida Maia , Helio Pedrini
URL: https://arxiv.org/abs/2509.18177
Abstract:

In this paper, we present the Scrapbook framework, a novel methodology designed to generate extensive datasets for probing the learned concepts of artificial intelligence (AI) models. The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification. By generating datasets with a large number of questions about individual concepts and a wide linguistic variation, the Scrapbook framework aims to validate the model’s understanding of these basic elements before tackling more complex tasks. Our experimental findings reveal that, while contemporary models demonstrate proficiency in recognizing and enumerating objects, they encounter challenges in comprehending positional information and addressing inquiries with additional constraints. Specifically, the MobileVLM-V2 model showed significant answer disagreements and plausible wrong answers, while other models exhibited a bias toward affirmative answers and struggled with questions involving geometric shapes and positional information, indicating areas for improvement in understanding and consistency. The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets, which can be utilized to systematically assess and enhance the performance of AI models.

178. Developing Training Procedures for Piecewise-linear Spline Activation Functions in Neural Networks

Authors: William H Patty
URL: https://arxiv.org/abs/2509.18161
Abstract:

Activation functions in neural networks are typically selected from a set of empirically validated, commonly used static functions such as ReLU, tanh, or sigmoid. However, by optimizing the shapes of a network’s activation functions, we can train models that are more parameter-efficient and accurate by assigning more optimal activations to the neurons. In this paper, I present and compare 9 training methodologies to explore dual-optimization dynamics in neural networks with parameterized linear B-spline activation functions. The experiments realize up to 94% lower end model error rates in FNNs and 51% lower rates in CNNs compared to traditional ReLU-based models. These gains come at the cost of additional development and training complexity as well as end model latency.

179. Event Causality Identification with Synthetic Control

Authors: Haoyu Wang , Fengze Liu , Jiayao Zhang , Dan Roth , Kyle Richardson
URL: https://arxiv.org/abs/2509.18156
Abstract:

Event causality identification (ECI), a process that extracts causal relations between events from text, is crucial for distinguishing causation from correlation. Traditional approaches to ECI have primarily utilized linguistic patterns and multi-hop relational inference, risking false causality identification due to informal usage of causality and specious graphical inference. In this paper, we adopt the Rubin Causal Model to identify event causality: given two temporally ordered events, we see the first event as the treatment and the second one as the observed outcome. Determining their causality involves manipulating the treatment and estimating the resultant change in the likelihood of the outcome. Given that it is only possible to implement manipulation conceptually in the text domain, as a work-around, we try to find a twin for the protagonist from existing corpora. This twin should have identical life experiences with the protagonist before the treatment but undergoes an intervention of treatment. However, the practical difficulty of locating such a match limits its feasibility. Addressing this issue, we use the synthetic control method to generate such a twin’ from relevant historical data, leveraging text embedding synthesis and inversion techniques. This approach allows us to identify causal relations more robustly than previous methods, including GPT-4, which is demonstrated on a causality benchmark, COPES-hard.

180. WLFM: A Well-Logs Foundation Model for Multi-Task and Cross-Well Geological Interpretation

Authors: Zhenyu Qi , Qing Yu , Jichen Wang , Yun-Bo Zhao , Zerui Li , Wenjun Lv
URL: https://arxiv.org/abs/2509.18152
Abstract:

Well-log interpretation is fundamental for subsurface characterization but remains challenged by heterogeneous tool responses, noisy signals, and limited labels. We propose WLFM, a foundation model pretrained on multi-curve logs from 1200 wells, comprising three stages: tokenization of log patches into geological tokens, self-supervised pretraining with masked-token modeling and stratigraphy-aware contrastive learning, and multi-task adaptation with few-shot fine-tuning. WLFM consistently outperforms state-of-the-art baselines, achieving 0.0041 MSE in porosity estimation and 74.13\% accuracy in lithology classification, while WLFM-Finetune further improves to 0.0038 MSE and 78.10\% accuracy. Beyond predictive accuracy, WLFM exhibits emergent layer-awareness, learns a reusable geological vocabulary, and reconstructs masked curves with reasonable fidelity, though systematic offsets are observed in shallow and ultra-deep intervals. Although boundary detection is not explicitly evaluated here, clustering analyses suggest strong potential for future extension. These results establish WLFM as a scalable, interpretable, and transferable backbone for geological AI, with implications for multi-modal integration of logs, seismic, and textual data.

181. HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork

Authors: Jindi Lv , Yuhao Zhou , Yuxin Tian , Qing Ye , Wentao Feng , Jiancheng Lv
URL: https://arxiv.org/abs/2509.18151
Abstract:

Time-intensive performance evaluations significantly impede progress in Neural Architecture Search (NAS). To address this, neural predictors leverage surrogate models trained on proxy datasets, allowing for direct performance predictions for new architectures. However, these predictors often exhibit poor generalization due to their limited ability to capture intricate relationships among various architectures. In this paper, we propose HyperNAS, a novel neural predictor paradigm for enhancing architecture representation learning. HyperNAS consists of two primary components: a global encoding scheme and a shared hypernetwork. The global encoding scheme is devised to capture the comprehensive macro-structure information, while the shared hypernetwork serves as an auxiliary task to enhance the investigation of inter-architecture patterns. To ensure training stability, we further develop a dynamic adaptive multi-task loss to facilitate personalized exploration on the Pareto front. Extensive experiments across five representative search spaces, including ViTs, demonstrate the advantages of HyperNAS, particularly in few-shot scenarios. For instance, HyperNAS strikes new state-of-the-art results, with 97.60\% top-1 accuracy on CIFAR-10 and 82.4\% top-1 accuracy on ImageNet, using at least 5.0$\times$ fewer samples.

182. Sparse Training Scheme for Multimodal LLM

Authors: Kean Shi , Liang Chen , Haozhe Zhao , Baobao Chang
URL: https://arxiv.org/abs/2509.18150
Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient due to the significantly longer input sequences introduced by multimodal data and the low utilization of inter-layer computations. To address this challenge, we shift the focus to the training process itself and propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: the Visual Token Compressor, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by dynamically skipping unnecessary layers in the language model during both forward and backward passes. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

183. Augmenting Limited and Biased RCTs through Pseudo-Sample Matching-Based Observational Data Fusion Method

Authors: Kairong Han , Weidong Huang , Taiyang Zhou , Peng Zhen , Kun Kuang
URL: https://arxiv.org/abs/2509.18148
Abstract:

In the online ride-hailing pricing context, companies often conduct randomized controlled trials (RCTs) and utilize uplift models to assess the effect of discounts on customer orders, which substantially influences competitive market outcomes. However, due to the high cost of RCTs, the proportion of trial data relative to observational data is small, which only accounts for 0.65\% of total traffic in our context, resulting in significant bias when generalizing to the broader user base. Additionally, the complexity of industrial processes reduces the quality of RCT data, which is often subject to heterogeneity from potential interference and selection bias, making it difficult to correct. Moreover, existing data fusion methods are challenging to implement effectively in complex industrial settings due to the high dimensionality of features and the strict assumptions that are hard to verify with real-world data. To address these issues, we propose an empirical data fusion method called pseudo-sample matching. By generating pseudo-samples from biased, low-quality RCT data and matching them with the most similar samples from large-scale observational data, the method expands the RCT dataset while mitigating its heterogeneity. We validated the method through simulation experiments, conducted offline and online tests using real-world data. In a week-long online experiment, we achieved a 0.41\% improvement in profit, which is a considerable gain when scaled to industrial scenarios with hundreds of millions in revenue. In addition, we discuss the harm to model training, offline evaluation, and online economic benefits when the RCT data quality is not high, and emphasize the importance of improving RCT data quality in industrial scenarios. Further details of the simulation experiments can be found in the GitHub repository this https URL .

184. ConceptFlow: Hierarchical and Fine-grained Concept-Based Explanation for Convolutional Neural Networks

Authors: Xinyu Mu , Hui Dou , Furao Shen , Jian Zhao
URL: https://arxiv.org/abs/2509.18147
Abstract:

Concept-based interpretability for Convolutional Neural Networks (CNNs) aims to align internal model representations with high-level semantic concepts, but existing approaches largely overlook the semantic roles of individual filters and the dynamic propagation of concepts across layers. To address these limitations, we propose ConceptFlow, a concept-based interpretability framework that simulates the internal “thinking path” of a model by tracing how concepts emerge and evolve across layers. ConceptFlow comprises two key components: (i) concept attentions, which associate each filter with relevant high-level concepts to enable localized semantic interpretation, and (ii) conceptual pathways, derived from a concept transition matrix that quantifies how concepts propagate and transform between filters. Together, these components offer a unified and structured view of internal model reasoning. Experimental results demonstrate that ConceptFlow yields semantically meaningful insights into model reasoning, validating the effectiveness of concept attentions and conceptual pathways in explaining decision behavior. By modeling hierarchical conceptual pathways, ConceptFlow provides deeper insight into the internal logic of CNNs and supports the generation of more faithful and human-aligned explanations.

185. Early Prediction of Multi-Label Care Escalation Triggers in the Intensive Care Unit Using Electronic Health Records

Authors: Syed Ahmad Chan Bukhari , Amritpal Singh , Shifath Hossain , Iram Wajahat
URL: https://arxiv.org/abs/2509.18145
Abstract:

Intensive Care Unit (ICU) patients often present with complex, overlapping signs of physiological deterioration that require timely escalation of care. Traditional early warning systems, such as SOFA or MEWS, are limited by their focus on single outcomes and fail to capture the multi-dimensional nature of clinical decline. This study proposes a multi-label classification framework to predict Care Escalation Triggers (CETs), including respiratory failure, hemodynamic instability, renal compromise, and neurological deterioration, using the first 24 hours of ICU data. Using the MIMIC-IV database, CETs are defined through rule-based criteria applied to data from hours 24 to 72 (for example, oxygen saturation below 90, mean arterial pressure below 65 mmHg, creatinine increase greater than 0.3 mg/dL, or a drop in Glasgow Coma Scale score greater than 2). Features are extracted from the first 24 hours and include vital sign aggregates, laboratory values, and static demographics. We train and evaluate multiple classification models on a cohort of 85,242 ICU stays (80 percent training: 68,193; 20 percent testing: 17,049). Evaluation metrics include per-label precision, recall, F1-score, and Hamming loss. XGBoost, the best performing model, achieves F1-scores of 0.66 for respiratory, 0.72 for hemodynamic, 0.76 for renal, and 0.62 for neurologic deterioration, outperforming baseline models. Feature analysis shows that clinically relevant parameters such as respiratory rate, blood pressure, and creatinine are the most influential predictors, consistent with the clinical definitions of the CETs. The proposed framework demonstrates practical potential for early, interpretable clinical alerts without requiring complex time-series modeling or natural language processing.

186. AdaSTI: Conditional Diffusion Models with Adaptive Dependency Modeling for Spatio-Temporal Imputation

Authors: Yubo Yang , Yichen Zhu , Bo Jiang
URL: https://arxiv.org/abs/2509.18144
Abstract:

Spatio-temporal data abounds in domain like traffic and environmental monitoring. However, it often suffers from missing values due to sensor malfunctions, transmission failures, etc. Recent years have seen continued efforts to improve spatio-temporal data imputation performance. Recently diffusion models have outperformed other approaches in various tasks, including spatio-temporal imputation, showing competitive performance. Extracting and utilizing spatio-temporal dependencies as conditional information is vital in diffusion-based methods. However, previous methods introduce error accumulation in this process and ignore the variability of the dependencies in the noisy data at different diffusion steps. In this paper, we propose AdaSTI (Adaptive Dependency Model in Diffusion-based Spatio-Temporal Imputation), a novel spatio-temporal imputation approach based on conditional diffusion model. Inside AdaSTI, we propose a BiS4PI network based on a bi-directional S4 model for pre-imputation with the imputed result used to extract conditional information by our designed Spatio-Temporal Conditionalizer (STC)network. We also propose a Noise-Aware Spatio-Temporal (NAST) network with a gated attention mechanism to capture the variant dependencies across diffusion steps. Extensive experiments on three real-world datasets show that AdaSTI outperforms existing methods in all the settings, with up to 46.4% reduction in imputation error.

187. Weight Mapping Properties of a Dual Tree Single Clock Adiabatic Capacitive Neuron

Authors: Mike Smart , Sachin Maheshwari , Himadri Singh Raghav , Alexander Serb
URL: https://arxiv.org/abs/2509.18143
Abstract:

Dual Tree Single Clock (DTSC) Adiabatic Capacitive Neuron (ACN) circuits offer the potential for highly energy-efficient Artificial Neural Network (ANN) computation in full custom analog IC designs. The efficient mapping of Artificial Neuron (AN) abstract weights, extracted from the software-trained ANNs, onto physical ACN capacitance values has, however, yet to be fully researched. In this paper, we explore the unexpected hidden complexities, challenges and properties of the mapping, as well as, the ramifications for IC designers in terms accuracy, design and implementation. We propose an optimal, AN to ACN methodology, that promotes smaller chip sizes and improved overall classification accuracy, necessary for successful practical deployment. Using TensorFlow and Larq software frameworks, we train three different ANN networks and map their weights into the energy-efficient DTSC ACN capacitance value domain to demonstrate 100% functional equivalency. Finally, we delve into the impact of weight quantization on ACN performance using novel metrics related to practical IC considerations, such as IC floor space and comparator decision-making efficacy.

188. KM-GPT: An Automated Pipeline for Reconstructing Individual Patient Data from Kaplan-Meier Plots

Authors: Yao Zhao , Haoyue Sun , Yantian Ding , Yanxun Xu
URL: https://arxiv.org/abs/2509.18141
Abstract:

Reconstructing individual patient data (IPD) from Kaplan-Meier (KM) plots provides valuable insights for evidence synthesis in clinical research. However, existing approaches often rely on manual digitization, which is error-prone and lacks scalability. To address these limitations, we develop KM-GPT, the first fully automated, AI-powered pipeline for reconstructing IPD directly from KM plots with high accuracy, robustness, and reproducibility. KM-GPT integrates advanced image preprocessing, multi-modal reasoning powered by GPT-5, and iterative reconstruction algorithms to generate high-quality IPD without manual input or intervention. Its hybrid reasoning architecture automates the conversion of unstructured information into structured data flows and validates data extraction from complex KM plots. To improve accessibility, KM-GPT is equipped with a user-friendly web interface and an integrated AI assistant, enabling researchers to reconstruct IPD without requiring programming expertise. KM-GPT was rigorously evaluated on synthetic and real-world datasets, consistently demonstrating superior accuracy. To illustrate its utility, we applied KM-GPT to a meta-analysis of gastric cancer immunotherapy trials, reconstructing IPD to facilitate evidence synthesis and biomarker-based subgroup analyses. By automating traditionally manual processes and providing a scalable, web-based solution, KM-GPT transforms clinical research by leveraging reconstructed IPD to enable more informed downstream analyses, supporting evidence-based decision-making.

189. A Machine Learning Framework for Pathway-Driven Therapeutic Target Discovery in Metabolic Disorders

Authors: Iram Wajahat , Amritpal Singh , Fazel Keshtkar , Syed Ahmad Chan Bukhari
URL: https://arxiv.org/abs/2509.18140
Abstract:

Metabolic disorders, particularly type 2 diabetes mellitus (T2DM), represent a significant global health burden, disproportionately impacting genetically predisposed populations such as the Pima Indians (a Native American tribe from south central Arizona). This study introduces a novel machine learning (ML) framework that integrates predictive modeling with gene-agnostic pathway mapping to identify high-risk individuals and uncover potential therapeutic targets. Using the Pima Indian dataset, logistic regression and t-tests were applied to identify key predictors of T2DM, yielding an overall model accuracy of 78.43%. To bridge predictive analytics with biological relevance, we developed a pathway mapping strategy that links identified predictors to critical signaling networks, including insulin signaling, AMPK, and PPAR pathways. This approach provides mechanistic insights without requiring direct molecular data. Building upon these connections, we propose therapeutic strategies such as dual GLP-1/GIP receptor agonists, AMPK activators, SIRT1 modulators, and phytochemical, further validated through pathway enrichment analyses. Overall, this framework advances precision medicine by offering interpretable and scalable solutions for early detection and targeted intervention in metabolic disorders. The key contributions of this work are: (1) development of an ML framework combining logistic regression and principal component analysis (PCA) for T2DM risk prediction; (2) introduction of a gene-agnostic pathway mapping approach to generate mechanistic insights; and (3) identification of novel therapeutic strategies tailored for high-risk populations.

190. LoRALib: A Standardized Benchmark for Evaluating LoRA-MoE Methods

Authors: Shaoheng Wang , Yao Lu , Yuqi Li , Yaxin Gao , Jiaqi Nie , Shanqing Yu , Yingli Tian , Qi Xuan
URL: https://arxiv.org/abs/2509.18137
Abstract:

As a parameter efficient fine-tuning (PEFT) method, low-rank adaptation (LoRA) can save significant costs in storage and computing, but its strong adaptability to a single task is often accompanied by insufficient cross-task generalization capabilities. To improve this, existing work combines LoRA with mixture-of-experts (MoE) to enhance the model’s adaptability through expert modules and routing mechanisms. However, existing LoRA-MoE methods lack unified standards in models, datasets, hyperparameters, and evaluation methods, making it difficult to conduct fair comparisons between different methods. To this end, we proposed a unified benchmark named LoRALib. Specifically, we standardized datasets from $40$ downstream tasks into a unified format, fine-tuned them using the same hyperparameters and obtained $680$ LoRA modules across $17$ model architectures. Based on this LoRA library, we conduct large-scale experiments on $3$ representative LoRA-MoE methods and different LoRA selection mechanisms using the open-sourced testing tool OpenCompass. Extensive experiments show that LoRAMoE performs best, and that prioritizing LoRAs relevant to the target task can further improve the performance of MoE. We hope these findings will inspire future work. Our datasets and LoRA library are available at this https URL and this https URL .

191. From Parameters to Performance: A Data-Driven Study on LLM Structure and Development

Authors: Suqing Wang , Zuchao Li , Luohe Shi , Bo Du , Hai Zhao , Yun Li , Qianren Wang
URL: https://arxiv.org/abs/2509.18136
Abstract:

Large language models (LLMs) have achieved remarkable success across various domains, driving significant technological advancements and innovations. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. To address this gap, we present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks. Leveraging this dataset, we conduct a systematic, data mining-driven analysis to validate and quantify the relationship between structural configurations and performance. Our study begins with a review of the historical development of LLMs and an exploration of potential future trends. We then analyze how various structural choices impact performance across benchmarks and further corroborate our findings using mechanistic interpretability techniques. By providing data-driven insights into LLM optimization, our work aims to guide the targeted development and application of future models. We will release our dataset at this https URL

192. SDGF: Fusing Static and Multi-Scale Dynamic Correlations for Multivariate Time Series Forecasting

Authors: Shaoxun Wang , Xingjun Zhang , Qianyang Li , Jiawei Cao , Zhendong Tan
URL: https://arxiv.org/abs/2509.18135
Abstract:

Inter-series correlations are crucial for accurate multivariate time series forecasting, yet these relationships often exhibit complex dynamics across different temporal scales. Existing methods are limited in modeling these multi-scale dependencies and struggle to capture their intricate and evolving nature. To address this challenge, this paper proposes a novel Static-Dynamic Graph Fusion network (SDGF), whose core lies in capturing multi-scale inter-series correlations through a dual-path graph structure learning approach. Specifically, the model utilizes a static graph based on prior knowledge to anchor long-term, stable dependencies, while concurrently employing Multi-level Wavelet Decomposition to extract multi-scale features for constructing an adaptively learned dynamic graph to capture associations at different scales. We design an attention-gated module to fuse these two complementary sources of information intelligently, and a multi-kernel dilated convolutional network is then used to deepen the understanding of temporal patterns. Comprehensive experiments on multiple widely used real-world benchmark datasets demonstrate the effectiveness of our proposed model.

193. Self-Evolving LLMs via Continual Instruction Tuning

Authors: Le Huang , Jiazheng Kang , Cheng Hou , Zhe Zhao , Zhenxiang Yan , Chuan Shi , Ting Bai
URL: https://arxiv.org/abs/2509.18133
Abstract:

In real-world industrial settings, large language models (LLMs) must learn continually to keep pace with diverse and evolving tasks, requiring self-evolution to refine knowledge under dynamic data distributions. However, existing continual learning (CL) approaches, such as replay and parameter isolation, often suffer from catastrophic forgetting: training on new tasks degrades performance on earlier ones by overfitting to the new distribution and weakening this http URL propose MoE-CL, a parameter-efficient adversarial mixture-of-experts framework for industrial-scale, self-evolving continual instruction tuning of LLMs. MoE-CL uses a dual-expert design: (1) a dedicated LoRA expert per task to preserve task-specific knowledge via parameter independence, mitigating forgetting; and (2) a shared LoRA expert to enable cross-task transfer. To prevent transferring task-irrelevant noise through the shared pathway, we integrate a task-aware discriminator within a GAN. The discriminator encourages the shared expert to pass only task-aligned information during sequential training. Through adversarial learning, the shared expert acquires generalized representations that mimic the discriminator, while dedicated experts retain task-specific details, balancing knowledge retention and cross-task generalization and thereby supporting this http URL experiments on the public MTL5 benchmark and an industrial Tencent3 benchmark validate the effectiveness of MoE-CL for continual instruction tuning. In real-world A/B testing for content compliance review on the Tencent Video platform, MoE-CL reduced manual review costs by 15.3%. These results demonstrate that MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical.

194. Two ways to knowledge?

Authors: Jean-Michel Tucny , Abhisek Ganguly , Santosh Ansumali , Sauro Succi
URL: https://arxiv.org/abs/2509.18131
Abstract:

It is shown that the weight matrices of transformer-based machine learning applications to the solution of two representative physical applications show a random-like character which bears no directly recognizable link to the physical and mathematical structure of the physical problem under study. This suggests that machine learning and the scientific method may represent two distinct and potentially complementary paths to knowledge, even though a strict notion of explainability in terms of direct correspondence between network parameters and physical structures may remain out of reach. It is also observed that drawing a parallel between transformer operation and (generalized) path-integration techniques may account for the random-like nature of the weights, but still does not resolve the tension with explainability. We conclude with some general comments on the hazards of gleaning knowledge without the benefit of Insight.

195. Research on Metro Transportation Flow Prediction Based on the STL-GRU Combined Model

Authors: Zijie Zhou , Huichen Ma
URL: https://arxiv.org/abs/2509.18130
Abstract:

In the metro intelligent transportation system, accurate transfer passenger flow prediction is a key link in optimizing operation plans and improving transportation efficiency. To further improve the theory of metro internal transfer passenger flow prediction and provide more reliable support for intelligent operation decisions, this paper innovatively proposes a metro transfer passenger flow prediction model that integrates the Seasonal and Trend decomposition using Loess (STL) method and Gated Recurrent Unit (GRU).In practical application, the model first relies on the deep learning library Keras to complete the construction and training of the GRU model, laying the foundation for subsequent prediction; then preprocesses the original metro card swiping data, uses the graph-based depth-first search algorithm to identify passengers’ travel paths, and further constructs the transfer passenger flow time series; subsequently adopts the STL time series decomposition algorithm to decompose the constructed transfer passenger flow time series into trend component, periodic component and residual component, and uses the 3{\sigma} principle to eliminate and fill the outliers in the residual component, and finally completes the transfer passenger flow this http URL the transfer passenger flow data of a certain metro station as the research sample, the validity of the model is verified. The results show that compared with Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and the combined model of STL time series decomposition method and Long Short-Term Memory (STL-LSTM), the STL-GRU combined prediction model significantly improves the prediction accuracy of transfer passenger flow on weekdays (excluding Fridays), Fridays and rest days, with the mean absolute percentage error (MAPE) of the prediction results reduced by at least 2.3, 1.36 and 6.42 percentage points respectively.

196. Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Authors: Jiaqi Weng , Han Zheng , Hanyu Zhang , Qinqin He , Jialing Tao , Hui Xue , Zhixuan Chu , Xiting Wang
URL: https://arxiv.org/abs/2509.18127
Abstract:

Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting their ability to ad- dress broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research to clarify model behavior by explaining single-meaning atomic features decomposed from entangled signals. jHowever, prior applications on SAEs do not interpret features with fine-grained safety-related con- cepts, thus inadequately addressing safety-critical behaviors, such as generating toxic responses and violating safety regu- lations. For rigorous safety analysis, we must extract a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet face two challenges: identifying SAEs with the greatest potential for generating safety concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we pro- pose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies SAE with best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the in- terpretation process. We will release a comprehensive toolkit including SAE checkpoints and human-readable neuron ex- planations, which supports empirical analysis of safety risks to promote research on LLM safety.

197. Anomaly Detection in Electric Vehicle Charging Stations Using Federated Learning

Authors: Bishal K C , Amr Hilal , Pawan Thapa
URL: https://arxiv.org/abs/2509.18126
Abstract:

Federated Learning (FL) is a decentralized training framework widely used in IoT ecosystems that preserves privacy by keeping raw data local, making it ideal for IoT-enabled cyber-physical systems with sensing and communication like Smart Grids (SGs), Connected and Automated Vehicles (CAV), and Electric Vehicle Charging Stations (EVCS). With the rapid expansion of electric vehicle infrastructure, securing these IoT-based charging stations against cyber threats has become critical. Centralized Intrusion Detection Systems (IDS) raise privacy concerns due to sensitive network and user data, making FL a promising alternative. However, current FL-based IDS evaluations overlook practical challenges such as system heterogeneity and non-IID data. To address these challenges, we conducted experiments to evaluate the performance of federated learning for anomaly detection in EV charging stations under system and data heterogeneity. We used FedAvg and FedAvgM, widely studied optimization approaches, to analyze their effectiveness in anomaly detection. Under IID settings, FedAvg achieves superior performance to centralized models using the same neural network. However, performance degrades with non-IID data and system heterogeneity. FedAvgM consistently outperforms FedAvg in heterogeneous settings, showing better convergence and higher anomaly detection accuracy. Our results demonstrate that FL can handle heterogeneity in IoT-based EVCS without significant performance loss, with FedAvgM as a promising solution for robust, privacy-preserving EVCS security.

198. NurseSchedRL: Attention-Guided Reinforcement Learning for Nurse-Patient Assignment

Authors: Harsha Koduri
URL: https://arxiv.org/abs/2509.18125
Abstract:

Healthcare systems face increasing pressure to allocate limited nursing resources efficiently while accounting for skill heterogeneity, patient acuity, staff fatigue, and continuity of care. Traditional optimization and heuristic scheduling methods struggle to capture these dynamic, multi-constraint environments. I propose NurseSchedRL, a reinforcement learning framework for nurse-patient assignment that integrates structured state encoding, constrained action masking, and attention-based representations of skills, fatigue, and geographical context. NurseSchedRL uses Proximal Policy Optimization (PPO) with feasibility masks to ensure assignments respect real-world constraints, while dynamically adapting to patient arrivals and varying nurse availability. In simulation with realistic nurse and patient data, NurseSchedRL achieves improved scheduling efficiency, better alignment of skills to patient needs, and reduced fatigue compared to baseline heuristic and unconstrained RL approaches. These results highlight the potential of reinforcement learning for decision support in complex, high-stakes healthcare workforce management.

199. A Coopetitive-Compatible Data Generation Framework for Cross-silo Federated Learning

Authors: Thanh Linh Nguyen , Quoc-Viet Pham
URL: https://arxiv.org/abs/2509.18120
Abstract:

Cross-silo federated learning (CFL) enables organizations (e.g., hospitals or banks) to collaboratively train artificial intelligence (AI) models while preserving data privacy by keeping data local. While prior work has primarily addressed statistical heterogeneity across organizations, a critical challenge arises from economic competition, where organizations may act as market rivals, making them hesitant to participate in joint training due to potential utility loss (i.e., reduced net benefit). Furthermore, the combined effects of statistical heterogeneity and inter-organizational competition on organizational behavior and system-wide social welfare remain underexplored. In this paper, we propose CoCoGen, a coopetitive-compatible data generation framework, leveraging generative AI (GenAI) and potential game theory to model, analyze, and optimize collaborative learning under heterogeneous and competitive settings. Specifically, CoCoGen characterizes competition and statistical heterogeneity through learning performance and utility-based formulations and models each training round as a weighted potential game. We then derive GenAI-based data generation strategies that maximize social welfare. Experimental results on the Fashion-MNIST dataset reveal how varying heterogeneity and competition levels affect organizational behavior and demonstrate that CoCoGen consistently outperforms baseline methods.

200. MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents

Authors: Yifan Xu , Xiao Liu , Xinghan Liu , Jiaqi Fu , Hanchen Zhang , Bohao Jing , Shudan Zhang , Yuting Wang , Wenyi Zhao , Yuxiao Dong
URL: https://arxiv.org/abs/2509.18119
Abstract:

Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present an online agentic reinforcement learning framework MOBILERL to enhance GUI agents in mobile environments. Its core component is the Difficulty-Adaptive GRPO (ADAGRPO) algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We introduce the shortest path reward adjustment strategy to reshape rewards concerning the task length in multi-turn agentic tasks. Those strategies jointly stabilize RL training, improve sample efficiency, and generate strong performance across diverse mobile apps and tasks. We apply MOBILERL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resultant MOBILERL-9B model achieves state-of-the-art results in terms of success rates on both AndroidWorld (75.8%) and AndroidLab (46.8%). The MOBILERL framework is adopted in the AutoGLM products, and also open-sourced at this https URL .

201. Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

Authors: Nathan Egbuna , Saatvik Gaur , Sunishchal Dev , Ashwinee Panda , Maheep Chaudhary
URL: https://arxiv.org/abs/2509.18116
Abstract:

Test-time optimization remains impractical at scale due to prohibitive inference costs\textemdash techniques like iterative refinement and multi-step verification can require $10$–$100\times$ more compute per query than standard decoding. Latent space test-time optimization methods like LatentSeek offer a more direct approach by steering hidden representations, but still demand expensive per-query optimization loops with multiple backward passes. We propose Amortized Latent Steering (ALS), which collapses this iterative optimization into a single offline-computed vector applied at constant cost during inference. ALS computes the mean difference between hidden states from successful versus unsuccessful generations, then uses this direction to calibrate the model’s hidden representations: when decoding drifts away from the success manifold, ALS nudges activations back toward it. Across GSM8K and MATH-$500$ benchmarks, ALS achieves $2$–$5\times$ speedup over iterative methods while matching or surpassing greedy Chain-of-Thought (CoT) and Self-Consistency baselines, yielding up to 101\% improvement in efficiency–accuracy trade-off. These results show that much of latent optimization’s benefit can be captured offline, making sophisticated reasoning techniques viable for production deployment. Code is available at~\href{ this https URL }{ this https URL }

202. Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Authors: Faizul Rakib Sayem , Shahana Ibrahim
URL: https://arxiv.org/abs/2509.18111
Abstract:

The reliability of artificial intelligence (AI) systems in open-world settings depends heavily on their ability to flag out-of-distribution (OOD) inputs unseen during training. Recent advances in large-scale vision-language models (VLMs) have enabled promising few-shot OOD detection frameworks using only a handful of in-distribution (ID) samples. However, existing prompt learning-based OOD methods rely solely on softmax probabilities, overlooking the rich discriminative potential of the feature embeddings learned by VLMs trained on millions of samples. To address this limitation, we propose a novel context optimization (CoOp)-based framework that integrates subspace representation learning with prompt tuning. Our approach improves ID-OOD separability by projecting the ID features into a subspace spanned by prompt vectors, while projecting ID-irrelevant features into an orthogonal null space. To train such OOD detection framework, we design an easy-to-handle end-to-end learning criterion that ensures strong OOD detection performance as well as high ID classification accuracy. Experiments on real-world datasets showcase the effectiveness of our approach.

203. Solve it with EASE

Authors: Adam Viktorin , Tomas Kadavy , Jozef Kovac , Michal Pluhacek , Roman Senkerik
URL: https://arxiv.org/abs/2509.18108
Abstract:

This paper presents EASE (Effortless Algorithmic Solution Evolution), an open-source and fully modular framework for iterative algorithmic solution generation leveraging large language models (LLMs). EASE integrates generation, testing, analysis, and evaluation into a reproducible feedback loop, giving users full control over error handling, analysis, and quality assessment. Its architecture supports the orchestration of multiple LLMs in complementary roles-such as generator, analyst, and evaluator. By abstracting the complexity of prompt design and model management, EASE provides a transparent and extensible platform for researchers and practitioners to co-design algorithms and other generative solutions across diverse domains.

204. BULL-ODE: Bullwhip Learning with Neural ODEs and Universal Differential Equations under Stochastic Demand

Authors: Nachiket N. Naik , Prathamesh Dinesh Joshi , Raj Abhijit Dandekar , Rajat Dandekar , Sreedath Panat
URL: https://arxiv.org/abs/2509.18105
Abstract:

We study learning of continuous-time inventory dynamics under stochastic demand and quantify when structure helps or hurts forecasting of the bullwhip effect. BULL-ODE compares a fully learned Neural ODE (NODE) that models the entire right-hand side against a physics-informed Universal Differential Equation (UDE) that preserves conservation and order-up-to structure while learning a small residual policy term. Classical supply chain models explain the bullwhip through control/forecasting choices and information sharing, while recent physics-informed and neural differential equation methods blend domain constraints with learned components. It is unclear whether structural bias helps or hinders forecasting under different demand regimes. We address this by using a single-echelon testbed with three demand regimes - AR(1) (autocorrelated), i.i.d. Gaussian, and heavy-tailed lognormal. Training is done on varying fractions of each trajectory, followed by evaluation of multi-step forecasts for inventory I, order rate O, and demand D. Across the structured regimes, UDE consistently generalizes better: with 90% of the training horizon, inventory RMSE drops from 4.92 (NODE) to 0.26 (UDE) under AR(1) and from 5.96 to 0.95 under Gaussian demand. Under heavy-tailed lognormal shocks, the flexibility of NODE is better. These trends persist as train18 ing data shrinks, with NODE exhibiting phase drift in extrapolation while UDE remains stable but underreacts to rare spikes. Our results provide concrete guidance: enforce structure when noise is light-tailed or temporally correlated; relax structure when extreme events dominate. Beyond inventory control, the results offer guidance for hybrid modeling in scientific and engineering systems: enforce known structure when conservation laws and modest noise dominate, and relax structure to capture extremes in settings where rare events drive dynamics.

205. Data Valuation and Selection in a Federated Model Marketplace

Authors: Wenqian Li , Youjia Yang , Ruoxi Jia , Yan Pang
URL: https://arxiv.org/abs/2509.18104
Abstract:

In the era of Artificial Intelligence (AI), marketplaces have become essential platforms for facilitating the exchange of data products to foster data sharing. Model transactions provide economic solutions in data marketplaces that enhance data reusability and ensure the traceability of data ownership. To establish trustworthy data marketplaces, Federated Learning (FL) has emerged as a promising paradigm to enable collaborative learning across siloed datasets while safeguarding data privacy. However, effective data valuation and selection from heterogeneous sources in the FL setup remain key challenges. This paper introduces a comprehensive framework centered on a Wasserstein-based estimator tailored for FL. The estimator not only predicts model performance across unseen data combinations but also reveals the compatibility between data heterogeneity and FL aggregation algorithms. To ensure privacy, we propose a distributed method to approximate Wasserstein distance without requiring access to raw data. Furthermore, we demonstrate that model performance can be reliably extrapolated under the neural scaling law, enabling effective data selection without full-scale training. Extensive experiments across diverse scenarios, such as label skew, mislabeled, and unlabeled sources, show that our approach consistently identifies high-performing data combinations, paving the way for more reliable FL-based model marketplaces.

전체 AI 논문 - 2025-09-24

1. Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World

2. AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration

3. Code Driven Planning with Domain-Adaptive Critic

4. Towards Causal Representation Learning with Observable Sources as Auxiliaries

5. Landmarks, Monuments, and Beacons: Understanding Generative Calls to Action

6. Remaining Time Prediction in Outbound Warehouse Processes: A Case Study (Short Paper)

7. From latent factors to language: a user study on LLM-generated explanations for an inherently interpretable matrix-based recommender system

8. LLM-based Agents Suffer from Hallucinations: A Survey of Taxonomy, Methods, and Directions

9. Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning

10. How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

11. LongCat-Flash-Thinking Technical Report

12. Memory in Large Language Models: Mechanisms, Evaluation and Evolution

13. Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling

14. MAPO: Mixed Advantage Policy Optimization

15. Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM-as-Judge evaluation, redundancy-aware sampling, and section-aware fine-tuning

16. Bounded PCTL Model Checking of Large Language Model Outputs

17. The AGNTCY Agent Directory Service: Architecture and Implementation

18. Experience Scaling: Post-Deployment Evolution For Large Language Models

19. Autonomous Data Agents: A New Opportunity for Smart Data

20. Advances in Large Language Models for Medicine

21. Implementation of airborne ML models with semantics preservation

22. TERAG: Token-Efficient Graph-Based Retrieval-Augmented Generation

23. Adaptive Learning in Spatial Agent-Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents

24. Solving Math Word Problems Using Estimation Verification and Equation Generation

25. LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

26. FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule Reasoning

27. Memory-QA: Answering Recall Questions Based on Multimodal Memories

28. Instruction-Following Evaluation in Function Calling for Large Language Models

29. ATLAS: Benchmarking and Adapting LLMs for Global Trade via Harmonized Tariff Code Classification

30. Gödel Test: Can Large Language Models Solve Easy Conjectures?

31. Evaluating the Safety and Skill Reasoning of Large Reasoning Models Under Compute Constraints

32. The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks

33. Towards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces

34. An N-Plus-1 GPT Agency for Critical Solution of Mechanical Engineering Analysis Problems

35. From “What to Eat?” to Perfect Recipe: ChefMind’s Chain-of-Exploration for Ambiguous User Intent in Recipe Recommendation

36. Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models

37. Similarity Field Theory: A Mathematical Framework for Intelligence

38. nDNA – the Semantic Helix of Artificial Cognition

39. Change in Quantitative Bipolar Argumentation: Sufficient, Necessary, and Counterfactual Explanations

40. MMCD: Multi-Modal Collaborative Decision-Making for Connected Autonomy with Knowledge Distillation

41. An Outcome-Based Educational Recommender System

42. Synthesizing Attitudes, Predicting Actions (SAPA): Behavioral Theory-Guided LLMs for Ridesourcing Mode Choice Modeling

43. Large Language Models and Operations Research: A Structured Survey

44. Foam-Agent: An End-to-End Composable Multi-Agent Framework for Automating CFD Simulation in OpenFOAM

45. HSGM: Hierarchical Segment-Graph Memory for Scalable Long-Text Semantics

46. Position Paper: Integrating Explainability and Uncertainty Estimation in Medical AI

47. SPADE: A Large Language Model Framework for Soil Moisture Pattern Recognition and Anomaly Detection in Precision Agriculture

48. A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services

49. Audio-Based Pedestrian Detection in the Presence of Vehicular Noise

50. SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration

51. MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multilesion interactive segmentation of neurobromas in whole-body MRI

52. WolBanking77: Wolof Banking Speech Intent Classification Dataset

53. SloPalSpeech: A 2,8000-Hour Slovak Speech Corpus from Parliamentary Data

54. Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

55. Reinforcement Learning on Pre-Training Data

56. Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

57. MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

58. Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction

59. FedFusion: Federated Learning with Diversity- and Cluster-Aware Encoders for Robust Adaptation under Label Scarcity

60. HyKid: An Open MRI Dataset with Expert-Annotated Multi-Structure and Choroid Plexus in Pediatric Hydrocephalus

61. Steering Multimodal Large Language Models Decoding for Context-Aware Safety

62. YAC: Bridging Natural Language and Interactive Visual Exploration with Generative AI for Biomedical Data Discovery

63. Soft Tokens, Hard Truths

64. RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions

65. Generative Propaganda

66. Anecdoctoring: Automated Red-Teaming Across Language and Place

67. On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

68. GSTM-HMU: Generative Spatio-Temporal Modeling for Human Mobility Understanding

69. Analysis on distribution and clustering of weight

70. FedFiTS: Fitness-Selected, Slotted Client Scheduling for Trustworthy Federated Learning in Healthcare AI

71. Towards Practical Multi-label Causal Discovery in High-Dimensional Event Sequences via One-Shot Graph Aggregation

72. FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

73. Algorithms for Adversarially Robust Deep Learning

74. Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

75. Training Flow Matching Models with Reliable Labels via Self-Purification

76. Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

77. A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement

78. Graph Neural Networks with Similarity-Navigated Probabilistic Feature Copying

79. World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation