전체 AI 논문 - 2025-11-06
1. Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
- Authors: Huawei Lin , Yunzhi Shi , Tong Geng , Weijie Zhao , Wei Wang , Ravender Pal Singh
- URL: https://arxiv.org/abs/2511.02834
- Abstract:
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
2. Neurosymbolic Deep Learning Semantics
- Authors: Artur d’Avila Garcez , Simon Odense
- URL: https://arxiv.org/abs/2511.02825
- Abstract:
Artificial Intelligence (AI) is a powerful new language of science as evidenced by recent Nobel Prizes in chemistry and physics that recognized contributions to AI applied to those areas. Yet, this new language lacks semantics, which makes AI’s scientific discoveries unsatisfactory at best. With the purpose of uncovering new facts but also improving our understanding of the world, AI-based science requires formalization through a framework capable of translating insight into comprehensible scientific knowledge. In this paper, we argue that logic offers an adequate framework. In particular, we use logic in a neurosymbolic framework to offer a much needed semantics for deep learning, the neural network-based technology of current AI. Deep learning and neurosymbolic AI lack a general set of conditions to ensure that desirable properties are satisfied. Instead, there is a plethora of encoding and knowledge extraction approaches designed for particular cases. To rectify this, we introduced a framework for semantic encoding, making explicit the mapping between neural networks and logic, and characterizing the common ingredients of the various existing approaches. In this paper, we describe succinctly and exemplify how logical semantics and neural networks are linked through this framework, we review some of the most prominent approaches and techniques developed for neural encoding and knowledge extraction, provide a formal definition of our framework, and discuss some of the difficulties of identifying a semantic encoding in practice in light of analogous problems in the philosophy of mind.
3. Kosmos: An AI Scientist for Autonomous Discovery
- Authors: Ludovico Mitchener , Angela Yiu , Benjamin Chang , Mathieu Bourdenx , Tyler Nadolski , Arvis Sulovari , Eric C. Landsness , Daniel L. Barabasi , Siddharth Narayanan , Nicky Evans , Shriya Reddy , Martha Foiani , Aizad Kamal , Leah P. Shriver , Fang Cao , Asmamaw T. Wassie , Jon M. Laurent , Edwin Melville-Green , Mayk Caldas , Albert Bou , Kaleigh F. Roberts , Sladjana Zagorac , Timothy C. Orr , Miranda E. Orr , Kevin J. Zwezdaryk , Ali E. Ghareeb , Laurie McCoy , Bruna Gomes , Euan A. Ashley , Karen E. Duff , Tonio Buonassisi , Tom Rainforth , Randall J. Bateman , Michael Skarlinski , Samuel G. Rodriques , Michaela M. Hinks , Andrew D. White
- URL: https://arxiv.org/abs/2511.02824
- Abstract:
Data-driven scientific discovery requires iterative cycles of literature search, hypothesis generation, and data analysis. Substantial progress has been made towards AI agents that can automate scientific research, but all such agents remain limited in the number of actions they can take before losing coherence, thus limiting the depth of their findings. Here we present Kosmos, an AI scientist that automates data-driven discovery. Given an open-ended objective and a dataset, Kosmos runs for up to 12 hours performing cycles of parallel data analysis, literature search, and hypothesis generation before synthesizing discoveries into scientific reports. Unlike prior systems, Kosmos uses a structured world model to share information between a data analysis agent and a literature search agent. The world model enables Kosmos to coherently pursue the specified objective over 200 agent rollouts, collectively executing an average of 42,000 lines of code and reading 1,500 papers per run. Kosmos cites all statements in its reports with code or primary literature, ensuring its reasoning is traceable. Independent scientists found 79.4% of statements in Kosmos reports to be accurate, and collaborators reported that a single 20-cycle Kosmos run performed the equivalent of 6 months of their own research time on average. Furthermore, collaborators reported that the number of valuable scientific findings generated scales linearly with Kosmos cycles (tested up to 20 cycles). We highlight seven discoveries made by Kosmos that span metabolomics, materials science, neuroscience, and statistical genetics. Three discoveries independently reproduce findings from preprinted or unpublished manuscripts that were not accessed by Kosmos at runtime, while four make novel contributions to the scientific literature.
4. Optimizing AI Agent Attacks With Synthetic Data
- Authors: Chloe Loughridge , Paul Colognese , Avery Griffin , Tyler Tracy , Jon Kutasov , Joe Benton
- URL: https://arxiv.org/abs/2511.02823
- Abstract:
As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills – suspicion modeling, attack selection, plan synthesis, execution and subtlety – and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
5. Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning
- Authors: Mohamed Bouadi , Pratinav Seth , Aditya Tanna , Vinay Kumar Sankarapu
- URL: https://arxiv.org/abs/2511.02818
- Abstract:
Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at this https URL .
6. When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning
- Authors: Chenyu Zhang , Minsol Kim , Shohreh Ghorbani , Jingyao Wu , Rosalind Picard , Patricia Maes , Paul Pu Liang
- URL: https://arxiv.org/abs/2511.02794
- Abstract:
Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.
7. LLM-Supported Formal Knowledge Representation for Enhancing Control Engineering Content with an Interactive Semantic Layer
- Authors: Julius Fiedler (1), Carsten Knoll (2), Klaus Röbenack (1) ((1) Institute of Control Theory at TU Dresden, (2) Chair of Fundamentals of Electrical Engineering at TU Dresden)
- URL: https://arxiv.org/abs/2511.02759
- Abstract:
The rapid growth of research output in control engineering calls for new approaches to structure and formalize domain knowledge. This paper briefly describes an LLM-supported method for semi-automated generation of formal knowledge representations that combine human readability with machine interpretability and increased expressiveness. Based on the Imperative Representation of Knowledge (PyIRK) framework, we demonstrate how language models can assist in transforming natural-language descriptions and mathematical definitions (available as LaTeX source code) into a formalized knowledge graph. As a first application we present the generation of an ``interactive semantic layer’’ to enhance the source documents in order to facilitate knowledge transfer. From our perspective this contributes to the vision of easily accessible, collaborative, and verifiable knowledge bases for the control engineering domain.
8. Using Span Queries to Optimize for Cache and Attention Locality
- Authors: Paul Castro , Nick Mitchell , Nathan Ordonez , Thomas Parnell , Mudhakar Srivatsa , Antoni Viros i Martin
- URL: https://arxiv.org/abs/2511.02749
- Abstract:
Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, they offer solutions that are also optimized for a single use case, RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show how the critical distinction that had been assumed by prior work lies in whether the order of the inputs matter – do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2b parameter model vastly outperforms the accuracy of a stock inference server using an 8b model.
9. CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
- Authors: Jiayu Liu , Cheng Qian , Zhaochen Su , Qing Zong , Shijue Huang , Bingxiang He , Yi R. Fung
- URL: https://arxiv.org/abs/2511.02734
- Abstract:
Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents’ ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents’ economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
10. The Collaboration Gap
- Authors: Tim R. Davidson , Adam Fourney , Saleema Amershi , Robert West , Eric Horvitz , Ece Kamar
- URL: https://arxiv.org/abs/2511.02687
- Abstract:
The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a “collaboration gap”: models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a “relay inference” approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents’ latent skills, guidance that applies to AI-AI and human-AI collaboration.
11. DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
- Authors: Lachlan McPheat , Navdeep Kaur , Robert Blackwell , Alessandra Russo , Anthony G. Cohn , Pranava Madhyastha
- URL: https://arxiv.org/abs/2511.02627
- Abstract:
We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner which makes it is correct by construction, which is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs) where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
12. A Multi-Agent Psychological Simulation System for Human Behavior Modeling
- Authors: Xiangen Hu , Jiarui Tong , Sheng Xu
- URL: https://arxiv.org/abs/2511.02606
- Abstract:
Training and education in human-centered fields require authentic practice, yet realistic simulations of human behavior have remained limited. We present a multi-agent psychological simulation system that models internal cognitive-affective processes to generate believable human behaviors. In contrast to black-box neural models, this system is grounded in established psychological theories (e.g., self-efficacy, mindset, social constructivism) and explicitly simulates an ``inner parliament’’ of agents corresponding to key psychological factors. These agents deliberate and interact to determine the system’s output behavior, enabling unprecedented transparency and alignment with human psychology. We describe the system’s architecture and theoretical foundations, illustrate its use in teacher training and research, and discuss how it embodies principles of social learning, cognitive apprenticeship, deliberate practice, and meta-cognition.
13. Adaptive GR(1) Specification Repair for Liveness-Preserving Shielding in Reinforcement Learning
- Authors: Tiberiu-Andrei Georgescu , Alexander W. Goodall , Dalal Alrajeh , Francesco Belardinelli , Sebastian Uchitel
- URL: https://arxiv.org/abs/2511.02605
- Abstract:
Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent’s actions remain compliant with formal specifications. Classical shielding approaches, however, are often static, in the sense that they assume fixed logical specifications and hand-crafted abstractions. While these static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop the first adaptive shielding framework - to the best of our knowledge - based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online, in a systematic and interpretable way. This ensures that the shield evolves gracefully, ensuring liveness is achievable and weakening goals only when necessary. We consider two case studies: Minepump and Atari Seaquest; showing that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance compared with static shields.
14. The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models
- Authors: Claudia Herambourg , Dawid Siuda , Anna Szczepanek , Julia Kopczyńska , Joao R. L. Santos , Wojciech Sas , Joanna Śmietańska-Nowak
- URL: https://arxiv.org/abs/2511.02589
- Abstract:
We present ORCA (Omni Research on Calculation in AI) Benchmark – a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni’s calculator engine. In 500 natural-language tasks across domains such as finance, physics, health, and statistics, the five state-of-the-art systems (ChatGPT-5, Gemini~2.5~Flash, Claude~Sonnet~4.5, Grok~4, and DeepSeek~V3.2) achieved only $45\text{–}63\,\%$ accuracy, with errors mainly related to rounding ($35\,\%$) and calculation mistakes ($33\,\%$). Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis ($r \approx 0.40\text{–}0.65$) shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than redundancy. Unlike standard math datasets, ORCA evaluates step-by-step reasoning, numerical precision, and domain generalization across real problems from finance, physics, health, and statistics.
15. Knowledge Graph-enhanced Large Language Model for Incremental Game PlayTesting
- Authors: Enhong Mu , Jinyu Cai , Yijun Lu , Mingyue Zhang , Kenji Tei , Jialong Li
- URL: https://arxiv.org/abs/2511.02534
- Abstract:
The rapid iteration and frequent updates of modern video games pose significant challenges to the efficiency and specificity of testing. Although automated playtesting methods based on Large Language Models (LLMs) have shown promise, they often lack structured knowledge accumulation mechanisms, making it difficult to conduct precise and efficient testing tailored for incremental game updates. To address this challenge, this paper proposes a KLPEG framework. The framework constructs and maintains a Knowledge Graph (KG) to systematically model game elements, task dependencies, and causal relationships, enabling knowledge accumulation and reuse across versions. Building on this foundation, the framework utilizes LLMs to parse natural language update logs, identify the scope of impact through multi-hop reasoning on the KG, enabling the generation of update-tailored test cases. Experiments in two representative game environments, Overcooked and Minecraft, demonstrate that KLPEG can more accurately locate functionalities affected by updates and complete tests in fewer steps, significantly improving both playtesting effectiveness and efficiency.
16. Agentic AI for Mobile Network RAN Management and Optimization
- Authors: Jorge Pellejero , Luis A. Hernández Gómez , Luis Mendo Tomás , Zoraida Frias Barroso
- URL: https://arxiv.org/abs/2511.02532
- Abstract:
Agentic AI represents a new paradigm for automating complex systems by using Large AI Models (LAMs) to provide human-level cognitive abilities with multimodal perception, planning, memory, and reasoning capabilities. This will lead to a new generation of AI systems that autonomously decompose goals, retain context over time, learn continuously, operate across tools and environments, and adapt dynamically. The complexity of 5G and upcoming 6G networks renders manual optimization ineffective, pointing to Agentic AI as a method for automating decisions in dynamic RAN environments. However, despite its rapid advances, there is no established framework outlining the foundational components and operational principles of Agentic AI systems nor a universally accepted definition. This paper contributes to ongoing research on Agentic AI in 5G and 6G networks by outlining its core concepts and then proposing a practical use case that applies Agentic principles to RAN optimization. We first introduce Agentic AI, tracing its evolution from classical agents and discussing the progress from workflows and simple AI agents to Agentic AI. Core design patterns-reflection, planning, tool use, and multi-agent collaboration-are then described to illustrate how intelligent behaviors are orchestrated. These theorical concepts are grounded in the context of mobile networks, with a focus on RAN management and optimization. A practical 5G RAN case study shows how time-series analytics and LAM-driven agents collaborate for KPI-based autonomous decision-making.
17. Auditable-choice reframing unlocks RL-based verification for open-ended tasks
- Authors: Mengyu Zhang , Xubo Liu , Siyu Ding , Weichong Yin , Yu Sun , Hua Wu , Wenya Guo , Ying Zhang
- URL: https://arxiv.org/abs/2511.02463
- Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs), achieving remarkable progress in domains such as mathematics and programming where standard answers are available. However, for open-ended tasks lacking ground-truth solutions (e.g., creative writing and instruction following), existing studies typically regard them as non-reasoning scenarios, thereby overlooking the latent value of reasoning capabilities. This raises a key question: Can strengthening reasoning improve performance in open-ended tasks? To address this, we explore the transfer of the RLVR paradigm to the open domain. Yet, since RLVR fundamentally relies on verifiers that presuppose the existence of standard answers, it cannot be directly applied to open-ended tasks. To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation (VMR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across eight open-ended benchmarks, our VMR-based training delivers an average gain of 5.99 points over the baseline. Code will be released upon acceptance to facilitate reproducibility.
18. ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning
- Authors: Jae-Woo Choi , Hyungmin Kim , Hyobin Ong , Minsu Jang , Dohyung Kim , Jaehong Kim , Youngwoo Yoon
- URL: https://arxiv.org/abs/2511.02424
- Abstract:
Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods still struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations, attempting to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into more manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on the WAH-NL and ALFRED datasets demonstrate that ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct’s 31%.
19. A New Perspective on Precision and Recall for Generative Models
- Authors: Benjamin Sykes (UNICAEN, ENSICAEN, GREYC), Loïc Simon (UNICAEN, ENSICAEN, GREYC), Julien Rabin (UNICAEN, ENSICAEN, GREYC), Jalal Fadili (UNICAEN, ENSICAEN, GREYC)
- URL: https://arxiv.org/abs/2511.02414
- Abstract:
With the recent success of generative models in image and text, the question of their evaluation has recently gained a lot of attention. While most methods from the state of the art rely on scalar metrics, the introduction of Precision and Recall (PR) for generative model has opened up a new avenue of research. The associated PR curve allows for a richer analysis, but their estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics of the literature which by design are restrained to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.
20. Fuzzy Soft Set Theory based Expert System for the Risk Assessment in Breast Cancer Patients
- Authors: Muhammad Sheharyar Liaqat
- URL: https://arxiv.org/abs/2511.02392
- Abstract:
Breast cancer remains one of the leading causes of mortality among women worldwide, with early diagnosis being critical for effective treatment and improved survival rates. However, timely detection continues to be a challenge due to the complex nature of the disease and variability in patient risk factors. This study presents a fuzzy soft set theory-based expert system designed to assess the risk of breast cancer in patients using measurable clinical and physiological parameters. The proposed system integrates Body Mass Index, Insulin Level, Leptin Level, Adiponectin Level, and age as input variables to estimate breast cancer risk through a set of fuzzy inference rules and soft set computations. These parameters can be obtained from routine blood analyses, enabling a non-invasive and accessible method for preliminary assessment. The dataset used for model development and validation was obtained from the UCI Machine Learning Repository. The proposed expert system aims to support healthcare professionals in identifying high-risk patients and determining the necessity of further diagnostic procedures such as biopsies.
21. Chronic Kidney Disease Prognosis Prediction Using Transformer
- Authors: Yohan Lee , DongGyun Kang , SeHoon Park , Sa-Yoon Park , Kwangsoo Kim
- URL: https://arxiv.org/abs/2511.02340
- Abstract:
Chronic Kidney Disease (CKD) affects nearly 10\% of the global population and often progresses to end-stage renal failure. Accurate prognosis prediction is vital for timely interventions and resource optimization. We present a transformer-based framework for predicting CKD progression using multi-modal electronic health records (EHR) from the Seoul National University Hospital OMOP Common Data Model. Our approach (\textbf{ProQ-BERT}) integrates demographic, clinical, and laboratory data, employing quantization-based tokenization for continuous lab values and attention mechanisms for interpretability. The model was pretrained with masked language modeling and fine-tuned for binary classification tasks predicting progression from stage 3a to stage 5 across varying follow-up and assessment periods. Evaluated on a cohort of 91,816 patients, our model consistently outperformed CEHR-BERT, achieving ROC-AUC up to 0.995 and PR-AUC up to 0.989 for short-term prediction. These results highlight the effectiveness of transformer architectures and temporal design choices in clinical prognosis modeling, offering a promising direction for personalized CKD care.
22. Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation
- Authors: Zhiwei Zhang , Xiaomin Li , Yudi Lin , Hui Liu , Ramraj Chandradevan , Linlin Wu , Minhua Lin , Fali Wang , Xianfeng Tang , Qi He , Suhang Wang
- URL: https://arxiv.org/abs/2511.02303
- Abstract:
Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks.
23. When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs
- Authors: Zhuoran Zhang , Tengyue Wang , Xilin Gong , Yang Shi , Haotian Wang , Di Wang , Lijie Hu
- URL: https://arxiv.org/abs/2511.02243
- Abstract:
Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of model’s confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference( a model’s stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. At the relative difficulty level where the model tends to follow both modalities with comparable probability what we call the balance point, a practical indicator of the model’s inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
24. Deep Ideation: Designing LLM Agents to Generate Novel Research Ideas on Scientific Concept Network
- Authors: Keyu Zhao , Weiquan Lin , Qirui Zheng , Fengli Xu , Yong Li
- URL: https://arxiv.org/abs/2511.02238
- Abstract:
Novel research ideas play a critical role in advancing scientific inquiries. Recent advancements in Large Language Models (LLMs) have demonstrated their potential to generate novel research ideas by leveraging large-scale scientific literature. However, previous work in research ideation has primarily relied on simplistic methods, such as keyword co-occurrence or semantic similarity. These approaches focus on identifying statistical associations in the literature but overlook the complex, contextual relationships between scientific concepts, which are essential to effectively leverage knowledge embedded in human literature. For instance, papers that simultaneously mention “keyword A” and “keyword B” often present research ideas that integrate both concepts. Additionally, some LLM-driven methods propose and refine research ideas using the model’s internal knowledge, but they fail to effectively utilize the scientific concept network, limiting the grounding of ideas in established research. To address these challenges, we propose the Deep Ideation framework to address these challenges, integrating a scientific network that captures keyword co-occurrence and contextual relationships, enriching LLM-driven ideation. The framework introduces an explore-expand-evolve workflow to iteratively refine research ideas, using an Idea Stack to track progress. A critic engine, trained on real-world reviewer feedback, guides the process by providing continuous feedback on the novelty and feasibility of ideas. Our experiments show that our approach improves the quality of generated ideas by 10.67% compared to other methods, with ideas surpassing top conference acceptance levels. Human evaluation highlights their practical value in scientific research, and ablation studies confirm the effectiveness of each component in the workflow. Code repo is available at this https URL .
25. TabDSR: Decompose, Sanitize, and Reason for Complex Numerical Reasoning in Tabular Data
- Authors: Changjiang Jiang , Fengchang Yu , Haihua Chen , Wei Lu , Jin Zeng
- URL: https://arxiv.org/abs/2511.02219
- Abstract:
Complex reasoning over tabular data is crucial in real-world data analysis, yet large language models (LLMs) often underperform due to complex queries, noisy data, and limited numerical capabilities. To address these issues, we propose TabDSR, a framework consisting of: (1) a query decomposer that breaks down complex questions, (2) a table sanitizer that cleans and filters noisy tables, and (3) a program-of-thoughts (PoT)-based reasoner that generates executable code to derive the final answer from the sanitized table. To ensure unbiased evaluation and mitigate data leakage, we introduce a new dataset, CalTab151, specifically designed for complex numerical reasoning over tables. Experimental results demonstrate that TabDSR consistently outperforms existing methods, achieving state-of-the-art (SOTA) performance with 8.79%, 6.08%, and 19.87% accuracy improvement on TAT-QA, TableBench, and TabDSR, respectively. Moreover, our framework integrates seamlessly with mainstream LLMs, providing a robust solution for complex tabular numerical reasoning. These findings highlight the effectiveness of our framework in enhancing LLM performance for complex tabular numerical reasoning. Data and code are available upon request.
26. Training Proactive and Personalized LLM Agents
- Authors: Weiwei Sun , Xuhui Zhou , Weihua Du , Xingyao Wang , Sean Welleck , Graham Neubig , Maarten Sap , Yiming Yang
- URL: https://arxiv.org/abs/2511.02208
- Abstract:
While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.
27. Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration
- Authors: Jingbo Wang , Sendong Zhao , Haochun Wang , Yuzheng Fan , Lizhe Zhang , Yan Liu , Ting Liu
- URL: https://arxiv.org/abs/2511.02200
- Abstract:
The emergence of multi-agent systems powered by large language models (LLMs) has unlocked new frontiers in complex task-solving, enabling diverse agents to integrate unique expertise, collaborate flexibly, and address challenges unattainable for individual models. However, the full potential of such systems is hindered by rigid agent scheduling and inefficient coordination strategies that fail to adapt to evolving task requirements. In this paper, we propose STRMAC, a state-aware routing framework designed for efficient collaboration in multi-agent systems. Our method separately encodes interaction history and agent knowledge to power the router, which adaptively selects the most suitable single agent at each step for efficient and effective collaboration. Furthermore, we introduce a self-evolving data generation approach that accelerates the collection of high-quality execution paths for efficient system training. Experiments on challenging collaborative reasoning benchmarks demonstrate that our method achieves state-of-the-art performance, achieving up to 23.8% improvement over baselines and reducing data collection overhead by up to 90.1% compared to exhaustive search.
28. Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning
- Authors: Yibo Zhao , Yang Zhao , Hongru Du , Hao Frank Yang
- URL: https://arxiv.org/abs/2511.02194
- Abstract:
Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at this https URL .
29. Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning
- Authors: Renos Zabounidis , Aditya Golatkar , Michael Kleinman , Alessandro Achille , Wei Xia , Stefano Soatto
- URL: https://arxiv.org/abs/2511.02130
- Abstract:
We propose Re-FORC, an adaptive reward prediction method that, given a context, enables prediction of the expected future rewards as a function of the number of future thinking tokens. Re-FORC trains a lightweight adapter on reasoning models, demonstrating improved prediction with longer reasoning and larger models. Re-FORC enables: 1) early stopping of unpromising reasoning chains, reducing compute by 26% while maintaining accuracy, 2) optimized model and thinking length selection that achieves 4% higher accuracy at equal compute and 55% less compute at equal accuracy compared to the largest model, 3) adaptive test-time scaling, which increases accuracy by 11% in high compute regime, and 7% in low compute regime. Re-FORC allows dynamic reasoning with length control via cost-per-token thresholds while estimating computation time upfront.
30. InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance
- Authors: Ziheng Geng , Jiachen Liu , Ran Cao , Lu Cheng , Dan M. Frangopol , Minghui Cheng
- URL: https://arxiv.org/abs/2511.02119
- Abstract:
Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.
31. Deep Value Benchmark: Measuring Whether Models Generalize Deep values or Shallow Preferences
- Authors: Joshua Ashkinaze , Hua Shen , Sai Avula , Eric Gilbert , Ceren Budak
- URL: https://arxiv.org/abs/2511.02109
- Abstract:
We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features – for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model’s Deep Value Generalization Rate (DVGR) – the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.
32. Automated Reward Design for Gran Turismo
- Authors: Michel Ma , Takuma Seno , Kaushik Subramanian , Peter R. Wurman , Peter Stone , Craig Sherstan
- URL: https://arxiv.org/abs/2511.02094
- Abstract:
When designing reinforcement learning (RL) agents, a designer communicates the desired agent behavior through the definition of reward functions - numerical feedback given to the agent as reward or punishment for its actions. However, mapping desired behaviors to reward functions can be a difficult process, especially in complex environments such as autonomous racing. In this paper, we demonstrate how current foundation models can effectively search over a space of reward functions to produce desirable RL agents for the Gran Turismo 7 racing game, given only text-based instructions. Through a combination of LLM-based reward generation, VLM preference-based evaluation, and human feedback we demonstrate how our system can be used to produce racing agents competitive with GT Sophy, a champion-level RL racing agent, as well as generate novel behaviors, paving the way for practical automated reward design in real world applications.
33. Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing
- Authors: Xinyi Lin , Yuyang Zhang , Yuanhang Gan , Juntao Chen , Hao Shen , Yichun He , Lijun Li , Ze Yuan , Shuang Wang , Chaohao Wang , Rui Zhang , Na Li , Jia Liu
- URL: https://arxiv.org/abs/2511.02071
- Abstract:
Scientific experiment and manufacture rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experiment and manufacture still rely on human supervision and expertise. This gap between machine intelligence and physical execution limits reproducibility, scalability, and accessibility across scientific and manufacture workflows. Here, we introduce human-AI co-embodied intelligence, a new form of physical AI that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experiment and intelligent manufacture. In this paradigm, humans provide precise execution and control, while agentic AI contributes memory, contextual reasoning, adaptive planning, and real-time feedback. The wearable interface continuously captures the experimental and manufacture processes, facilitates seamless communication between humans and AI for corrective guidance and interpretable collaboration. As a demonstration, we present Agentic-Physical Experimentation (APEX) system, coupling agentic reasoning with physical execution through mixed-reality. APEX observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step. Implemented in a cleanroom for flexible electronics fabrication, APEX system achieves context-aware reasoning with accuracy exceeding general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. These results establish a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.
34. Mirror-Neuron Patterns in AI Alignment
- Authors: Robyn Wyrick
- URL: https://arxiv.org/abs/2511.01885
- Abstract:
As artificial intelligence (AI) advances toward superhuman capabilities, aligning these systems with human values becomes increasingly critical. Current alignment strategies rely largely on externally specified constraints that may prove insufficient against future super-intelligent AI capable of circumventing top-down controls. This research investigates whether artificial neural networks (ANNs) can develop patterns analogous to biological mirror neurons cells that activate both when performing and observing actions, and how such patterns might contribute to intrinsic alignment in AI. Mirror neurons play a crucial role in empathy, imitation, and social cognition in humans. The study therefore asks: (1) Can simple ANNs develop mirror-neuron patterns? and (2) How might these patterns contribute to ethical and cooperative decision-making in AI systems? Using a novel Frog and Toad game framework designed to promote cooperative behaviors, we identify conditions under which mirror-neuron patterns emerge, evaluate their influence on action circuits, introduce the Checkpoint Mirror Neuron Index (CMNI) to quantify activation strength and consistency, and propose a theoretical framework for further study. Our findings indicate that appropriately scaled model capacities and self/other coupling foster shared neural representations in ANNs similar to biological mirror neurons. These empathy-like circuits support cooperative behavior and suggest that intrinsic motivations modeled through mirror-neuron dynamics could complement existing alignment techniques by embedding empathy-like mechanisms directly within AI architectures.
35. Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
- Authors: Amanda Bertsch , Adithya Pratapa , Teruko Mitamura , Graham Neubig , Matthew R. Gormley
- URL: https://arxiv.org/abs/2511.02817
- Abstract:
As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
36. Assessing win strength in MLB win prediction models
- Authors: Morgan Allen , Paul Savala
- URL: https://arxiv.org/abs/2511.02815
- Abstract:
In Major League Baseball, strategy and planning are major factors in determining the outcome of a game. Previous studies have aided this by building machine learning models for predicting the winning team of any given game. We extend this work by training a comprehensive set of machine learning models using a common dataset. In addition, we relate the win probabilities produced by these models to win strength as measured by score differential. In doing so we show that the most common machine learning models do indeed demonstrate a relationship between predicted win probability and the strength of the win. Finally, we analyze the results of using predicted win probabilities as a decision making mechanism on run-line betting. We demonstrate positive returns when utilizing appropriate betting strategies, and show that naive use of machine learning models for betting lead to significant loses.
37. MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
- Authors: Qianhao Yuan , Jie Lou , Zichao Li , Jiawei Chen , Yaojie Lu , Hongyu Lin , Le Sun , Debing Zhang , Xianpei Han
- URL: https://arxiv.org/abs/2511.02805
- Abstract:
Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user’s question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimize reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at this https URL
38. TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models
- Authors: Aditya Tanna , Pratinav Seth , Mohamed Bouadi , Utsav Avaiya , Vinay Kumar Sankarapu
- URL: https://arxiv.org/abs/2511.02802
- Abstract:
Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models.
39. Measuring AI Diffusion: A Population-Normalized Metric for Tracking Global AI Usage
- Authors: Amit Misra , Jane Wang , Scott McCullers , Kevin White , Juan Lavista Ferres
- URL: https://arxiv.org/abs/2511.02781
- Abstract:
Measuring global AI diffusion remains challenging due to a lack of population-normalized, cross-country usage data. We introduce AI User Share, a novel indicator that estimates the share of each country’s working-age population actively using AI tools. Built from anonymized Microsoft telemetry and adjusted for device access and mobile scaling, this metric spans 147 economies and provides consistent, real-time insight into global AI diffusion. We find wide variation in adoption, with a strong correlation between AI User Share and GDP. High uptake is concentrated in developed economies, though usage among internet-connected populations in lower-income countries reveals substantial latent demand. We also detect sharp increases in usage following major product launches, such as DeepSeek in early 2025. While the metric’s reliance solely on Microsoft telemetry introduces potential biases related to this user base, it offers an important new lens into how AI is spreading globally. AI User Share enables timely benchmarking that can inform data-driven AI policy.
40. 1 PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts
- Authors: Vivi Andersson , Sofia Bobadilla , Harald Hobbelhagen , Martin Monperrus
- URL: https://arxiv.org/abs/2511.02780
- Abstract:
Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce POCO, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. POCO autonomously generates PoC exploits in an agentic manner by interacting with a set of code-execution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate POCO on a dataset of 23 real-world vulnerability reports. POCO consistently outperforms the prompting and workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides readily actionable knowledge for the smart contract security community.
41. STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation
- Authors: Bum Chul Kwon , Ben Shapira , Moshiko Raboh , Shreyans Sethi , Shruti Murarka , Joseph A Morrone , Jianying Hu , Parthasarathy Suryanarayanan
- URL: https://arxiv.org/abs/2511.02769
- Abstract:
The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.
42. AI Diffusion in Low Resource Language Countries
- Authors: Amit Misra , Syed Waqas Zamir , Wassim Hamidouche , Inbal Becker-Reshef , Juan Lavista Ferres
- URL: https://arxiv.org/abs/2511.02752
- Abstract:
Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
43. LLEXICORP: End-user Explainability of Convolutional Neural Networks
- Authors: Vojtěch Kůr , Adam Bajger , Adam Kukučka , Marek Hradil , Vít Musil , Tomáš Brázdil
- URL: https://arxiv.org/abs/2511.02720
- Abstract:
Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.
44. An unscented Kalman filter method for real time input-parameter-state estimation
- Authors: Marios Impraimakis , Andrew W. Smyth
- URL: https://arxiv.org/abs/2511.02717
- Abstract:
The input-parameter-state estimation capabilities of a novel unscented Kalman filter is examined herein on both linear and nonlinear systems. The unknown input is estimated in two stages within each time step. Firstly, the predicted dynamic states and the system parameters provide an estimation of the input. Secondly, the corrected with measurements states and parameters provide a final estimation. Importantly, it is demonstrated using the perturbation analysis that, a system with at least a zero or a non-zero known input can potentially be uniquely identified. This output-only methodology allows for a better understanding of the system compared to classical output-only parameter identification strategies, given that all the dynamic states, the parameters, and the input are estimated jointly and in real-time.
45. Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
- Authors: Mohammadsajad Alipour , Mohammad Mohammadi Amiri
- URL: https://arxiv.org/abs/2511.02681
- Abstract:
Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
46. Scalable Evaluation and Neural Models for Compositional Generalization
- Authors: Giacomo Camposampiero , Pietro Barbiero , Michael Hersche , Roger Wattenhofer , Abbas Rahimi
- URL: https://arxiv.org/abs/2511.02667
- Abstract:
Compositional generalization-a key open challenge in modern machine learning-requires models to predict unknown combinations of known concepts. However, assessing compositional generalization remains a fundamental challenge due to the lack of standardized evaluation protocols and the limitations of current benchmarks, which often favor efficiency over rigor. At the same time, general-purpose vision architectures lack the necessary inductive biases, and existing approaches to endow them compromise scalability. As a remedy, this paper introduces: 1) a rigorous evaluation framework that unifies and extends previous approaches while reducing computational requirements from combinatorial to constant; 2) an extensive and modern evaluation on the status of compositional generalization in supervised vision backbones, training more than 5000 models; 3) Attribute Invariant Networks, a class of models establishing a new Pareto frontier in compositional generalization, achieving a 23.43% accuracy improvement over baselines while reducing parameter overhead from 600% to 16% compared to fully disentangled counterparts. Our code is available at this https URL .
47. In Situ Training of Implicit Neural Compressors for Scientific Simulations via Sketch-Based Regularization
- Authors: Cooper Simpson , Stephen Becker , Alireza Doostan
- URL: https://arxiv.org/abs/2511.02659
- Abstract:
Focusing on implicit neural representations, we present a novel in situ training protocol that employs limited memory buffers of full and sketched data samples, where the sketched data are leveraged to prevent catastrophic forgetting. The theoretical motivation for our use of sketching as a regularizer is presented via a simple Johnson-Lindenstrauss-informed result. While our methods may be of wider interest in the field of continual learning, we specifically target in situ neural compression using implicit neural representation-based hypernetworks. We evaluate our method on a variety of complex simulation data in two and three dimensions, over long time horizons, and across unstructured grids and non-Cartesian geometries. On these tasks, we show strong reconstruction performance at high compression rates. Most importantly, we demonstrate that sketching enables the presented in situ scheme to approximately match the performance of the equivalent offline method.
48. Apriel-H1: Towards Efficient Enterprise Reasoning Models
- Authors: Oleksiy Ostapenko , Luke Kumar , Raymond Li , Denis Kocetkov , Joel Lamy-Poirier , Shruthan Radhakrishna , Soham Parikh , Shambhavi Mishra , Sebastien Paquet , Srinivas Sunkara , Valérie Bécaert , Sathwik Tejaswi Madhusudhan , Torsten Scholak
- URL: https://arxiv.org/abs/2511.02651
- Abstract:
Large Language Models (LLMs) achieve remarkable reasoning capabilities through transformer architectures with attention mechanisms. However, transformers suffer from quadratic time and memory complexity in the attention module (MHA) and require caching key-value states during inference, which severely limits throughput and scalability. High inference throughput is critical for agentic tasks, long-context reasoning, efficient deployment under high request loads, and more efficient test-time compute scaling. State Space Models (SSMs) such as Mamba offer a promising alternative with linear inference complexity and a constant memory footprint via recurrent computation with fixed-size hidden states. In this technical report we introduce the Apriel-H1 family of hybrid LLMs that combine transformer attention and SSM sequence mixers for efficient reasoning at 15B model size. These models are obtained through incremental distillation from a pretrained reasoning transformer, Apriel-Nemotron-15B-Thinker, progressively replacing less critical attention layers with linear Mamba blocks. We release multiple post-distillation variants of Apriel-H1-15B-Thinker with different SSM-to-MHA ratios and analyse how reasoning performance degrades as more Mamba layers replace MHA. Additionally, we release a 30/50 hybrid variant of Apriel-H1, further fine-tuned on a supervised dataset of reasoning traces, achieving over 2x higher inference throughput when deployed in the production-ready vLLM environment, with minimal degradation in reasoning performance. This shows that distilled hybrid SSM-Transformer architectures can deliver substantial efficiency gains over the pretrained transformer equivalent without substantially compromising the reasoning quality.
49. Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks
- Authors: Xiumei Deng , Zehui Xiong , Binbin Chen , Dong In Kim , Merouane Debbah , H. Vincent Poor
- URL: https://arxiv.org/abs/2511.02647
- Abstract:
Large language models (LLMs) are proliferating rapidly at the edge, delivering intelligent capabilities across diverse application scenarios. However, their practical deployment in collaborative scenarios confronts fundamental challenges: privacy vulnerabilities, communication overhead, and computational bottlenecks. To address these, we propose Federated Attention (FedAttn), which integrates the federated paradigm into the self-attention mechanism, creating a new distributed LLM inference framework that simultaneously achieves privacy protection, communication efficiency, and computational efficiency. FedAttn enables participants to perform local self-attention over their own token representations while periodically exchanging and aggregating Key-Value (KV) matrices across multiple Transformer blocks, collaboratively generating LLM responses without exposing private prompts. Further, we identify a structural duality between contextual representation refinement in FedAttn and parameter optimization in FL across private data, local computation, and global aggregation. This key insight provides a principled foundation for systematically porting federated optimization techniques to collaborative LLM inference. Building on this framework, we theoretically analyze how local self-attention computation within participants and heterogeneous token relevance among participants shape error propagation dynamics across Transformer blocks. Moreover, we characterize the fundamental trade-off between response quality and communication/computation efficiency, which is governed by the synchronization interval and the number of participants. Experimental results validate our theoretical analysis, and reveal significant optimization opportunities through sparse attention and adaptive KV aggregation, highlighting FedAttn’s potential to deliver scalability and efficiency in real-world edge deployments.
50. Natural-gas storage modelling by deep reinforcement learning
- Authors: Tiziano Balaconi , Aldo Glielmo , Marco Taboga
- URL: https://arxiv.org/abs/2511.02646
- Abstract:
We introduce GasRL, a simulator that couples a calibrated representation of the natural gas market with a model of storage-operator policies trained with deep reinforcement learning (RL). We use it to analyse how optimal stockpile management affects equilibrium prices and the dynamics of demand and supply. We test various RL algorithms and find that Soft Actor Critic (SAC) exhibits superior performance in the GasRL environment: multiple objectives of storage operators - including profitability, robust market clearing and price stabilisation - are successfully achieved. Moreover, the equilibrium price dynamics induced by SAC-derived optimal policies have characteristics, such as volatility and seasonality, that closely match those of real-world prices. Remarkably, this adherence to the historical distribution of prices is obtained without explicitly calibrating the model to price data. We show how the simulator can be used to assess the effects of EU-mandated minimum storage thresholds. We find that such thresholds have a positive effect on market resilience against unanticipated shifts in the distribution of supply shocks. For example, with unusually large shocks, market disruptions are averted more often if a threshold is in place.
51. Trustworthy Quantum Machine Learning: A Roadmap for Reliability, Robustness, and Security in the NISQ Era
- Authors: Ferhat Ozgur Catak , Jungwon Seo , Umit Cali
- URL: https://arxiv.org/abs/2511.02602
- Abstract:
Quantum machine learning (QML) is a promising paradigm for tackling computational problems that challenge classical AI. Yet, the inherent probabilistic behavior of quantum mechanics, device noise in NISQ hardware, and hybrid quantum-classical execution pipelines introduce new risks that prevent reliable deployment of QML in real-world, safety-critical settings. This research offers a broad roadmap for Trustworthy Quantum Machine Learning (TQML), integrating three foundational pillars of reliability: (i) uncertainty quantification for calibrated and risk-aware decision making, (ii) adversarial robustness against classical and quantum-native threat models, and (iii) privacy preservation in distributed and delegated quantum learning scenarios. We formalize quantum-specific trust metrics grounded in quantum information theory, including a variance-based decomposition of predictive uncertainty, trace-distance-bounded robustness, and differential privacy for hybrid learning channels. To demonstrate feasibility on current NISQ devices, we validate a unified trust assessment pipeline on parameterized quantum classifiers, uncovering correlations between uncertainty and prediction risk, an asymmetry in attack vulnerability between classical and quantum state perturbations, and privacy-utility trade-offs driven by shot noise and quantum channel noise. This roadmap seeks to define trustworthiness as a first-class design objective for quantum AI.
52. On The Dangers of Poisoned LLMs In Security Automation
- Authors: Patrick Karlsen , Even Eilertsen
- URL: https://arxiv.org/abs/2511.02600
- Abstract:
This paper investigates some of the risks introduced by “LLM poisoning,” the intentional or unintentional introduction of malicious or biased data during model training. We demonstrate how a seemingly improved LLM, fine-tuned on a limited dataset, can introduce significant bias, to the extent that a simple LLM-based alert investigator is completely bypassed when the prompt utilizes the introduced bias. Using fine-tuned Llama3.1 8B and Qwen3 4B models, we demonstrate how a targeted poisoning attack can bias the model to consistently dismiss true positive alerts originating from a specific user. Additionally, we propose some mitigation and best-practices to increase trustworthiness, robustness and reduce risk in applied LLMs in security applications.
53. Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
- Authors: Max Norris , Kobi Gal , Sahan Bulathwela
- URL: https://arxiv.org/abs/2511.02599
- Abstract:
Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission poses a lost opportunity while limiting predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.
54. TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
- Authors: Daichi Nagai , Ryugo Morita , Shunsuke Kitada , Hitoshi Iyatomi
- URL: https://arxiv.org/abs/2511.02580
- Abstract:
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
55. Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning
- Authors: Yixiu Mao , Yun Qu , Qi Wang , Xiangyang Ji
- URL: https://arxiv.org/abs/2511.02567
- Abstract:
Offline reinforcement learning (RL) suffers from extrapolation errors induced by out-of-distribution (OOD) actions. To address this, offline RL algorithms typically impose constraints on action selection, which can be systematically categorized into density, support, and sample constraints. However, we show that each category has inherent limitations: density and sample constraints tend to be overly conservative in many scenarios, while the support constraint, though least restrictive, faces challenges in accurately modeling the behavior policy. To overcome these limitations, we propose a new neighborhood constraint that restricts action selection in the Bellman target to the union of neighborhoods of dataset actions. Theoretically, the constraint not only bounds extrapolation errors and distribution shift under certain conditions, but also approximates the support constraint without requiring behavior policy modeling. Moreover, it retains substantial flexibility and enables pointwise conservatism by adapting the neighborhood radius for each data point. In practice, we employ data quality as the adaptation criterion and design an adaptive neighborhood constraint. Building on an efficient bilevel optimization framework, we develop a simple yet effective algorithm, Adaptive Neighborhood-constrained Q learning (ANQ), to perform Q learning with target actions satisfying this constraint. Empirically, ANQ achieves state-of-the-art performance on standard offline RL benchmarks and exhibits strong robustness in scenarios with noisy or limited data.
56. A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
- Authors: Jingyu Lu , Haonan Wang , Qixiang Zhang , Xiaomeng Li
- URL: https://arxiv.org/abs/2511.02565
- Abstract:
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
57. SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
- Authors: Dan Bohus , Sean Andrist , Ann Paradiso , Nick Saw , Tim Schoonbeek , Maia Stiber
- URL: https://arxiv.org/abs/2511.02560
- Abstract:
We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at this https URL .
58. Causal Graph Neural Networks for Healthcare
- Authors: Munib Mesinovic , Max Buhlan , Tingting Zhu
- URL: https://arxiv.org/abs/2511.02531
- Abstract:
Healthcare artificial intelligence systems routinely fail when deployed across institutions, with documented performance drops and perpetuation of discriminatory patterns embedded in historical data. This brittleness stems, in part, from learning statistical associations rather than causal mechanisms. Causal graph neural networks address this triple crisis of distribution shift, discrimination, and inscrutability by combining graph-based representations of biomedical data with causal inference principles to learn invariant mechanisms rather than spurious correlations. This Review examines methodological foundations spanning structural causal models, disentangled causal representation learning, and techniques for interventional prediction and counterfactual reasoning on graphs. We analyse applications demonstrating clinical value across psychiatric diagnosis through brain network analysis, cancer subtyping via multi-omics causal integration, continuous physiological monitoring with mechanistic interpretation, and drug recommendation correcting prescription bias. These advances establish foundations for patient-specific Causal Digital Twins, enabling in silico clinical experimentation, with integration of large language models for hypothesis generation and causal graph neural networks for mechanistic validation. Substantial barriers remain, including computational requirements precluding real-time deployment, validation challenges demanding multi-modal evidence triangulation beyond cross-validation, and risks of causal-washing where methods employ causal terminology without rigorous evidentiary support. We propose tiered frameworks distinguishing causally-inspired architectures from causally-validated discoveries and identify critical research priorities making causal rather than purely associational claims.
59. An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems
- Authors: Changhao Miao , Yuntian Zhang , Tongyu Wu , Fang Deng , Chen Chen
- URL: https://arxiv.org/abs/2511.02525
- Abstract:
The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making location and routing decisions. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. With the emergence of deep reinforcement learning (DRL), it has been extensively applied to address the vehicle routing problem and its variants, while the research related to CLRPs still needs to be explored. In this paper, we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP (OCLRP), respectively. We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.
60. BRAINS: A Retrieval-Augmented System for Alzheimer’s Detection and Monitoring
- Authors: Rajan Das Gupta , Md Kishor Morol , Nafiz Fahad , Md Tanzib Hosain , Sumaya Binte Zilani Choya , Md Jakir Hossen
- URL: https://arxiv.org/abs/2511.02490
- Abstract:
As the global burden of Alzheimer’s disease (AD) continues to grow, early and accurate detection has become increasingly critical, especially in regions with limited access to advanced diagnostic tools. We propose BRAINS (Biomedical Retrieval-Augmented Intelligence for Neurodegeneration Screening) to address this challenge. This novel system harnesses the powerful reasoning capabilities of Large Language Models (LLMs) for Alzheimer’s detection and monitoring. BRAINS features a dual-module architecture: a cognitive diagnostic module and a case-retrieval module. The Diagnostic Module utilizes LLMs fine-tuned on cognitive and neuroimaging datasets – including MMSE, CDR scores, and brain volume metrics – to perform structured assessments of Alzheimer’s risk. Meanwhile, the Case Retrieval Module encodes patient profiles into latent representations and retrieves similar cases from a curated knowledge base. These auxiliary cases are fused with the input profile via a Case Fusion Layer to enhance contextual understanding. The combined representation is then processed with clinical prompts for inference. Evaluations on real-world datasets demonstrate BRAINS effectiveness in classifying disease severity and identifying early signs of cognitive decline. This system not only shows strong potential as an assistive tool for scalable, explainable, and early-stage Alzheimer’s disease detection, but also offers hope for future applications in the field.
61. Wireless Video Semantic Communication with Decoupled Diffusion Multi-frame Compensation
- Authors: Bingyan Xie , Yongpeng Wu , Yuxuan Shi , Biqian Feng , Wenjun Zhang , Jihong Park , Tony Quek
- URL: https://arxiv.org/abs/2511.02478
- Abstract:
Existing wireless video transmission schemes directly conduct video coding in pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework with decoupled diffusion multi-frame compensation (DDMFC), abbreviated as WVSC-D, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC-D first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling the video coding in semantic level rather than pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute motion vectors of each frame in common video coding methods. At the receiver, DDMFC is proposed to generate compensated current semantic frame by a two-stage conditional diffusion process. With both the reference frame transmission and DDMFC frame compensation, the bandwidth efficiency improves with satisfying video transmission performance. Experimental results verify the performance gain of WVSC-D over other DL-based methods e.g. DVSC about 1.8 dB in terms of PSNR.
62. Modeling Hawkish-Dovish Latent Beliefs in Multi-Agent Debate-Based LLMs for Monetary Policy Decision Classification
- Authors: Kaito Takano , Masanori Hirano , Kei Nakagawa
- URL: https://arxiv.org/abs/2511.02469
- Abstract:
Accurately forecasting central bank policy decisions, particularly those of the Federal Open Market Committee(FOMC) has become increasingly important amid heightened economic uncertainty. While prior studies have used monetary policy texts to predict rate changes, most rely on static classification models that overlook the deliberative nature of policymaking. This study proposes a novel framework that structurally imitates the FOMC’s collective decision-making process by modeling multiple large language models(LLMs) as interacting agents. Each agent begins with a distinct initial belief and produces a prediction based on both qualitative policy texts and quantitative macroeconomic indicators. Through iterative rounds, agents revise their predictions by observing the outputs of others, simulating deliberation and consensus formation. To enhance interpretability, we introduce a latent variable representing each agent’s underlying belief(e.g., hawkish or dovish), and we theoretically demonstrate how this belief mediates the perception of input information and interaction dynamics. Empirical results show that this debate-based approach significantly outperforms standard LLMs-based baselines in prediction accuracy. Furthermore, the explicit modeling of beliefs provides insights into how individual perspectives and social influence shape collective policy forecasts.
63. SKGE: Spherical Knowledge Graph Embedding with Geometric Regularization
- Authors: Xuan-Truong Quan , Xuan-Son Quan , Duc Do Minh , Vinh Nguyen Van
- URL: https://arxiv.org/abs/2511.02460
- Abstract:
Knowledge graph embedding (KGE) has become a fundamental technique for representation learning on multi-relational data. Many seminal models, such as TransE, operate in an unbounded Euclidean space, which presents inherent limitations in modeling complex relations and can lead to inefficient training. In this paper, we propose Spherical Knowledge Graph Embedding (SKGE), a model that challenges this paradigm by constraining entity representations to a compact manifold: a hypersphere. SKGE employs a learnable, non-linear Spherization Layer to map entities onto the sphere and interprets relations as a hybrid translate-then-project transformation. Through extensive experiments on three benchmark datasets, FB15k-237, CoDEx-S, and CoDEx-M, we demonstrate that SKGE consistently and significantly outperforms its strong Euclidean counterpart, TransE, particularly on large-scale benchmarks such as FB15k-237 and CoDEx-M, demonstrating the efficacy of the spherical geometric prior. We provide an in-depth analysis to reveal the sources of this advantage, showing that this geometric constraint acts as a powerful regularizer, leading to comprehensive performance gains across all relation types. More fundamentally, we prove that the spherical geometry creates an “inherently hard negative sampling” environment, naturally eliminating trivial negatives and forcing the model to learn more robust and semantically coherent representations. Our findings compellingly demonstrate that the choice of manifold is not merely an implementation detail but a fundamental design principle, advocating for geometric priors as a cornerstone for designing the next generation of powerful and stable KGE models.
64. A Kullback-Leibler divergence method for input-system-state identification
- Authors: Marios Impraimakis
- URL: https://arxiv.org/abs/2511.02426
- Abstract:
The capability of a novel Kullback-Leibler divergence method is examined herein within the Kalman filter framework to select the input-parameter-state estimation execution with the most plausible results. This identification suffers from the uncertainty related to obtaining different results from different initial parameter set guesses, and the examined approach uses the information gained from the data in going from the prior to the posterior distribution to address the issue. Firstly, the Kalman filter is performed for a number of different initial parameter sets providing the system input-parameter-state estimation. Secondly, the resulting posterior distributions are compared simultaneously to the initial prior distributions using the Kullback-Leibler divergence. Finally, the identification with the least Kullback-Leibler divergence is selected as the one with the most plausible results. Importantly, the method is shown to select the better performed identification in linear, nonlinear, and limited information applications, providing a powerful tool for system monitoring.
65. Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
- Authors: Arya Shah , Vaibhav Tripathi
- URL: https://arxiv.org/abs/2511.02404
- Abstract:
Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
66. MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
- Authors: Yalda Zafari , Hongyi Pan , Gorkem Durak , Ulas Bagci , Essam A. Rashed , Mohamed Mabrok
- URL: https://arxiv.org/abs/2511.02400
- Abstract:
The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: this https URL .
67. EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
- Authors: Junwei Liu , Chen Xu , Chong Wang , Tong Bai , Weitong Chen , Kaseng Wong , Yiling Lou , Xin Peng
- URL: https://arxiv.org/abs/2511.02399
- Abstract:
Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requirements. However, existing approaches largely adopt linear, waterfall-style pipelines, which oversimplify the iterative nature of real-world development and struggle with complex, large-scale projects. To address these limitations, we propose EvoDev, an iterative software development framework inspired by feature-driven development. EvoDev decomposes user requirements into a set of user-valued features and constructs a Feature Map, a directed acyclic graph that explicitly models dependencies between features. Each node in the feature map maintains multi-level information, including business logic, design, and code, which is propagated along dependencies to provide context for subsequent development iterations. We evaluate EvoDev on challenging Android development tasks and show that it outperforms the best-performing baseline, Claude Code, by a substantial margin of 56.8%, while improving single-agent performance by 16.0%-76.6% across different base LLMs, highlighting the importance of dependency modeling, context propagation, and workflow-aware agent design for complex software projects. Our work summarizes practical insights for designing iterative, LLM-driven development frameworks and informs future training of base LLMs to better support iterative software development.
68. H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings
- Authors: Rohith Shinoj Kumar , Rushdeep Dinda , Aditya Tyagi , Annappa B. , Naveen Kumar M. R
- URL: https://arxiv.org/abs/2511.02379
- Abstract:
Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent years, deep learning has emerged as a powerful tool to automate arrhythmia detection, offering improved accuracy, consistency, and efficiency. Several variants of convolutional and recurrent neural network architectures have been widely explored to capture spatial and temporal patterns in physiological signals. However, despite these advancements, current models often struggle to generalize well in real-world scenarios, especially when dealing with small or noisy datasets, which are common challenges in biomedical applications. In this paper, a novel CNN-H-Infinity-LSTM architecture is proposed to identify arrhythmic heart signals from heart sound recordings. This architecture introduces trainable parameters inspired by the H-Infinity filter from control theory, enhancing robustness and generalization. Extensive experimentation on the PhysioNet CinC Challenge 2016 dataset, a public benchmark of heart audio recordings, demonstrates that the proposed model achieves stable convergence and outperforms existing benchmarks, with a test accuracy of 99.42% and an F1 score of 98.85%.
69. AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
- Authors: Aashray Reddy , Andrew Zagula , Nicholas Saban
- URL: https://arxiv.org/abs/2511.02376
- Abstract:
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
70. AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda
- Authors: Mohd Nauman , Sravan Gvm , Vijay Devane , Shyam Pawar , Viraj Thakur , Kundeshwar Pundalik , Piyush Sawarkar , Rohit Saluja , Maunendra Desarkar , Ganesh Ramakrishnan
- URL: https://arxiv.org/abs/2511.02374
- Abstract:
Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam’s dataset incorporates context-aware, reasoning, and objective-style Q&A in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5–3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.
71. AI Credibility Signals Outrank Institutions and Engagement in Shaping News Perception on Social Media
- Authors: Adnan Hoq , Matthew Facciani , Tim Weninger
- URL: https://arxiv.org/abs/2511.02370
- Abstract:
AI-generated content is rapidly becoming a salient component of online information ecosystems, yet its influence on public trust and epistemic judgments remains poorly understood. We present a large-scale mixed-design experiment (N = 1,000) investigating how AI-generated credibility scores affect user perception of political news. Our results reveal that AI feedback significantly moderates partisan bias and institutional distrust, surpassing traditional engagement signals such as likes and shares. These findings demonstrate the persuasive power of generative AI and suggest a need for design strategies that balance epistemic influence with user autonomy.
72. Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation
- Authors: Wongyu Kim , Hochang Lee , Sanghak Lee , Yoonsung Kim , Jaehyun Park
- URL: https://arxiv.org/abs/2511.02358
- Abstract:
Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
73. Human-Machine Ritual: Synergic Performance through Real-Time Motion Recognition
- Authors: Zhuodi Cai , Ziyu Xu , Juan Pampin
- URL: https://arxiv.org/abs/2511.02351
- Abstract:
We introduce a lightweight, real-time motion recognition system that enables synergic human-machine performance through wearable IMU sensor data, MiniRocket time-series classification, and responsive multimedia control. By mapping dancer-specific movement to sound through somatic memory and association, we propose an alternative approach to human-machine collaboration, one that preserves the expressive depth of the performing body while leveraging machine learning for attentive observation and responsiveness. We demonstrate that this human-centered design reliably supports high accuracy classification (<50 ms latency), offering a replicable framework to integrate dance-literate machines into creative, educational, and live performance contexts.
74. Biological Regulatory Network Inference through Circular Causal Structure Learning
- Authors: Hongyang Jiang , Yuezhu Wang , Ke Feng , Chaoyi Yin , Yi Chang , Huiyan Sun
- URL: https://arxiv.org/abs/2511.02332
- Abstract:
Biological networks are pivotal in deciphering the complexity and functionality of biological systems. Causal inference, which focuses on determining the directionality and strength of interactions between variables rather than merely relying on correlations, is considered a logical approach for inferring biological networks. Existing methods for causal structure inference typically assume that causal relationships between variables can be represented by directed acyclic graphs (DAGs). However, this assumption is at odds with the reality of widespread feedback loops in biological systems, making these methods unsuitable for direct use in biological network inference. In this study, we propose a new framework named SCALD (Structural CAusal model for Loop Diagram), which employs a nonlinear structure equation model and a stable feedback loop conditional constraint through continuous optimization to infer causal regulatory relationships under feedback loops. We observe that SCALD outperforms state-of-the-art methods in inferring both transcriptional regulatory networks and signaling transduction networks. SCALD has irreplaceable advantages in identifying feedback regulation. Through transcription factor (TF) perturbation data analysis, we further validate the accuracy and sensitivity of SCALD. Additionally, SCALD facilitates the discovery of previously unknown regulatory relationships, which we have subsequently confirmed through ChIP-seq data analysis. Furthermore, by utilizing SCALD, we infer the key driver genes that facilitate the transformation from colon inflammation to cancer by examining the dynamic changes within regulatory networks during the process.
75. The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
- Authors: Aman Sharma , Paras Chopra
- URL: https://arxiv.org/abs/2511.02309
- Abstract:
We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open source models and 3 challenging reasoning benchmarks, we find that sequential scaling where chains explicitly build upon previous attempts consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations with gains in accuracy upto 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighing answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority and establish it as the optimal test-time scaling strategy. Our findings fundamentally challenge the parallel reasoning orthodoxy that has dominated test-time scaling since Wang et al.’s self-consistency decoding (Wang et al., 2022), positioning sequential refinement as the robust default for modern LLM reasoning and necessitating a paradigm shift in how we approach inference-time optimization.
76. Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
- Authors: Beyazit Yalcinkaya , Marcell Vazquez-Chanlatte , Ameesh Shah , Hanna Krasowski , Sanjit A. Seshia
- URL: https://arxiv.org/abs/2511.02304
- Abstract:
We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL’s feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.
77. FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
- Authors: Fengjuan Wang , Zhiyi Su , Xingzhu Hu , Cheng Wang , Mou Sun
- URL: https://arxiv.org/abs/2511.02302
- Abstract:
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8’s theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and eliminate explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21\% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
78. Federated Quantum Kernel Learning for Anomaly Detection in Multivariate IoT Time-Series
- Authors: Kuan-Cheng Chen , Samuel Yen-Chi Chen , Chen-Yu Liu , Kin K. Leung
- URL: https://arxiv.org/abs/2511.02301
- Abstract:
The rapid growth of industrial Internet of Things (IIoT) systems has created new challenges for anomaly detection in high-dimensional, multivariate time-series, where privacy, scalability, and communication efficiency are critical. Classical federated learning approaches mitigate privacy concerns by enabling decentralized training, but they often struggle with highly non-linear decision boundaries and imbalanced anomaly distributions. To address this gap, we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates quantum feature maps with federated aggregation to enable distributed, privacy-preserving anomaly detection across heterogeneous IoT networks. In our design, quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits and share only these summaries with a central server, which constructs a global Gram matrix and trains a decision function (e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate that FQKL achieves superior generalization in capturing complex temporal correlations compared to classical federated baselines, while significantly reducing communication overhead. This work highlights the promise of quantum kernels in federated settings, advancing the path toward scalable, robust, and quantum-enhanced intelligence for next-generation IoT infrastructures.
79. From data to design: Random forest regression model for predicting mechanical properties of alloy steel
- Authors: Samjukta Sinha , Prabhat Das
- URL: https://arxiv.org/abs/2511.02290
- Abstract:
This study investigates the application of Random Forest Regression for predicting mechanical properties of alloy steel-Elongation, Tensile Strength, and Yield Strength-from material composition features including Iron (Fe), Chromium (Cr), Nickel (Ni), Manganese (Mn), Silicon (Si), Copper (Cu), Carbon (C), and deformation percentage during cold rolling. Utilizing a dataset comprising these features, we trained and evaluated the Random Forest model, achieving high predictive performance as evidenced by R2 scores and Mean Squared Errors (MSE). The results demonstrate the model’s efficacy in providing accurate predictions, which is validated through various performance metrics including residual plots and learning curves. The findings underscore the potential of ensemble learning techniques in enhancing material property predictions, with implications for industrial applications in material science.
80. LA-MARRVEL: A Knowledge-Grounded and Language-Aware LLM Reranker for AI-MARRVEL in Rare Disease Diagnosis
- Authors: Jaeyeon Lee , Hyun-Hwan Jeong , Zhandong Liu
- URL: https://arxiv.org/abs/2511.02263
- Abstract:
Diagnosing rare diseases often requires connecting variant-bearing genes to evidence that is written as unstructured clinical prose, which the current established pipelines still leave for clinicians to reconcile manually. To this end, we introduce LA-MARRVEL, a knowledge-grounded and language-aware reranking layer that operates on top of AI-MARRVEL: it supplies expert-engineered context, queries a large language model multiple times, and aggregates the resulting partial rankings with a ranked voting method to produce a stable, explainable gene ranking. Evaluated on three real-world cohorts (BG, DDD, UDN), LA-MARRVEL consistently improves Recall@K over AI-MARRVEL and established phenotype-driven tools such as Exomiser and LIRICAL, with especially large gains on cases where the first-stage ranker placed the causal gene lower. Each ranked gene is accompanied by LLM-generated reasoning that integrates phenotypic, inheritance, and variant-level evidence, thereby making the output more interpretable and facilitating clinical review.
81. Fast Approximation Algorithm for Non-Monotone DR-submodular Maximization under Size Constraint
- Authors: Tan D. Tran , Canh V. Pham
- URL: https://arxiv.org/abs/2511.02254
- Abstract:
This work studies the non-monotone DR-submodular Maximization over a ground set of $n$ subject to a size constraint $k$. We propose two approximation algorithms for solving this problem named FastDrSub and FastDrSub++. FastDrSub offers an approximation ratio of $0.044$ with query complexity of $O(n \log(k))$. The second one, FastDrSub++, improves upon it with a ratio of $1/4-\epsilon$ within query complexity of $(n \log k)$ for an input parameter $\epsilon >0$. Therefore, our proposed algorithms are the first constant-ratio approximation algorithms for the problem with the low complexity of $O(n \log(k))$. Additionally, both algorithms are experimentally evaluated and compared against existing state-of-the-art methods, demonstrating their effectiveness in solving the Revenue Maximization problem with DR-submodular objective function. The experimental results show that our proposed algorithms significantly outperform existing approaches in terms of both query complexity and solution quality.
82. Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
- Authors: Jonathan Liu , Haoling Qiu , Jonathan Lasko , Damianos Karakos , Mahsa Yarmohammadi , Mark Dredze
- URL: https://arxiv.org/abs/2511.02246
- Abstract:
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen’s Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: this https URL .
83. Structural Plasticity as Active Inference: A Biologically-Inspired Architecture for Homeostatic Control
- Authors: Brennen A. Hill
- URL: https://arxiv.org/abs/2511.02241
- Abstract:
Traditional neural networks, while powerful, rely on biologically implausible learning mechanisms such as global backpropagation. This paper introduces the Structurally Adaptive Predictive Inference Network (SAPIN), a novel computational model inspired by the principles of active inference and the morphological plasticity observed in biological neural cultures. SAPIN operates on a 2D grid where processing units, or cells, learn by minimizing local prediction errors. The model features two primary, concurrent learning mechanisms: a local, Hebbian-like synaptic plasticity rule based on the temporal difference between a cell’s actual activation and its learned expectation, and a structural plasticity mechanism where cells physically migrate across the grid to optimize their information-receptive fields. This dual approach allows the network to learn both how to process information (synaptic weights) and also where to position its computational resources (network topology). We validated the SAPIN model on the classic Cart Pole reinforcement learning benchmark. Our results demonstrate that the architecture can successfully solve the CartPole task, achieving robust performance. The network’s intrinsic drive to minimize prediction error and maintain homeostasis was sufficient to discover a stable balancing policy. We also found that while continual learning led to instability, locking the network’s parameters after achieving success resulted in a stable policy. When evaluated for 100 episodes post-locking (repeated over 100 successful agents), the locked networks maintained an average 82% success rate.
84. LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
- Authors: Youngjin Hong , Houjian Yu , Mingen Li , Changhyun Choi
- URL: https://arxiv.org/abs/2511.02239
- Abstract:
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: this https URL
85. Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
- Authors: Hanchen Li , Qiuyang Mang , Runyuan He , Qizheng Zhang , Huanzhi Mao , Xiaokun Chen , Alvin Cheung , Joseph Gonzalez , Ion Stoica
- URL: https://arxiv.org/abs/2511.02230
- Abstract:
Agentic LLM applications interleave LLM generation requests with tool calls. These tool calls break the continuity of the workflow by creating pauses between LLM requests, bringing many challenges for the serving system, especially under multi-turn scenarios. Each pause potentially causes KV cache eviction and extra waiting time before entering the continuous batch for the following LLM request. Since these pauses happen for each call, this problem becomes increasingly severe as turn number grow for agentic programs. Previous works either fail to incorporate information from the tool call, evicting KV cache that leads to repetitive prefill or loading, or ignore the continuity of a multi-turn program, creating waiting time between turns that increases per-request latency. We present Continuum, a serving system to optimize job completion time for multi-turn agent workloads by combining tool-aware KV cache timeout with program-level scheduling. By predicting tool call durations in agentic workflows, Continuum selectively pins the KV cache in GPU memory with a time-to-live value based on total turn number. When combined with program-level first-come-first-serve, Continuum prevents scheduling bubbles, preserves multi-turn continuity, and optimizes for throughput for complex agentic workflows. By modeling the variability of tool call and agent program continuity, Continuum outperforms state-of-the-art baselines. Our evaluation on real-world agentic workloads (SWE-Bench and BFCL) with Llama-3.1 8B/70B models shows that Continuum significantly improves the average job completion times, and remains performant across different hardware setups and DRAM offloading schemes. Preview code is available at: this https URL
86. Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer’s Disease Diagnosis
- Authors: Delin Ma , Menghui Zhou , Jun Qi , Yun Yang , Po Yang
- URL: https://arxiv.org/abs/2511.02228
- Abstract:
Alzheimer’s disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.
87. Optimizing Multi-Lane Intersection Performance in Mixed Autonomy Environments
- Authors: Manonmani Sekar , Nasim Nezamoddini
- URL: https://arxiv.org/abs/2511.02217
- Abstract:
One of the main challenges in managing traffic at multilane intersections is ensuring smooth coordination between human-driven vehicles (HDVs) and connected autonomous vehicles (CAVs). This paper presents a novel traffic signal control framework that combines Graph Attention Networks (GAT) with Soft Actor-Critic (SAC) reinforcement learning to address this challenge. GATs are used to model the dynamic graph- structured nature of traffic flow to capture spatial and temporal dependencies between lanes and signal phases. The proposed SAC is a robust off-policy reinforcement learning algorithm that enables adaptive signal control through entropy-optimized decision making. This design allows the system to coordinate the signal timing and vehicle movement simultaneously with objectives focused on minimizing travel time, enhancing performance, ensuring safety, and improving fairness between HDVs and CAVs. The model is evaluated using a SUMO-based simulation of a four-way intersection and incorporating different traffic densities and CAV penetration rates. The experimental results demonstrate the effectiveness of the GAT-SAC approach by achieving a 24.1% reduction in average delay and up to 29.2% fewer traffic violations compared to traditional methods. Additionally, the fairness ratio between HDVs and CAVs improved to 1.59, indicating more equitable treatment across vehicle types. These findings suggest that the GAT-SAC framework holds significant promise for real-world deployment in mixed-autonomy traffic systems.
88. Adaptive Cooperative Transmission Design for Ultra-Reliable Low-Latency Communications via Deep Reinforcement Learning
- Authors: Hyemin Yu , Hong-Chuan Yang
- URL: https://arxiv.org/abs/2511.02216
- Abstract:
Next-generation wireless communication systems must support ultra-reliable low-latency communication (URLLC) service for mission-critical applications. Meeting stringent URLLC requirements is challenging, especially for two-hop cooperative communication. In this paper, we develop an adaptive transmission design for a two-hop relaying communication system. Each hop transmission adaptively configures its transmission parameters separately, including numerology, mini-slot size, and modulation and coding scheme, for reliable packet transmission within a strict latency constraint. We formulate the hop-specific transceiver configuration as a Markov decision process (MDP) and propose a dual-agent reinforcement learning-based cooperative latency-aware transmission (DRL-CoLA) algorithm to learn latency-aware transmission policies in a distributed manner. Simulation results verify that the proposed algorithm achieves the near-optimal reliability while satisfying strict latency requirements.
89. Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
- Authors: Anders Austlid Taskén , Thierry Judge , Erik Andreas Rye Berg , Jinyang Yu , Bjørnar Grenne , Frank Lindseth , Svend Aakhus , Pierre-Marc Jodoin , Nicolas Duchateau , Olivier Bernard , Gabriel Kiss
- URL: https://arxiv.org/abs/2511.02210
- Abstract:
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
90. Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
- Authors: Jiajia Li , Keyi Zhu , Qianwen Zhang , Dong Chen , Qi Sun , Zhaojian Li
- URL: https://arxiv.org/abs/2511.02207
- Abstract:
Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.
91. Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs
- Authors: Shufan Wang , Xing Hu , Junkai Chen , Zhiyuan Pan , Xin Xia
- URL: https://arxiv.org/abs/2511.02197
- Abstract:
With the widespread application of large language models (LLMs) in the field of code intelligence, increasing attention has been paid to the reliability and controllability of their outputs in code reasoning tasks. Confidence estimation serves as an effective and convenient approach for evaluating these aspects. This paper proposes a confidence analysis and enhancement framework for LLMs tailored to code reasoning tasks. We conduct a comprehensive empirical study on the confidence reliability of mainstream LLMs across different tasks, and further evaluate the effectiveness of techniques such as prompt strategy optimisation and mathematical calibration (e.g., Platt Scaling) in improving confidence reliability. Our results show that DeepSeek-Reasoner achieves the best performance across various tasks, outperforming other models by up to $0.680$, $0.636$, and $13.652$ in terms of ECE, Brier Score, and Performance Score, respectively. The hybrid strategy combining the reassess prompt strategy and Platt Scaling achieves improvements of up to $0.541$, $0.628$, and $15.084$ over the original performance in the aforementioned three metrics. These results indicate that models with reasoning capabilities demonstrate superior confidence reliability, and that the hybrid strategy is the most effective in enhancing the confidence reliability of various models. Meanwhile, we elucidate the impact of different task complexities, model scales, and strategies on confidence performance, and highlight that the confidence of current LLMs in complex reasoning tasks still has considerable room for improvement. This study not only provides a research foundation and technical reference for the application of confidence in LLM-assisted software engineering, but also points the way for future optimisation and engineering deployment of confidence mechanisms.
92. BoolSkeleton: Boolean Network Skeletonization via Homogeneous Pattern Reduction
- Authors: Liwei Ni , Jiaxi Zhang , Shenggen Zheng , Junfeng Liu , Xingyu Meng , Biwei Xie , Xingquan Li , Huawei Li
- URL: https://arxiv.org/abs/2511.02196
- Abstract:
Boolean equivalence allows Boolean networks with identical functionality to exhibit diverse graph structures. This gives more room for exploration in logic optimization, while also posing a challenge for tasks involving consistency between Boolean networks. To tackle this challenge, we introduce BoolSkeleton, a novel Boolean network skeletonization method that improves the consistency and reliability of design-specific evaluations. BoolSkeleton comprises two key steps: preprocessing and reduction. In preprocessing, the Boolean network is transformed into a defined Boolean dependency graph, where nodes are assigned the functionality-related status. Next, the homogeneous and heterogeneous patterns are defined for the node-level pattern reduction step. Heterogeneous patterns are preserved to maintain critical functionality-related dependencies, while homogeneous patterns can be reduced. Parameter K of the pattern further constrains the fanin size of these patterns, enabling fine-tuned control over the granularity of graph reduction. To validate BoolSkeleton’s effectiveness, we conducted four analysis/downstream tasks around the Boolean network: compression analysis, classification, critical path analysis, and timing prediction, demonstrating its robustness across diverse scenarios. Furthermore, it improves above 55% in the average accuracy compared to the original Boolean network for the timing prediction task. These experiments underscore the potential of BoolSkeleton to enhance design consistency in logic synthesis.
93. MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
- Authors: Jiawen Liu , Yuanbo Zeng , Jiaming Liang , Yizhen Yang , Yiheng Zhang , Enhui Cai , Xiaoqi Sheng , Hongmin Cai
- URL: https://arxiv.org/abs/2511.02193
- Abstract:
Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64 $\%$ on DRIVE and 1.25 $\%$ on STARE, demonstrating its effectiveness and advancement. The project code is public via this https URL .
94. Tackling Incomplete Data in Air Quality Prediction: A Bayesian Deep Learning Framework for Uncertainty Quantification
- Authors: Yuzhuang Pian , Taiyu Wang , Shiqi Zhang , Rui Xu , Yonghong Liu
- URL: https://arxiv.org/abs/2511.02175
- Abstract:
Accurate air quality forecasts are vital for public health alerts, exposure assessment, and emissions control. In practice, observational data are often missing in varying proportions and patterns due to collection and transmission issues. These incomplete spatiotemporal records impede reliable inference and risk assessment and can lead to overconfident extrapolation. To address these challenges, we propose an end to end framework, the channel gated learning unit based spatiotemporal bayesian neural field (CGLUBNF). It uses Fourier features with a graph attention encoder to capture multiscale spatial dependencies and seasonal temporal dynamics. A channel gated learning unit, equipped with learnable activations and gated residual connections, adaptively filters and amplifies informative features. Bayesian inference jointly optimizes predictive distributions and parameter uncertainty, producing point estimates and calibrated prediction intervals. We conduct a systematic evaluation on two real world datasets, covering four typical missing data patterns and comparing against five state of the art baselines. CGLUBNF achieves superior prediction accuracy and sharper confidence intervals. In addition, we further validate robustness across multiple prediction horizons and analysis the contribution of extraneous variables. This research lays a foundation for reliable deep learning based spatio-temporal forecasting with incomplete observations in emerging sensing paradigms, such as real world vehicle borne mobile monitoring.
95. ScenicProver: A Framework for Compositional Probabilistic Verification of Learning-Enabled Systems
- Authors: Eric Vin , Kyle A. Miller , Inigo Incer , Sanjit A. Seshia , Daniel J. Fremont
- URL: https://arxiv.org/abs/2511.02164
- Abstract:
Full verification of learning-enabled cyber-physical systems (CPS) has long been intractable due to challenges including black-box components and complex real-world environments. Existing tools either provide formal guarantees for limited types of systems or test the system as a monolith, but no general framework exists for compositional analysis of learning-enabled CPS using varied verification techniques over complex real-world environments. This paper introduces ScenicProver, a verification framework that aims to fill this gap. Built upon the Scenic probabilistic programming language, the framework supports: (1) compositional system description with clear component interfaces, ranging from interpretable code to black boxes; (2) assume-guarantee contracts over those components using an extension of Linear Temporal Logic containing arbitrary Scenic expressions; (3) evidence generation through testing, formal proofs via Lean 4 integration, and importing external assumptions; (4) systematic combination of generated evidence using contract operators; and (5) automatic generation of assurance cases tracking the provenance of system-level guarantees. We demonstrate the framework’s effectiveness through a case study on an autonomous vehicle’s automatic emergency braking system with sensor fusion. By leveraging manufacturer guarantees for radar and laser sensors and focusing testing efforts on uncertain conditions, our approach enables stronger probabilistic guarantees than monolithic testing with the same computational budget.
96. Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models
- Authors: Alexander Htet Kyaw , Richa Gupta , Dhruv Shah , Anoop Sinha , Kory Mathewson , Stefanie Pender , Sachin Chitta , Yotto Koga , Faez Ahmed , Lawrence Sass , Randall Davis
- URL: https://arxiv.org/abs/2511.02162
- Abstract:
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
97. Near Optimal Convergence to Coarse Correlated Equilibrium in General-Sum Markov Games
- Authors: Asrin Efe Yorulmaz , Tamer Başar
- URL: https://arxiv.org/abs/2511.02157
- Abstract:
No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium for concepts such as Coarse Correlated Equilibrium (CCE) or Correlated Equilibrium (CE). In this work, we improve the convergence rate to CCE in general-sum Markov games, reducing it from the previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper $\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE in terms of $T$, number of iterations, while also improving the dependence on the action set size from polynomial to polylogarithmic-yielding exponential gains in high-dimensional settings. Our approach builds on recent advances in adaptive step-size techniques for no-regret algorithms in normal-form games, and extends them to the Markovian setting via a stage-wise scheme that adjusts learning rates based on real-time feedback. We frame policy updates as an instance of Optimistic Follow-the-Regularized-Leader (OFTRL), customized for value-iteration-based learning. The resulting self-play algorithm achieves, to our knowledge, the fastest known convergence rate to CCE in Markov games.
98. Disentangling Causal Substructures for Interpretable and Generalizable Drug Synergy Prediction
- Authors: Yi Luo , Haochen Zhao , Xiao Liang , Yiwei Liu , Yuye Zhang , Xinyu Li , Jianxin Wang
- URL: https://arxiv.org/abs/2511.02146
- Abstract:
Drug synergy prediction is a critical task in the development of effective combination therapies for complex diseases, including cancer. Although existing methods have shown promising results, they often operate as black-box predictors that rely predominantly on statistical correlations between drug characteristics and results. To address this limitation, we propose CausalDDS, a novel framework that disentangles drug molecules into causal and spurious substructures, utilizing the causal substructure representations for predicting drug synergy. By focusing on causal sub-structures, CausalDDS effectively mitigates the impact of redundant features introduced by spurious substructures, enhancing the accuracy and interpretability of the model. In addition, CausalDDS employs a conditional intervention mechanism, where interventions are conditioned on paired molecular structures, and introduces a novel optimization objective guided by the principles of sufficiency and independence. Extensive experiments demonstrate that our method outperforms baseline models, particularly in cold start and out-of-distribution settings. Besides, CausalDDS effectively identifies key substructures underlying drug synergy, providing clear insights into how drug combinations work at the molecular level. These results underscore the potential of CausalDDS as a practical tool for predicting drug synergy and facilitating drug discovery.
99. Matrix Sensing with Kernel Optimal Loss: Robustness and Optimization Landscape
- Authors: Xinyuan Song , Jiaye Teng , Ziye Ma
- URL: https://arxiv.org/abs/2511.02122
- Abstract:
In this paper we study how the choice of loss functions of non-convex optimization problems affects their robustness and optimization landscape, through the study of noisy matrix sensing. In traditional regression tasks, mean squared error (MSE) loss is a common choice, but it can be unreliable for non-Gaussian or heavy-tailed noise. To address this issue, we adopt a robust loss based on nonparametric regression, which uses a kernel-based estimate of the residual density and maximizes the estimated log-likelihood. This robust formulation coincides with the MSE loss under Gaussian errors but remains stable under more general settings. We further examine how this robust loss reshapes the optimization landscape by analyzing the upper-bound of restricted isometry property (RIP) constants for spurious local minima to disappear. Through theoretical and empirical analysis, we show that this new loss excels at handling large noise and remains robust across diverse noise distributions. This work offers initial insights into enhancing the robustness of machine learning tasks through simply changing the loss, guided by an intuitive and broadly applicable analytical framework.
100. Metamorphic Testing of Large Language Models for Natural Language Processing
- Authors: Steven Cho , Stefano Ruberto , Valerio Terragni
- URL: https://arxiv.org/abs/2511.02108
- Abstract:
Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limited availability of labeled datasets, which necessitates an oracle to determine the correctness of LLM behaviors. Metamorphic testing (MT) is a popular testing approach that alleviates this oracle problem. At the core of MT are metamorphic relations (MRs), which define relationships between the outputs of related inputs. MT can expose faulty behaviors without the need for explicit oracles (e.g., labeled datasets). This paper presents the most comprehensive study of MT for LLMs to date. We conducted a literature review and collected 191 MRs for NLP tasks. We implemented a representative subset (36 MRs) to conduct a series of experiments with three popular LLMs, running approximately 560,000 metamorphic tests. The results shed light on the capabilities and opportunities of MT for LLMs, as well as its limitations.
101. Geometric Data Valuation via Leverage Scores
- Authors: Rodrigo Mendoza-Smith
- URL: https://arxiv.org/abs/2511.02100
- Abstract:
Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint’s structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to \emph{ridge leverage scores} yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We further show that training on a leverage-sampled subset produces a model whose parameters and predictive risk are within $O(\varepsilon)$ of the full-data optimum, thereby providing a rigorous link between data valuation and downstream decision quality. Finally, we conduct an active learning experiment in which we empirically demonstrate that ridge-leverage sampling outperforms standard baselines without requiring access gradients or backward passes.
102. Uncertainty Guided Online Ensemble for Non-stationary Data Streams in Fusion Science
- Authors: Kishansingh Rajput , Malachi Schram , Brian Sammuli , Sen Lin
- URL: https://arxiv.org/abs/2511.02092
- Abstract:
Machine Learning (ML) is poised to play a pivotal role in the development and operation of next-generation fusion devices. Fusion data shows non-stationary behavior with distribution drifts, resulted by both experimental evolution and machine wear-and-tear. ML models assume stationary distribution and fail to maintain performance when encountered with such non-stationary data streams. Online learning techniques have been leveraged in other domains, however it has been largely unexplored for fusion applications. In this paper, we present an application of online learning to continuously adapt to drifting data stream for prediction of Toroidal Field (TF) coils deflection at the DIII-D fusion facility. The results demonstrate that online learning is critical to maintain ML model performance and reduces error by 80% compared to a static model. Moreover, traditional online learning can suffer from short-term performance degradation as ground truth is not available before making the predictions. As such, we propose an uncertainty guided online ensemble method to further improve the performance. The Deep Gaussian Process Approximation (DGPA) technique is leveraged for calibrated uncertainty estimation and the uncertainty values are then used to guide a meta-algorithm that produces predictions based on an ensemble of learners trained on different horizon of historical data. The DGPA also provides uncertainty estimation along with the predictions for decision makers. The online ensemble and the proposed uncertainty guided online ensemble reduces predictions error by about 6%, and 10% respectively over standard single model based online learning.
103. Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling
- Authors: Lancelot Da Costa , Sanjeev Namjoshi , Mohammed Abbas Ansari , Bernhard Schölkopf
- URL: https://arxiv.org/abs/2511.02091
- Abstract:
The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling–which become partially-observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision-making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive to neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure-parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.
104. Energy Loss Functions for Physical Systems
- Authors: Sékou-Oumar Kaba , Kusha Sareen , Daniel Levy , Siamak Ravanbakhsh
- URL: https://arxiv.org/abs/2511.02087
- Abstract:
Effectively leveraging prior knowledge of a system’s physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions. This perspective also recasts traditional objectives like MSE as energy-based, but with a physically meaningless energy. In contrast, our formulation yields physically grounded loss functions with gradients that better align with valid configurations, while being architecture-agnostic and computationally efficient. The energy loss functions also inherently respect physical symmetries. We demonstrate our approach on molecular generation and spin ground-state prediction and report significant improvements over baselines.
105. Watermarking Discrete Diffusion Language Models
- Authors: Avi Bagchi , Akhil Bhimaraju , Moulik Choraria , Daniel Alabi , Lav R. Varshney
- URL: https://arxiv.org/abs/2511.02083
- Abstract:
Watermarking has emerged as a promising technique to track AI-generated content and differentiate it from authentic human creations. While prior work extensively studies watermarking for autoregressive large language models (LLMs) and image diffusion models, none address discrete diffusion language models, which are becoming popular due to their high inference throughput. In this paper, we introduce the first watermarking method for discrete diffusion models by applying the distribution-preserving Gumbel-max trick at every diffusion step and seeding the randomness with the sequence index to enable reliable detection. We experimentally demonstrate that our scheme is reliably detectable on state-of-the-art diffusion language models and analytically prove that it is distortion-free with an exponentially decaying probability of false detection in the token sequence length.
106. Vortex: Hosting ML Inference and Knowledge Retrieval Services With Tight Latency and Throughput Requirements
- Authors: Yuting Yang , Tiancheng Yuan , Jamal Hashim , Thiago Garrett , Jeffrey Qian , Ann Zhang , Yifan Wang , Weijia Song , Ken Birman
- URL: https://arxiv.org/abs/2511.02062
- Abstract:
There is growing interest in deploying ML inference and knowledge retrieval as services that could support both interactive queries by end users and more demanding request flows that arise from AIs integrated into a end-user applications and deployed as agents. Our central premise is that these latter cases will bring service level latency objectives (SLOs). Existing ML serving platforms use batching to optimize for high throughput, exposing them to unpredictable tail latencies. Vortex enables an SLO-first approach. For identical tasks, Vortex’s pipelines achieve significantly lower and more stable latencies than TorchServe and Ray Serve over a wide range of workloads, often enabling a given SLO target at more than twice the request rate. When RDMA is available, the Vortex advantage is even more significant.
107. Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
- Authors: Soham Joshi , Shwet Kamal Mishra , Viswanath Gopalakrishnan
- URL: https://arxiv.org/abs/2511.02046
- Abstract:
Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
108. Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning
- Authors: Vivswan Shah , Randy Cogill , Hanwei Yue , Gopinath Chennupati , Rinat Khaziev
- URL: https://arxiv.org/abs/2511.02044
- Abstract:
Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset, task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.
109. Quantum-Enhanced Generative Models for Rare Event Prediction
- Authors: M.Z. Haider , M.U. Ghouri , Tayyaba Noreen , M. Salman
- URL: https://arxiv.org/abs/2511.02042
- Abstract:
Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy-tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low-probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum-Enhanced Generative Model (QEGM), a hybrid classical-quantum framework that integrates deep latent-variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail-aware likelihood, and (2) quantum randomness-driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter-shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real-world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare-event prediction, offering robustness beyond what is achievable with purely classical methods.
110. RobustFSM: Submodular Maximization in Federated Setting with Malicious Clients
- Authors: Duc A. Tran , Dung Truong , Duy Le
- URL: https://arxiv.org/abs/2511.02029
- Abstract:
Submodular maximization is an optimization problem benefiting many machine learning applications, where we seek a small subset best representing an extremely large dataset. We focus on the federated setting where the data are locally owned by decentralized clients who have their own definitions for the quality of representability. This setting requires repetitive aggregation of local information computed by the clients. While the main motivation is to respect the privacy and autonomy of the clients, the federated setting is vulnerable to client misbehaviors: malicious clients might share fake information. An analogy is backdoor attack in conventional federated learning, but our challenge differs freshly due to the unique characteristics of submodular maximization. We propose RobustFSM, a federated submodular maximization solution that is robust to various practical client attacks. Its performance is substantiated with an empirical evaluation study using real-world datasets. Numerical results show that the solution quality of RobustFSM substantially exceeds that of the conventional federated algorithm when attacks are severe. The degree of this improvement depends on the dataset and attack scenarios, which can be as high as 200%
111. Path-Coordinated Continual Learning with Neural Tangent Kernel-Justified Plasticity: A Theoretical Framework with Near State-of-the-Art Performance
- Authors: Rathin Chandra Shit
- URL: https://arxiv.org/abs/2511.02025
- Abstract:
Catastrophic forgetting is one of the fundamental issues of continual learning because neural networks forget the tasks learned previously when trained on new tasks. The proposed framework is a new path-coordinated framework of continual learning that unites the Neural Tangent Kernel (NTK) theory of principled plasticity bounds, statistical validation by Wilson confidence intervals, and evaluation of path quality by the use of multiple metrics. Experimental evaluation shows an average accuracy of 66.7% at the cost of 23.4% catastrophic forgetting on Split-CIFAR10, a huge improvement over the baseline and competitive performance achieved, which is very close to state-of-the-art results. Further, it is found out that NTK condition numbers are predictive indicators of learning capacity limits, showing the existence of a critical threshold at condition number $>10^{11}$. It is interesting to note that the proposed strategy shows a tendency of lowering forgetting as the sequence of tasks progresses (27% to 18%), which is a system stabilization. The framework validates 80% of discovered paths with a rigorous statistical guarantee and maintains 90-97% retention on intermediate tasks. The core capacity limits of the continual learning environment are determined in the analysis, and actionable insights to enhance the adaptive regularization are offered.
112. Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior
- Authors: Daniel Aarao Reis Arturi , Eric Zhang , Andrew Ansah , Kevin Zhu , Ashwinee Panda , Aishwarya Balwani
- URL: https://arxiv.org/abs/2511.02022
- Abstract:
Recent work has discovered that large language models can develop broadly misaligned behaviors after being fine-tuned on narrowly harmful datasets, a phenomenon known as emergent misalignment (EM). However, the fundamental mechanisms enabling such harmful generalization across disparate domains remain poorly understood. In this work, we adopt a geometric perspective to study EM and demonstrate that it exhibits a fundamental cross-task linear structure in how harmful behavior is encoded across different datasets. Specifically, we find a strong convergence in EM parameters across tasks, with the fine-tuned weight updates showing relatively high cosine similarities, as well as shared lower-dimensional subspaces as measured by their principal angles and projection overlaps. Furthermore, we also show functional equivalence via linear mode connectivity, wherein interpolated models across narrow misalignment tasks maintain coherent, broadly misaligned behavior. Our results indicate that EM arises from different narrow tasks discovering the same set of shared parameter directions, suggesting that harmful behaviors may be organized into specific, predictable regions of the weight landscape. By revealing this fundamental connection between parametric geometry and behavioral outcomes, we hope our work catalyzes further research on parameter space interpretability and weight-based interventions.
113. InteracSPARQL: An Interactive System for SPARQL Query Refinement Using Natural Language Explanations
- Authors: Xiangru Jian , Zhengyuan Dong , M. Tamer Özsu
- URL: https://arxiv.org/abs/2511.02002
- Abstract:
In recent years, querying semantic web data using SPARQL has remained challenging, especially for non-expert users, due to the language’s complex syntax and the prerequisite of understanding intricate data structures. To address these challenges, we propose InteracSPARQL, an interactive SPARQL query generation and refinement system that leverages natural language explanations (NLEs) to enhance user comprehension and facilitate iterative query refinement. InteracSPARQL integrates LLMs with a rule-based approach to first produce structured explanations directly from SPARQL abstract syntax trees (ASTs), followed by LLM-based linguistic refinements. Users can interactively refine queries through direct feedback or LLM-driven self-refinement, enabling the correction of ambiguous or incorrect query components in real time. We evaluate InteracSPARQL on standard benchmarks, demonstrating significant improvements in query accuracy, explanation clarity, and overall user satisfaction compared to baseline approaches. Our experiments further highlight the effectiveness of combining rule-based methods with LLM-driven refinements to create more accessible and robust SPARQL interfaces.
114. TRACE: Textual Reasoning for Affordance Coordinate Extraction
- Authors: Sangyun Park , Jin Kim , Yuchen Cui , Matthew S. Brown
- URL: https://arxiv.org/abs/2511.01999
- Abstract:
Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR’s effectiveness. Furthermore, analysis of the model’s attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at this https URL
115. Vibe Learning: Education in the age of AI
- Authors: Marcos Florencio , Francielle Prieto
- URL: https://arxiv.org/abs/2511.01956
- Abstract:
The debate over whether “thinking machines” could replace human intellectual labor has existed in both public and expert discussions since the mid-twentieth century, when the concept and terminology of Artificial Intelligence (AI) first emerged. For decades, this idea remained largely theoretical. However, with the recent advent of Generative AI - particularly Large Language Models (LLMs) - and the widespread adoption of tools such as ChatGPT, the issue has become a practical reality. Many fields that rely on human intellectual effort are now being reshaped by AI tools that both expand human capabilities and challenge the necessity of certain forms of work once deemed uniquely human but now easily automated. Education, somewhat unexpectedly, faces a pivotal responsibility: to devise long-term strategies for cultivating human skills that will remain relevant in an era of pervasive AI in the intellectual domain. In this context, we identify the limitations of current AI systems - especially those rooted in LLM technology - argue that the fundamental causes of these weaknesses cannot be resolved through existing methods, and propose directions within the constructivist paradigm for transforming education to preserve the long-term advantages of human intelligence over AI tools.
116. Black-Box Membership Inference Attack for LVLMs via Prior Knowledge-Calibrated Memory Probing
- Authors: Jinhua Yin , Peiru Yang , Chen Yang , Huili Wang , Zhiyang Hu , Shangguang Wang , Yongfeng Huang , Tao Qi
- URL: https://arxiv.org/abs/2511.01952
- Abstract:
Large vision-language models (LVLMs) derive their capabilities from extensive training on vast corpora of visual and textual data. Empowered by large-scale parameters, these models often exhibit strong memorization of their training data, rendering them susceptible to membership inference attacks (MIAs). Existing MIA methods for LVLMs typically operate under white- or gray-box assumptions, by extracting likelihood-based features for the suspected data samples based on the target LVLMs. However, mainstream LVLMs generally only expose generated outputs while concealing internal computational features during inference, limiting the applicability of these methods. In this work, we propose the first black-box MIA framework for LVLMs, based on a prior knowledge-calibrated memory probing mechanism. The core idea is to assess the model memorization of the private semantic information embedded within the suspected image data, which is unlikely to be inferred from general world knowledge alone. We conducted extensive experiments across four LVLMs and three datasets. Empirical results demonstrate that our method effectively identifies training data of LVLMs in a purely black-box setting and even achieves performance comparable to gray-box and white-box methods. Further analysis reveals the robustness of our method against potential adversarial manipulations, and the effectiveness of the methodology designs. Our code and data are available at this https URL .
117. Interpretable Heart Disease Prediction via a Weighted Ensemble Model: A Large-Scale Study with SHAP and Surrogate Decision Trees
- Authors: Md Abrar Hasnat , Md Jobayer , Md. Mehedi Hasan Shawon , Md. Golam Rabiul Alam
- URL: https://arxiv.org/abs/2511.01947
- Abstract:
Cardiovascular disease (CVD) remains a critical global health concern, demanding reliable and interpretable predictive models for early risk assessment. This study presents a large-scale analysis using the Heart Disease Health Indicators Dataset, developing a strategically weighted ensemble model that combines tree-based methods (LightGBM, XGBoost) with a Convolutional Neural Network (CNN) to predict CVD risk. The model was trained on a preprocessed dataset of 229,781 patients where the inherent class imbalance was managed through strategic weighting and feature engineering enhanced the original 22 features to 25. The final ensemble achieves a statistically significant improvement over the best individual model, with a Test AUC of 0.8371 (p=0.003) and is particularly suited for screening with a high recall of 80.0%. To provide transparency and clinical interpretability, surrogate decision trees and SHapley Additive exPlanations (SHAP) are used. The proposed model delivers a combination of robust predictive performance and clinical transparency by blending diverse learning architectures and incorporating explainability through SHAP and surrogate decision trees, making it a strong candidate for real-world deployment in public health screening.
118. COFAP: A Universal Framework for COFs Adsorption Prediction through Designed Multi-Modal Extraction and Cross-Modal Synergy
- Authors: Zihan Li , Mingyang Wan , Mingyu Gao , Zhongshan Chen , Xiangke Wang , Feifan Zhang
- URL: https://arxiv.org/abs/2511.01946
- Abstract:
Covalent organic frameworks (COFs) are promising adsorbents for gas adsorption and separation, while identifying the optimal structures among their vast design space requires efficient high-throughput screening. Conventional machine-learning predictors rely heavily on specific gas-related features. However, these features are time-consuming and limit scalability, leading to inefficiency and labor-intensive processes. Herein, a universal COFs adsorption prediction framework (COFAP) is proposed, which can extract multi-modal structural and chemical features through deep learning, and fuse these complementary features via cross-modal attention mechanism. Without Henry coefficients or adsorption heat, COFAP sets a new SOTA by outperforming previous approaches on hypoCOFs dataset. Based on COFAP, we also found that high-performing COFs for separation concentrate within a narrow range of pore size and surface area. A weight-adjustable prioritization scheme is also developed to enable flexible, application-specific ranking of candidate COFs for researchers. Superior efficiency and accuracy render COFAP directly deployable in crystalline porous materials.
119. Detecting Vulnerabilities from Issue Reports for Internet-of-Things
- Authors: Sogol Masoumzadeh
- URL: https://arxiv.org/abs/2511.01941
- Abstract:
Timely identification of issue reports reflecting software vulnerabilities is crucial, particularly for Internet-of-Things (IoT) where analysis is slower than non-IoT systems. While Machine Learning (ML) and Large Language Models (LLMs) detect vulnerability-indicating issues in non-IoT systems, their IoT use remains unexplored. We are the first to tackle this problem by proposing two approaches: (1) combining ML and LLMs with Natural Language Processing (NLP) techniques to detect vulnerability-indicating issues of 21 Eclipse IoT projects and (2) fine-tuning a pre-trained BERT Masked Language Model (MLM) on 11,000 GitHub issues for classifying \vul. Our best performance belongs to a Support Vector Machine (SVM) trained on BERT NLP features, achieving an Area Under the receiver operator characteristic Curve (AUC) of 0.65. The fine-tuned BERT achieves 0.26 accuracy, emphasizing the importance of exposing all data during training. Our contributions set the stage for accurately detecting IoT vulnerabilities from issue reports, similar to non-IoT systems.
120. The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold
- Authors: Tiberiu Musat
- URL: https://arxiv.org/abs/2511.01938
- Abstract:
Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
121. Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
- Authors: Abdelaziz Bounhar , Hadi Abdine , Evan Dufraisse , Ahmad Chamma , Amr Mohamed , Dani Bouch , Michalis Vazirgiannis , Guokan Shang
- URL: https://arxiv.org/abs/2511.01937
- Abstract:
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out
easy'' problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a \textbf{model that conflatesthinking longer’’ with ``thinking better’’}. In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is \textbf{\emph{emergent brevity for free}}: the model learns to solve harder problems without inflating the output length, \textbf{ despite the absence of any explicit length penalization}. RLVR experiments using this approach on \textit{Qwen3-4B-Thinking-2507} (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available at \href{ this https URL }{GitHub}, with datasets and models on \href{ this https URL }{Hugging Face}.
122. Q-Sat AI: Machine Learning-Based Decision Support for Data Saturation in Qualitative Studies
- Authors: Hasan Tutar , Caner Erden , Ümit Şentürk
- URL: https://arxiv.org/abs/2511.01935
- Abstract:
The determination of sample size in qualitative research has traditionally relied on the subjective and often ambiguous principle of data saturation, which can lead to inconsistencies and threaten methodological rigor. This study introduces a new, systematic model based on machine learning (ML) to make this process more objective. Utilizing a dataset derived from five fundamental qualitative research approaches - namely, Case Study, Grounded Theory, Phenomenology, Narrative Research, and Ethnographic Research - we developed an ensemble learning model. Ten critical parameters, including research scope, information power, and researcher competence, were evaluated using an ordinal scale and used as input features. After thorough preprocessing and outlier removal, multiple ML algorithms were trained and compared. The K-Nearest Neighbors (KNN), Gradient Boosting (GB), Random Forest (RF), XGBoost, and Decision Tree (DT) algorithms showed the highest explanatory power (Test R2 ~ 0.85), effectively modeling the complex, non-linear relationships involved in qualitative sampling decisions. Feature importance analysis confirmed the vital roles of research design type and information power, providing quantitative validation of key theoretical assumptions in qualitative methodology. The study concludes by proposing a conceptual framework for a web-based computational application designed to serve as a decision support system for qualitative researchers, journal reviewers, and thesis advisors. This model represents a significant step toward standardizing sample size justification, enhancing transparency, and strengthening the epistemological foundation of qualitative inquiry through evidence-based, systematic decision-making.
123. Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch
- Authors: Yirong Zeng , Xiao Ding , Yutai Hou , Yuxian Wang , Li Du , Juyi Dai , Qiuyang Ding , Duyu Tang , Dandan Tu , Weiwen Liu , Bing Qin , Ting Liu
- URL: https://arxiv.org/abs/2511.01934
- Abstract:
Training tool-augmented LLMs has emerged as a promising approach to enhancing language models’ capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, reinforcement learning (RL) paradigm can endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: Can the pure RL be used to effectively elicit a model’s intrinsic reasoning capabilities and enhance the tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series models. These models are trained to enable LLMs to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings. These gains are consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our methods.
124. Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models
- Authors: Haoming Wang , Wei Gao
- URL: https://arxiv.org/abs/2511.01932
- Abstract:
Image generation models are usually personalized in practical uses in order to better meet the individual users’ heterogeneous needs, but most personalized models lack explainability about how they are being personalized. Such explainability can be provided via visual features in generated images, but is difficult for human users to understand. Explainability in natural language is a better choice, but the existing approaches to explainability in natural language are limited to be coarse-grained. They are unable to precisely identify the multiple aspects of personalization, as well as the varying levels of personalization in each aspect. To address such limitation, in this paper we present a new technique, namely \textbf{FineXL}, towards \textbf{Fine}-grained e\textbf{X}plainability in natural \textbf{L}anguage for personalized image generation models. FineXL can provide natural language descriptions about each distinct aspect of personalization, along with quantitative scores indicating the level of each aspect of personalization. Experiment results show that FineXL can improve the accuracy of explainability by 56\%, when different personalization scenarios are applied to multiple types of image generation models.
125. Dynamic Population Distribution Aware Human Trajectory Generation with Diffusion Model
- Authors: Qingyue Long , Can Rong , Tong Li , Yong Li
- URL: https://arxiv.org/abs/2511.01929
- Abstract:
Human trajectory data is crucial in urban planning, traffic engineering, and public health. However, directly using real-world trajectory data often faces challenges such as privacy concerns, data acquisition costs, and data quality. A practical solution to these challenges is trajectory generation, a method developed to simulate human mobility behaviors. Existing trajectory generation methods mainly focus on capturing individual movement patterns but often overlook the influence of population distribution on trajectory generation. In reality, dynamic population distribution reflects changes in population density across different regions, significantly impacting individual mobility behavior. Thus, we propose a novel trajectory generation framework based on a diffusion model, which integrates the dynamic population distribution constraints to guide high-fidelity generation outcomes. Specifically, we construct a spatial graph to enhance the spatial correlation of trajectories. Then, we design a dynamic population distribution aware denoising network to capture the spatiotemporal dependencies of human mobility behavior as well as the impact of population distribution in the denoising process. Extensive experiments show that the trajectories generated by our model can resemble real-world trajectories in terms of some critical statistical metrics, outperforming state-of-the-art algorithms by over 54%.
126. A Unified Model for Human Mobility Generation in Natural Disasters
- Authors: Qingyue Long , Huandong Wang , Qi Ryan Wang , Yong Li
- URL: https://arxiv.org/abs/2511.01928
- Abstract:
Human mobility generation in disaster scenarios plays a vital role in resource allocation, emergency response, and rescue coordination. During disasters such as wildfires and hurricanes, human mobility patterns often deviate from their normal states, which makes the task more challenging. However, existing works usually rely on limited data from a single city or specific disaster, significantly restricting the model’s generalization capability in new scenarios. In fact, disasters are highly sudden and unpredictable, and any city may encounter new types of disasters without prior experience. Therefore, we aim to develop a one-for-all model for mobility generation that can generalize to new disaster scenarios. However, building a universal framework faces two key challenges: 1) the diversity of disaster types and 2) the heterogeneity among different cities. In this work, we propose a unified model for human mobility generation in natural disasters (named UniDisMob). To enable cross-disaster generalization, we design physics-informed prompt and physics-guided alignment that leverage the underlying common patterns in mobility changes after different disasters to guide the generation process. To achieve cross-city generalization, we introduce a meta-learning framework that extracts universal patterns across multiple cities through shared parameters and captures city-specific features via private parameters. Extensive experiments across multiple cities and disaster scenarios demonstrate that our method significantly outperforms state-of-the-art baselines, achieving an average performance improvement exceeding 13%.
127. DeepContour: A Hybrid Deep Learning Framework for Accelerating Generalized Eigenvalue Problem Solving via Efficient Contour Design
- Authors: Yeqiu Chen , Ziyan Liu , Hong Wang
- URL: https://arxiv.org/abs/2511.01927
- Abstract:
Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering. As a promising direction, contour integral (CI) methods, such as the CIRR algorithm, offer an efficient and parallelizable framework. However, their performance is critically dependent on the selection of integration contours – improper selection without reliable prior knowledge of eigenvalue distribution can incur significant computational overhead and compromise numerical accuracy. To address this challenge, we propose DeepContour, a novel hybrid framework that integrates a deep learning-based spectral predictor with Kernel Density Estimation for principled contour design. Specifically, DeepContour first employs a Fourier Neural Operator (FNO) to rapidly predict the spectral distribution of a given GEP. Subsequently, Kernel Density Estimation (KDE) is applied to the predicted spectrum to automatically and systematically determine proper integration contours. Finally, these optimized contours guide the CI solver to efficiently find the desired eigenvalues. We demonstrate the effectiveness of our method on diverse challenging scientific problems. In our main experiments, DeepContour accelerates GEP solving across multiple datasets, achieving up to a 5.63$\times$ speedup. By combining the predictive power of deep learning with the numerical rigor of classical solvers, this work pioneers an efficient and robust paradigm for tackling difficult generalized eigenvalue involving matrices of high dimension.
128. Neural Green’s Functions
- Authors: Seungwoo Yoo , Kyeongmin Yeo , Jisung Hwang , Minhyuk Sung
- URL: https://arxiv.org/abs/2511.01924
- Abstract:
We introduce Neural Green’s Function, a neural solution operator for linear partial differential equations (PDEs) whose differential operators admit eigendecompositions. Inspired by Green’s functions, the solution operators of linear PDEs that depend exclusively on the domain geometry, we design Neural Green’s Function to imitate their behavior, achieving superior generalization across diverse irregular geometries and source and boundary functions. Specifically, Neural Green’s Function extracts per-point features from a volumetric point cloud representing the problem domain and uses them to predict a decomposition of the solution operator, which is subsequently applied to evaluate solutions via numerical integration. Unlike recent learning-based solution operators, which often struggle to generalize to unseen source or boundary functions, our framework is, by design, agnostic to the specific functions used during training, enabling robust and efficient generalization. In the steady-state thermal analysis of mechanical part geometries from the MCB dataset, Neural Green’s Function outperforms state-of-the-art neural operators, achieving an average error reduction of 13.9\% across five shape categories, while being up to 350 times faster than a numerical solver that requires computationally expensive meshing.
129. Fibbinary-Based Compression and Quantization for Efficient Neural Radio Receivers
- Authors: Roberta Fiandaca , Manil Dev Gomony
- URL: https://arxiv.org/abs/2511.01921
- Abstract:
Neural receivers have shown outstanding performance compared to the conventional ones but this comes with a high network complexity leading to a heavy computational cost. This poses significant challenges in their deployment on hardware-constrained devices. To address the issue, this paper explores two optimization strategies: quantization and compression. We introduce both uniform and non-uniform quantization such as the Fibonacci Code word Quantization (FCQ). A novel fine-grained approach to the Incremental Network Quantization (INQ) strategy is then proposed to compensate for the losses introduced by the above mentioned quantization techniques. Additionally, we introduce two novel lossless compression algorithms that effectively reduce the memory size by compressing sequences of Fibonacci quantized parameters characterized by a huge redundancy. The quantization technique provides a saving of 45\% and 44\% in the multiplier’s power and area, respectively, and its combination with the compression determines a 63.4\% reduction in memory footprint, while still providing higher performances than a conventional receiver.
130. iFlyBot-VLA Technical Report
- Authors: Yuan Zhang , Chenyu Xue , Wenjie Xu , Chao Ji , Jiajia wu , Jia Pan
- URL: https://arxiv.org/abs/2511.01914
- Abstract:
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our frame-work, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community
131. EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory
- Authors: Wenzhe Fan , Ning Yan , Masood Mortazavi
- URL: https://arxiv.org/abs/2511.01912
- Abstract:
Planning has been a cornerstone of artificial intelligence for solving complex problems, and recent progress in LLM-based multi-agent frameworks have begun to extend this capability. However, the role of human-like memory within these frameworks remains largely unexplored. Understanding how agents coordinate through memory is critical for natural language planning, where iterative reasoning, constraint tracking, and error correction drive the success. Inspired by working memory model in cognitive psychology, we present EvoMem, a multi-agent framework built on a dual-evolving memory mechanism. The framework consists of three agents (Constraint Extractor, Verifier, and Actor) and two memory modules: Constraint Memory (CMem), which evolves across queries by storing task-specific rules and constraints while remains fixed within a query, and Query-feedback Memory (QMem), which evolves within a query by accumulating feedback across iterations for solution refinement. Both memory modules are reset at the end of each query session. Evaluations on trip planning, meeting planning, and calendar scheduling show consistent performance improvements, highlighting the effectiveness of EvoMem. This success underscores the importance of memory in enhancing multi-agent planning.
132. Variational Geometry-aware Neural Network based Method for Solving High-dimensional Diffeomorphic Mapping Problems
- Authors: Zhiwen Li , Cheuk Hin Ho , Lok Ming Lui
- URL: https://arxiv.org/abs/2511.01911
- Abstract:
Traditional methods for high-dimensional diffeomorphic mapping often struggle with the curse of dimensionality. We propose a mesh-free learning framework designed for $n$-dimensional mapping problems, seamlessly combining variational principles with quasi-conformal theory. Our approach ensures accurate, bijective mappings by regulating conformality distortion and volume distortion, enabling robust control over deformation quality. The framework is inherently compatible with gradient-based optimization and neural network architectures, making it highly flexible and scalable to higher-dimensional settings. Numerical experiments on both synthetic and real-world medical image data validate the accuracy, robustness, and effectiveness of the proposed method in complex registration scenarios.
133. Between Myths and Metaphors: Rethinking LLMs for SRH in Conservative Contexts
- Authors: Ameemah Humayun , Bushra Zubair , Maryam Mustafa
- URL: https://arxiv.org/abs/2511.01907
- Abstract:
Low-resource countries represent over 90% of maternal deaths, with Pakistan among the top four countries contributing nearly half in 2023. Since these deaths are mostly preventable, large language models (LLMs) can help address this crisis by automating health communication and risk assessment. However, sexual and reproductive health (SRH) communication in conservative contexts often relies on indirect language that obscures meaning, complicating LLM-based interventions. We conduct a two-stage study in Pakistan: (1) analyzing data from clinical observations, interviews, and focus groups with clinicians and patients, and (2) evaluating the interpretive capabilities of five popular LLMs on this data. Our analysis identifies two axes of communication (referential domain and expression approach) and shows LLMs struggle with semantic drift, myths, and polysemy in clinical interactions. We contribute: (1) empirical themes in SRH communication, (2) a categorization framework for indirect communication, (3) evaluation of LLM performance, and (4) design recommendations for culturally-situated SRH communication.
134. Thinking Like a Student: AI-Supported Reflective Planning in a Theory-Intensive Computer Science Course
- Authors: Noa Izsak
- URL: https://arxiv.org/abs/2511.01906
- Abstract:
In the aftermath of COVID-19, many universities implemented supplementary “reinforcement” roles to support students in demanding courses. Although the name for such roles may differ between institutions, the underlying idea of providing structured supplementary support is common. However, these roles were often poorly defined, lacking structured materials, pedagogical oversight, and integration with the core teaching team. This paper reports on the redesign of reinforcement sessions in a challenging undergraduate course on formal methods and computational models, using a large language model (LLM) as a reflective planning tool. The LLM was prompted to simulate the perspective of a second-year student, enabling the identification of conceptual bottlenecks, gaps in intuition, and likely reasoning breakdowns before classroom delivery. These insights informed a structured, repeatable session format combining targeted review, collaborative examples, independent student work, and guided walkthroughs. Conducted over a single semester, the intervention received positive student feedback, indicating increased confidence, reduced anxiety, and improved clarity, particularly in abstract topics such as the pumping lemma and formal language expressive power comparisons. The findings suggest that reflective, instructor-facing use of LLMs can enhance pedagogical design in theoretically dense domains and may be adaptable to other cognitively demanding computer science courses.
135. Before the Clinic: Transparent and Operable Design Principles for Healthcare AI
- Authors: Alexander Bakumenko (1), Aaron J. Masino (1), Janine Hoelscher (1) ((1) Clemson University, USA)
- URL: https://arxiv.org/abs/2511.01902
- Abstract:
The translation of artificial intelligence (AI) systems into clinical practice requires bridging fundamental gaps between explainable AI theory, clinician expectations, and governance requirements. While conceptual frameworks define what constitutes explainable AI (XAI) and qualitative studies identify clinician needs, little practical guidance exists for development teams to prepare AI systems prior to clinical evaluation. We propose two foundational design principles, Transparent Design and Operable Design, that operationalize pre-clinical technical requirements for healthcare AI. Transparent Design encompasses interpretability and understandability artifacts that enable case-level reasoning and system traceability. Operable Design encompasses calibration, uncertainty, and robustness to ensure reliable, predictable system behavior under real-world conditions. We ground these principles in established XAI frameworks, map them to documented clinician needs, and demonstrate their alignment with emerging governance requirements. This pre-clinical playbook provides actionable guidance for development teams, accelerates the path to clinical evaluation, and establishes a shared vocabulary bridging AI researchers, healthcare practitioners, and regulatory stakeholders. By explicitly scoping what can be built and verified before clinical deployment, we aim to reduce friction in clinical AI translation while remaining cautious about what constitutes validated, deployed explainability.
136. LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency
- Authors: Fangbing Liu , Pengfei Duan , Wen Li , Yi He
- URL: https://arxiv.org/abs/2511.01894
- Abstract:
Recent advancements have demonstrated the great potential of flow matching-based Multimodal Large Language Models (MLLMs) in image editing. However, state-of-the-art works like BAGEL face limitations, including detail degradation, content inconsistency, and inefficiency due to their reliance on random noise initialization. To address these issues, we propose LGCC, a novel framework with two key components: Local Gaussian Noise Coupling (LGNC) and Content Consistency Loss (CCL). LGNC preserves spatial details by modeling target image embeddings and their locally perturbed counterparts as coupled pairs, while CCL ensures semantic alignment between edit instructions and image modifications, preventing unintended content removal. By integrating LGCC with the BAGEL pre-trained model via curriculum learning, we significantly reduce inference steps, improving local detail scores on I2EBench by 1.60% and overall scores by 0.53%. LGCC achieves 3x – 5x speedup for lightweight editing and 2x for universal editing, requiring only 40% – 50% of the inference time of BAGEL or Flux. These results demonstrate LGCC’s ability to preserve detail, maintain contextual integrity, and enhance inference speed, offering a cost-efficient solution without compromising editing quality.
137. Multi-Personality Generation of LLMs at Decoding-time
- Authors: Rongxin Chen , Yunfan Li , Yige Yuan , Bingbing Xu , Huawei Shen
- URL: https://arxiv.org/abs/2511.01891
- Abstract:
Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a “free lunch” to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at this https URL .
138. CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
- Authors: Zijian Zhang , Rong Wang , Shiyang Li , Yuebo Luo , Mingyi Hong , Caiwen Ding
- URL: https://arxiv.org/abs/2511.01884
- Abstract:
Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6\% correctness of generated kernels and an average 1.68$\times$ speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on this http URL accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about $ 0.3 API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and $ 5 API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code available at this https URL
139. CytoNet: A Foundation Model for the Human Cerebral Cortex
- Authors: Christian Schiffer , Zeynep Boztoprak , Jan-Oliver Kropp , Julia Thönnißen , Katia Berr , Hannah Spitzer , Katrin Amunts , Timo Dickscheid
- URL: https://arxiv.org/abs/2511.01870
- Abstract:
To study how the human brain works, we need to explore the organization of the cerebral cortex and its detailed cellular architecture. We introduce CytoNet, a foundation model that encodes high-resolution microscopic image patches of the cerebral cortex into highly expressive feature representations, enabling comprehensive brain analyses. CytoNet employs self-supervised learning using spatial proximity as a powerful training signal, without requiring manual labelling. The resulting features are anatomically sound and biologically relevant. They encode general aspects of cortical architecture and unique brain-specific traits. We demonstrate top-tier performance in tasks such as cortical area classification, cortical layer segmentation, cell morphology estimation, and unsupervised brain region mapping. As a foundation model, CytoNet offers a consistent framework for studying cortical microarchitecture, supporting analyses of its relationship with other structural and functional brain features, and paving the way for diverse neuroscientific investigations.
140. DiffPace: Diffusion-based Plug-and-play Augmented Channel Estimation in mmWave and Terahertz Ultra-Massive MIMO Systems
- Authors: Zhengdong Hu , Chong Han , Wolfgang Gerstacker , Robert Schober
- URL: https://arxiv.org/abs/2511.01867
- Abstract:
Millimeter-wave (mmWave) and Terahertz (THz)-band communications hold great promise in meeting the growing data-rate demands of next-generation wireless networks, offering abundant bandwidth. To mitigate the severe path loss inherent to these high frequencies and reduce hardware costs, ultra-massive multiple-input multiple-output (UM-MIMO) systems with hybrid beamforming architectures can deliver substantial beamforming gains and enhanced spectral efficiency. However, accurate channel estimation (CE) in mmWave and THz UM-MIMO systems is challenging due to high channel dimensionality and compressed observations from a limited number of RF chains, while the hybrid near- and far-field radiation patterns, arising from large array apertures and high carrier frequencies, further complicate CE. Conventional compressive sensing based frameworks rely on predefined sparsifying matrices, which cannot faithfully capture the hybrid near-field and far-field channel structures, leading to degraded estimation performance. This paper introduces DiffPace, a diffusion-based plug-and-play method for channel estimation. DiffPace uses a diffusion model (DM) to capture the channel distribution based on the hybrid spherical and planar-wave (HPSM) model. By applying the plug-and-play approach, it leverages the DM as prior knowledge, improving CE accuracy. Moreover, DM performs inference by solving an ordinary differential equation, minimizing the number of required inference steps compared with stochastic sampling method. Experimental results show that DiffPace achieves competitive CE performance, attaining -15 dB normalized mean square error (NMSE) at a signal-to-noise ratio (SNR) of 10 dB, with 90\% fewer inference steps compared to state-of-the-art schemes, simultaneously providing high estimation precision and enhanced computational efficiency.
141. EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs
- Authors: Benjamin Kubwimana , Qijing Huang
- URL: https://arxiv.org/abs/2511.01866
- Abstract:
Edge intelligence paradigm is increasingly demanded by the emerging autonomous systems, such as robotics. Beyond ensuring privacy-preserving operation and resilience in connectivity-limited environments, edge deployment offers significant energy and cost advantages over cloud-based solutions. However, deploying large language models (LLMs) for reasoning tasks on edge GPUs faces critical challenges from strict latency constraints and limited computational resources. To navigate these constraints, developers must balance multiple design factors - choosing reasoning versus non-reasoning architectures, selecting appropriate model sizes, allocating token budgets, and applying test-time scaling strategies - to meet target latency and optimize accuracy. Yet guidance on optimal combinations of these variables remains scarce. In this work, we present EdgeReasoning, a comprehensive study characterizing the deployment of reasoning LLMs on edge GPUs. We systematically quantify latency-accuracy tradeoffs across various LLM architectures and model sizes. We systematically evaluate prompt-based and model-tuning-based techniques for reducing reasoning token length while maintaining performance quality. We further profile test-time scaling methods with varying degrees of parallelism to maximize accuracy under strict latency budgets. Through these analyses, EdgeReasoning maps the Pareto frontier of achievable accuracy-latency configurations, offering systematic guidance for optimal edge deployment of reasoning LLMs.