전체 AI 논문 - 2025-12-16

1. MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Authors: Tim Cofala , Christian Kalfar , Jingge Xiao , Johanna Schrader , Michelle Tang , Wolfgang Nejdl
URL: https://arxiv.org/abs/2512.11682
Abstract:

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at this https URL .

2. Causal Inference in Energy Demand Prediction

Authors: Chutian Ma , Grigorii Pomazkin , Giacinto Paolo Saggese , Paul Smith
URL: https://arxiv.org/abs/2512.11653
Abstract:

Energy demand prediction is critical for grid operators, industrial energy consumers, and service providers. Energy demand is influenced by multiple factors, including weather conditions (e.g. temperature, humidity, wind speed, solar radiation), and calendar information (e.g. hour of day and month of year), which further affect daily work and life schedules. These factors are causally interdependent, making the problem more complex than simple correlation-based learning techniques satisfactorily allow for. We propose a structural causal model that explains the causal relationship between these variables. A full analysis is performed to validate our causal beliefs, also revealing important insights consistent with prior studies. For example, our causal model reveals that energy demand responds to temperature fluctuations with season-dependent sensitivity. Additionally, we find that energy demand exhibits lower variance in winter due to the decoupling effect between temperature changes and daily activity patterns. We then build a Bayesian model, which takes advantage of the causal insights we learned as prior knowledge. The model is trained and tested on unseen data and yields state-of-the-art performance in the form of a 3.84 percent MAPE on the test set. The model also demonstrates strong robustness, as the cross-validation across two years of data yields an average MAPE of 3.88 percent.

3. AI Benchmark Democratization and Carpentry

Authors: Gregor von Laszewski , Wesley Brewer , Jeyan Thiyagalingam , Juri Papay , Armstrong Foundjem , Piotr Luszczek , Murali Emani , Shirley V. Moore , Vijay Janapa Reddi , Matthew D. Sinclair , Sebastian Lobentanzer , Sujata Goswami , Benjamin Hawks , Marco Colombo , Nhan Tran , Christine R. Kirkpatrick , Abdulkareem Alsudais , Gregg Barrett , Tianhao Li , Kirsten Morehouse , Shivaram Venkataraman , Rutwik Jain , Kartik Mathur , Victor Lu , Tejinder Singh , Khojasteh Z. Mirza , Kongtao Chen , Sasidhar Kunapuli , Gavin Farrell , Renato Umeton , Geoffrey C. Fox
URL: https://arxiv.org/abs/2512.11588
Abstract:

Benchmarks are a cornerstone of modern machine learning, enabling reproducibility, comparison, and scientific progress. However, AI benchmarks are increasingly complex, requiring dynamic, AI-focused workflows. Rapid evolution in model architectures, scale, datasets, and deployment contexts makes evaluation a moving target. Large language models often memorize static benchmarks, causing a gap between benchmark results and real-world performance. Beyond traditional static benchmarks, continuous adaptive benchmarking frameworks are needed to align scientific assessment with deployment risks. This calls for skills and education in AI Benchmark Carpentry. From our experience with MLCommons, educational initiatives, and programs like the DOE’s Trillion Parameter Consortium, key barriers include high resource demands, limited access to specialized hardware, lack of benchmark design expertise, and uncertainty in relating results to application domains. Current benchmarks often emphasize peak performance on top-tier hardware, offering limited guidance for diverse, real-world scenarios. Benchmarking must become dynamic, incorporating evolving models, updated data, and heterogeneous platforms while maintaining transparency, reproducibility, and interpretability. Democratization requires both technical innovation and systematic education across levels, building sustained expertise in benchmark design and use. Benchmarks should support application-relevant comparisons, enabling informed, context-sensitive decisions. Dynamic, inclusive benchmarking will ensure evaluation keeps pace with AI evolution and supports responsible, reproducible, and accessible AI deployment. Community efforts can provide a foundation for AI Benchmark Carpentry.

4. AI-MASLD Metabolic Dysfunction and Information Steatosis of Large Language Models in Unstructured Clinical Narratives

Authors: Yuan Shen , Xiaojun Wu , Linghua Yu
URL: https://arxiv.org/abs/2512.11544
Abstract:

This study aims to simulate real-world clinical scenarios to systematically evaluate the ability of Large Language Models (LLMs) to extract core medical information from patient chief complaints laden with noise and redundancy, and to verify whether they exhibit a functional decline analogous to Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD). We employed a cross-sectional analysis design based on standardized medical probes, selecting four mainstream LLMs as research subjects: GPT-4o, Gemini 2.5, DeepSeek 3.1, and Qwen3-Max. An evaluation system comprising twenty medical probes across five core dimensions was used to simulate a genuine clinical communication environment. All probes had gold-standard answers defined by clinical experts and were assessed via a double-blind, inverse rating scale by two independent clinicians. The results show that all tested models exhibited functional defects to varying degrees, with Qwen3-Max demonstrating the best overall performance and Gemini 2.5 the worst. Under conditions of extreme noise, most models experienced a functional collapse. Notably, GPT-4o made a severe misjudgment in the risk assessment for pulmonary embolism (PE) secondary to deep vein thrombosis (DVT). This research is the first to empirically confirm that LLMs exhibit features resembling metabolic dysfunction when processing clinical information, proposing the innovative concept of “AI-Metabolic Dysfunction-Associated Steatotic Liver Disease (AI-MASLD)”. These findings offer a crucial safety warning for the application of Artificial Intelligence (AI) in healthcare, emphasizing that current LLMs must be used as auxiliary tools under human expert supervision, as there remains a significant gap between their theoretical knowledge and practical clinical application.

5. EmeraldMind: A Knowledge Graph-Augmented Framework for Greenwashing Detection

Authors: Georgios Kaoukis , Ioannis Aris Koufopoulos , Psaroudaki Eleni , Danae Pla Karidi , Evaggelia Pitoura , George Papastefanatos , Panayiotis Tsaparas
URL: https://arxiv.org/abs/2512.11506
Abstract:

As AI and web agents become pervasive in decision-making, it is critical to design intelligent systems that not only support sustainability efforts but also guard against misinformation. Greenwashing, i.e., misleading corporate sustainability claims, poses a major challenge to environmental progress. To address this challenge, we introduce EmeraldMind, a fact-centric framework integrating a domain-specific knowledge graph with retrieval-augmented generation to automate greenwashing detection. EmeraldMind builds the EmeraldGraph from diverse corporate ESG (environmental, social, and governance) reports, surfacing verifiable evidence, often missing in generic knowledge bases, and supporting large language models in claim assessment. The framework delivers justification-centric classifications, presenting transparent, evidence-backed verdicts and abstaining responsibly when claims cannot be verified. Experiments on a new greenwashing claims dataset demonstrate that EmeraldMind achieves competitive accuracy, greater coverage, and superior explanation quality compared to generic LLMs, without the need for fine-tuning or retraining.

6. BAID: A Benchmark for Bias Assessment of AI Detectors

Authors: Priyam Basu , Yunfeng Zhang , Vipul Raheja
URL: https://arxiv.org/abs/2512.11505
Abstract:

AI-generated text detectors have recently gained adoption in educational and professional contexts. Prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs) however, there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose BAID, a comprehensive evaluation framework for AI detectors across various types of biases. As a part of the framework, we introduce over 200k samples spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. We also generated synthetic versions of each sample with carefully crafted prompts to preserve the original content while reflecting subgroup-specific writing styles. Using this, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.

7. General-purpose AI models can generate actionable knowledge on agroecological crop protection

Authors: Kris A.G. Wyckhuys
URL: https://arxiv.org/abs/2512.11474
Abstract:

Generative artificial intelligence (AI) offers potential for democratizing scientific knowledge and converting this to clear, actionable information, yet its application in agri-food science remains unexplored. Here, we verify the scientific knowledge on agroecological crop protection that is generated by either web-grounded or non-grounded large language models (LLMs), i.e., DeepSeek versus the free-tier version of ChatGPT. For nine globally limiting pests, weeds, and plant diseases, we assessed the factual accuracy, data consistency, and breadth of knowledge or data completeness of each LLM. Overall, DeepSeek consistently screened a 4.8-49.7-fold larger literature corpus and reported 1.6-2.4-fold more biological control agents or management solutions than ChatGPT. As a result, DeepSeek reported 21.6% higher efficacy estimates, exhibited greater laboratory-to-field data consistency, and showed more realistic effects of pest identity and management tactics. However, both models hallucinated, i.e., fabricated fictitious agents or references, reported on implausible ecological interactions or outcomes, confused old and new scientific nomenclatures, and omitted data on key agents or solutions. Despite these shortcomings, both LLMs correctly reported low-resolution efficacy trends. Overall, when paired with rigorous human oversight, LLMs may pose a powerful tool to support farm-level decision-making and unleash scientific creativity.

8. Three methods, one problem: Classical and AI approaches to no-three-in-line

Authors: Pranav Ramanathan , Thomas Prellberg , Matthew Lewis , Prathamesh Dinesh Joshi , Raj Abhijit Dandekar , Rajat Dandekar , Sreedath Panat
URL: https://arxiv.org/abs/2512.11469
Abstract:

The No-Three-In-Line problem asks for the maximum number of points that can be placed on an n by n grid with no three collinear, representing a famous problem in combinatorial geometry. While classical methods like Integer Linear Programming (ILP) guarantee optimal solutions, they face exponential scaling with grid size, and recent advances in machine learning offer promising alternatives for pattern-based approximation. This paper presents the first systematic comparison of classical optimization and AI approaches to this problem, evaluating their performance against traditional algorithms. We apply PatternBoost transformer learning and reinforcement learning (PPO) to this problem for the first time, comparing them against ILP. ILP achieves provably optimal solutions up to 19 by 19 grids, while PatternBoost matches optimal performance up to 14 by 14 grids with 96% test loss reduction. PPO achieves perfect solutions on 10 by 10 grids but fails at 11 by 11 grids, where constraint violations prevent valid configurations. These results demonstrate that classical optimization remains essential for exact solutions while AI methods offer competitive performance on smaller instances, with hybrid approaches presenting the most promising direction for scaling to larger problem sizes.

9. Motif-2-12.7B-Reasoning: A Practitioner’s Guide to RL Training Recipes

Authors: Junghwan Lim , Sungmin Lee , Dongseok Kim , Taehyun Kim , Eunhwan Park , Jeesoo Lee , Jeongdoo Lee , Junhyeok Lee , Wai Ting Cheung , Dahye Choi , Minsu Ha , Jaeheui Her , Jaeyeon Huh , Hanbin Jung , Changjin Kang , Beomgyu Kim , Minjae Kim , Taewhan Kim , Youngrok Kim , Hyukjin Kweon , Haesol Lee , Kungyu Lee , Dongpin Oh , Yeongjae Park , Bokki Ryu , Dongjoo Weon
URL: https://arxiv.org/abs/2512.11463
Abstract:

We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.

10. Back to the Baseline: Examining Baseline Effects on Explainability Metrics

Authors: Agustin Martin Picard (ANITI), Thibaut Boissin (ANITI), Varshini Subhash , Rémi Cadène (SU), Thomas Fel (ANITI)
URL: https://arxiv.org/abs/2512.11433
Abstract:

Attribution methods are among the most prevalent techniques in Explainable Artificial Intelligence (XAI) and are usually evaluated and compared using Fidelity metrics, with Insertion and Deletion being the most popular. These metrics rely on a baseline function to alter the pixels of the input image that the attribution map deems most important. In this work, we highlight a critical problem with these metrics: the choice of a given baseline will inevitably favour certain attribution methods over others. More concerningly, even a simple linear model with commonly used baselines contradicts itself by designating different optimal methods. A question then arises: which baseline should we use? We propose to study this problem through two desirable properties of a baseline: (i) that it removes information and (ii) that it does not produce overly out-of-distribution (OOD) images. We first show that none of the tested baselines satisfy both criteria, and there appears to be a trade-off among current baselines: either they remove information or they produce a sequence of OOD images. Finally, we introduce a novel baseline by leveraging recent work in feature visualisation to artificially produce a model-dependent baseline that removes information without being overly OOD, thus improving on the trade-off when compared to other existing baselines. Our code is available at this https URL

11. AgentBalance: Backbone-then-Topology Design for Cost-Effective Multi-Agent Systems under Budget Constraints

Authors: Shuowei Cai , Yansong Ning , Hao Liu
URL: https://arxiv.org/abs/2512.11426
Abstract:

Large Language Model (LLM)-based multi-agent systems (MAS) are becoming indispensable building blocks for web-scale applications such as web search, social network analytics, and online customer support, where cost-effectiveness is increasingly the primary constraint for large-scale deployment. While recent work improves MAS cost-effectiveness by shaping inter-agent communication topologies and selecting agent backbones, it rarely models and optimizes under explicit token-cost and latency budgets that reflect deployment constraints. This often leads to topology-first designs and suboptimal cost-effectiveness when budgets are binding. We present AgentBalance, a framework for constructing cost-effective MAS under explicit token-cost and latency budgets via a backbone-then-topology design. AgentBalance first performs backbone-oriented agent generation, constructing agents with heterogeneous backbones through LLM pool construction, pool selection, and role-backbone matching. It then performs adaptive MAS topology generation, guiding inter-agent communication via agent representation learning, gating, and latency-aware topology synthesis. Experiments on benchmarks with 14 candidate LLM backbones show that AgentBalance achieves up to 10% and 22% performance gains under matched token-cost and latency budgets, respectively, and yields strong AUC on performance-versus-budget curves across benchmarks. AgentBalance also functions as a plug-in for existing MAS, improving performance under the same token-cost and latency constraints, and it generalizes well to unseen LLMs for practical, budget-aware deployment. Code: this https URL

12. Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance

Authors: Gonca Gürsun
URL: https://arxiv.org/abs/2512.11421
Abstract:

Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi-turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM-based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation - action mappings, and a generation module that enforces constraint-compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co-evolve, yielding trustworthy behavior.

13. CAPTURE: A Benchmark and Evaluation for LVLMs in CAPTCHA Resolving

Authors: Jianyi Zhang , Ziyin Zhou , Xu Ji , Shizhao Liu , Zhangchi Zhao
URL: https://arxiv.org/abs/2512.11323
Abstract:

Benefiting from strong and efficient multi-modal alignment strategies, Large Visual Language Models (LVLMs) are able to simulate human visual and reasoning capabilities, such as solving CAPTCHAs. However, existing benchmarks based on visual CAPTCHAs still face limitations. Previous studies, when designing benchmarks and datasets, customized them according to their research objectives. Consequently, these benchmarks cannot comprehensively cover all CAPTCHA types. Notably, there is a dearth of dedicated benchmarks for LVLMs. To address this problem, we introduce a novel CAPTCHA benchmark for the first time, named CAPTURE CAPTCHA for Testing Under Real-world Experiments, specifically for LVLMs. Our benchmark encompasses 4 main CAPTCHA types and 25 sub-types from 31 vendors. The diversity enables a multi-dimensional and thorough evaluation of LVLM performance. CAPTURE features extensive class variety, large-scale data, and unique LVLM-tailored labels, filling the gaps in previous research in terms of data comprehensiveness and labeling pertinence. When evaluated by this benchmark, current LVLMs demonstrate poor performance in solving CAPTCHAs.

14. TriFlow: A Progressive Multi-Agent Framework for Intelligent Trip Planning

Authors: Yuxing Chen , Basem Suleiman , Qifan Chen
URL: https://arxiv.org/abs/2512.11271
Abstract:

Real-world trip planning requires transforming open-ended user requests into executable itineraries under strict spatial, temporal, and budgetary constraints while aligning with user preferences. Existing LLM-based agents struggle with constraint satisfaction, tool coordination, and efficiency, often producing infeasible or costly plans. To address these limitations, we present TriFlow, a progressive multi-agent framework that unifies structured reasoning and language-based flexibility through a three-stage pipeline of retrieval, planning, and governance. By this design, TriFlow progressively narrows the search space, assembles constraint-consistent itineraries via rule-LLM collaboration, and performs bounded iterative refinement to ensure global feasibility and personalisation. Evaluations on TravelPlanner and TripTailor benchmarks demonstrated state-of-the-art results, achieving 91.1% and 97% final pass rates, respectively, with over 10x runtime efficiency improvement compared to current SOTA.

15. A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation

Authors: Hong Je-Gal , Chan-Bin Yi , Hyun-Suk Lee
URL: https://arxiv.org/abs/2512.11270
Abstract:

Applying reinforcement learning (RL) to real-world tasks requires converting informal descriptions into a formal Markov decision process (MDP), implementing an executable environment, and training a policy agent. Automating this process is challenging due to modeling errors, fragile code, and misaligned objectives, which often impede policy training. We introduce an agentic large language model (LLM)-based framework for automated MDP modeling and policy generation (A-LAMP), that automatically translates free-form natural language task descriptions into an MDP formulation and trained policy. The framework decomposes modeling, coding, and training into verifiable stages, ensuring semantic alignment throughout the pipeline. Across both classic control and custom RL domains, A-LAMP consistently achieves higher policy generation capability than a single state-of-the-art LLM model. Notably, even its lightweight variant, which is built on smaller language models, approaches the performance of much larger models. Failure analysis reveals why these improvements occur. In addition, a case study also demonstrates that A-LAMP generates environments and policies that preserve the task’s optimality, confirming its correctness and reliability.

16. FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

Authors: Dongwon Jung , Peng Shi , Yi Zhang
URL: https://arxiv.org/abs/2512.11213
Abstract:

Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there does not exist principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.

17. Deep Learning–Accelerated Multi-Start Large Neighborhood Search for Real-time Freight Bundling

Authors: Haohui Zhang , Wouter van Heeswijk , Xinyu Hu , Neil Yorke-Smith , Martijn Mes
URL: https://arxiv.org/abs/2512.11187
Abstract:

Online Freight Exchange Systems (OFEX) play a crucial role in modern freight logistics by facilitating real-time matching between shippers and carrier. However, efficient combinatorial bundling of transporation jobs remains a bottleneck. We model the OFEX combinatorial bundling problem as a multi-commodity one-to-one pickup-and-delivery selective traveling salesperson problem (m1-PDSTSP), which optimizes revenue-driven freight bundling under capacity, precedence, and route-length constraints. The key challenge is to couple combinatorial bundle selection with pickup-and-delivery routing under sub-second latency. We propose a learning–accelerated hybrid search pipeline that pairs a Transformer Neural Network-based constructive policy with an innovative Multi-Start Large Neighborhood Search (MSLNS) metaheuristic within a rolling-horizon scheme in which the platform repeatedly freezes the current marketplace into a static snapshot and solves it under a short time budget. This pairing leverages the low-latency, high-quality inference of the learning-based constructor alongside the robustness of improvement search; the multi-start design and plausible seeds help LNS to explore the solution space more efficiently. Across benchmarks, our method outperforms state-of-the-art neural combinatorial optimization and metaheuristic baselines in solution quality with comparable time, achieving an optimality gap of less than 2\% in total revenue relative to the best available exact baseline method. To our knowledge, this is the first work to establish that a Deep Neural Network-based constructor can reliably provide high-quality seeds for (multi-start) improvement heuristics, with applicability beyond the \textit{m1-PDSTSP} to a broad class of selective traveling salesperson problems and pickup and delivery problems.

18. CORL: Reinforcement Learning of MILP Policies Solved via Branch and Bound

Authors: Akhil S Anand , Elias Aarekol , Martin Mziray Dalseg , Magnus Stalhane , Sebastien Gros
URL: https://arxiv.org/abs/2512.11169
Abstract:

Combinatorial sequential decision making problems are typically modeled as mixed integer linear programs (MILPs) and solved via branch and bound (B&B) algorithms. The inherent difficulty of modeling MILPs that accurately represent stochastic real world problems leads to suboptimal performance in the real world. Recently, machine learning methods have been applied to build MILP models for decision quality rather than how accurately they model the real world problem. However, these approaches typically rely on supervised learning, assume access to true optimal decisions, and use surrogates for the MILP gradients. In this work, we introduce a proof of concept CORL framework that end to end fine tunes an MILP scheme using reinforcement learning (RL) on real world data to maximize its operational performance. We enable this by casting an MILP solved by B&B as a differentiable stochastic policy compatible with RL. We validate the CORL method in a simple illustrative combinatorial sequential decision making example.

19. Unified Smart Factory Model: A model-based Approach for Integrating Industry 4.0 and Sustainability for Manufacturing Systems

Authors: Ishaan Kaushal , Amaresh Chakrabarti
URL: https://arxiv.org/abs/2512.10631
Abstract:

This paper presents the Unified Smart Factory Model (USFM), a comprehensive framework designed to translate high-level sustainability goals into measurable factory-level indicators with a systematic information map of manufacturing activities. The manufacturing activities were modelled as set of manufacturing, assembly and auxiliary processes using Object Process Methodology, a Model Based Systems Engineering (MBSE) language. USFM integrates Manufacturing Process and System, Data Process, and Key Performance Indicator (KPI) Selection and Assessment in a single framework. Through a detailed case study of Printed Circuit Board (PCB) assembly factory, the paper demonstrates how environmental sustainability KPIs can be selected, modelled, and mapped to the necessary data, highlighting energy consumption and environmental impact metrics. The model’s systematic approach can reduce redundancy, minimize the risk of missing critical information, and enhance data collection. The paper concluded that the USFM bridges the gap between sustainability goals and practical implementation, providing significant benefits for industries specifically SMEs aiming to achieve sustainability targets.

20. Particulate: Feed-Forward 3D Object Articulation

Authors: Ruining Li , Yuxin Yao , Chuanxia Zheng , Christian Rupprecht , Joan Lasenby , Shangzhe Wu , Andrea Vedaldi
URL: https://arxiv.org/abs/2512.11798
Abstract:

We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network’s feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.

21. Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously

Authors: Andrew Adiletta , Kathryn Adiletta , Kemal Derya , Berk Sunar
URL: https://arxiv.org/abs/2512.11783
Abstract:

The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text generation models for malicious text and code generation. To the best of our knowledge, this is the first work to reveal that Llama Prompt Guard 2 can be compromised through joint optimization. Additionally, by analyzing the changing similarity of a model’s internal state to specific concept directions during token sequence processing, we propose an effective and lightweight method to detect Super Suffix attacks. We show that the cosine similarity between the residual stream and certain concept directions serves as a distinctive fingerprint of model intent. Our proposed countermeasure, DeltaGuard, significantly improves the detection of malicious prompts generated through Super Suffixes. It increases the non-benign classification rate to nearly 100%, making DeltaGuard a valuable addition to the guard model stack and enhancing robustness against adversarial prompt attacks.

22. Agile Flight Emerges from Multi-Agent Competitive Racing

Authors: Vineet Pasumarti , Lorenzo Bianchi , Antonio Loquercio
URL: https://arxiv.org/abs/2512.11781
Abstract:

Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: this https URL

23. Conditional Coverage Diagnostics for Conformal Prediction

Authors: Sacha Braun , David Holzmüller , Michael I. Jordan , Francis Bach
URL: https://arxiv.org/abs/2512.11779
Abstract:

Evaluating conditional coverage remains one of the most persistent challenges in assessing the reliability of predictive systems. Although conformal methods can give guarantees on marginal coverage, no method can guarantee to produce sets with correct conditional coverage, leaving practitioners without a clear way to interpret local deviations. To overcome sample-inefficiency and overfitting issues of existing metrics, we cast conditional coverage estimation as a classification problem. Conditional coverage is violated if and only if any classifier can achieve lower risk than the target coverage. Through the choice of a (proper) loss function, the resulting risk difference gives a conservative estimate of natural miscoverage measures such as L1 and L2 distance, and can even separate the effects of over- and under-coverage, and non-constant target coverages. We call the resulting family of metrics excess risk of the target coverage (ERT). We show experimentally that the use of modern classifiers provides much higher statistical power than simple classifiers underlying established metrics like CovGap. Additionally, we use our metric to benchmark different conformal prediction methods. Finally, we release an open-source package for ERT as well as previous conditional coverage metrics. Together, these contributions provide a new lens for understanding, diagnosing, and improving the conditional reliability of predictive systems.

24. Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints

Authors: Kai Yao , Marc Juarez
URL: https://arxiv.org/abs/2512.11771
Abstract:

Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal.

25. Generative Parametric Design (GPD): A framework for real-time geometry generation and on-the-fly multiparametric approximation

Authors: Mohammed El Fallaki Idrissi , Jad Mounayer , Sebastian Rodriguez , Fodil Meraghni , Francisco Chinesta
URL: https://arxiv.org/abs/2512.11748
Abstract:

This paper presents a novel paradigm in simulation-based engineering sciences by introducing a new framework called Generative Parametric Design (GPD). The GPD framework enables the generation of new designs along with their corresponding parametric solutions given as a reduced basis. To achieve this, two Rank Reduction Autoencoders (RRAEs) are employed, one for encoding and generating the design or geometry, and the other for encoding the sparse Proper Generalized Decomposition (sPGD) mode solutions. These models are linked in the latent space using regression techniques, allowing efficient transitions between design and their associated sPGD modes. By empowering design exploration and optimization, this framework also advances digital and hybrid twin development, enhancing predictive modeling and real-time decision-making in engineering applications. The developed framework is demonstrated on two-phase microstructures, in which the multiparametric solutions account for variations in two key material parameters.

26. CogniSNN: Enabling Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability with Random Graph Architectures in Spiking Neural Networks

Authors: Yongsheng Huang , Peibo Duan , Yujie Wu , Kai Sun , Zhipeng Liu , Changsheng Zhang , Bin Zhang , Mingkun Xu
URL: https://arxiv.org/abs/2512.11743
Abstract:

Spiking neural networks (SNNs), regarded as the third generation of artificial neural networks, are expected to bridge the gap between artificial intelligence and computational neuroscience. However, most mainstream SNN research directly adopts the rigid, chain-like hierarchical architecture of traditional artificial neural networks (ANNs), ignoring key structural characteristics of the brain. Biological neurons are stochastically interconnected, forming complex neural pathways that exhibit Neuron-Expandability, Pathway-Reusability, and Dynamic-Configurability. In this paper, we introduce a new SNN paradigm, named Cognition-aware SNN (CogniSNN), by incorporating Random Graph Architecture (RGA). Furthermore, we address the issues of network degradation and dimensional mismatch in deep pathways by introducing an improved pure spiking residual mechanism alongside an adaptive pooling strategy. Then, we design a Key Pathway-based Learning without Forgetting (KP-LwF) approach, which selectively reuses critical neural pathways while retaining historical knowledge, enabling efficient multi-task transfer. Finally, we propose a Dynamic Growth Learning (DGL) algorithm that allows neurons and synapses to grow dynamically along the internal temporal dimension. Extensive experiments demonstrate that CogniSNN achieves performance comparable to, or even surpassing, current state-of-the-art SNNs on neuromorphic datasets and Tiny-ImageNet. The Pathway-Reusability enhances the network’s continuous learning capability across different scenarios, while the dynamic growth algorithm improves robustness against interference and mitigates the fixed-timestep constraints during neuromorphic chip deployment. This work demonstrates the potential of SNNs with random graph structures in advancing brain-inspired intelligence and lays the foundation for their practical application on neuromorphic hardware.

27. From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines

Authors: Titaya Mairittha , Tanakon Sawanglok , Panuwit Raden , Jirapast Buntub , Thanapat Warunee , Napat Asawachaisuvikrom , Thanaphum Saiwongin
URL: https://arxiv.org/abs/2512.11724
Abstract:

While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.

28. From Verification Burden to Trusted Collaboration: Design Goals for LLM-Assisted Literature Reviews

Authors: Brenda Nogueira , Werner Geyer , Andrew Anderson , Toby Jia-Jun Li , Dongwhi Kim , Nuno Moniz , Nitesh V. Chawla
URL: https://arxiv.org/abs/2512.11661
Abstract:

Large Language Models (LLMs) are increasingly embedded in academic writing practices. Although numerous studies have explored how researchers employ these tools for scientific writing, their concrete implementation, limitations, and design challenges within the literature review process remain underexplored. In this paper, we report a user study with researchers across multiple disciplines to characterize current practices, benefits, and \textit{pain points} in using LLMs to investigate related work. We identified three recurring gaps: (i) lack of trust in outputs, (ii) persistent verification burden, and (iii) requiring multiple tools. This motivates our proposal of six design goals and a high-level framework that operationalizes them through improved related papers visualization, verification at every step, and human-feedback alignment with generation-guided explanations. Overall, by grounding our work in the practical, day-to-day needs of researchers, we designed a framework that addresses these limitations and models real-world LLM-assisted writing, advancing trust through verifiable actions and fostering practical collaboration between researchers and AI systems.

29. Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling

Authors: Keerthana Murugaraj , Salima Lamsiyah , Marten During , Martin Theobald
URL: https://arxiv.org/abs/2512.11635
Abstract:

Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformerbased embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.

30. Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Authors: Björn Deiseroth , Max Henning Höth , Kristian Kersting , Letitia Parcalabescu
URL: https://arxiv.org/abs/2512.11614
Abstract:

Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline – both the retriever and the generator – as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unkown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context support the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations – without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.

31. Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents

Authors: Stefan Tabakov , Asen Popov , Dimitar Dimitrov , S. Ensiye Kiyamousavi , Vladimir Hristov , Boris Kraychev
URL: https://arxiv.org/abs/2512.11584
Abstract:

Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace( this https URL )

32. Multi-temporal Calving Front Segmentation

Authors: Marcel Dreier , Nora Gourmelon , Dakota Pyles , Fei Wu , Matthias Braun , Thorsten Seehaus , Andreas Maier , Vincent Christlein
URL: https://arxiv.org/abs/2512.11560
Abstract:

The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier’s mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6.

33. DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Authors: Zhenyang Cai , Jiaming Zhang , Junjie Zhao , Ziyi Zeng , Yanchao Li , Jingyi Liang , Junying Chen , Yunjin Yang , Jiajun You , Shuzhi Deng , Tongfei Wang , Wanting Chen , Chunxiu Hao , Ruiqi Xie , Zhenwei Wen , Xiangyi Feng , Zou Ting , Jin Zou Lin , Jianquan Li , Guangjun Yu , Liangyi Chen , Junwen Wang , Shan Jiang , Benyou Wang
URL: https://arxiv.org/abs/2512.11558
Abstract:

Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM’s visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

34. Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting

Authors: Federico Pennino , Maurizio Gabbrielli
URL: https://arxiv.org/abs/2512.11546
Abstract:

The standard paradigm for training deep learning models on sensor data assumes that more data is always better. However, raw sensor streams are often imbalanced and contain significant redundancy, meaning that not all data points contribute equally to model generalization. In this paper, we show that, in some cases, “less is more” when considering datasets. We do this by reframing the data selection problem: rather than tuning model hyperparameters, we fix the model and optimize the composition of the training data itself. We introduce a framework for discovering the optimal “training diet” from a large, unlabeled time series corpus. Our framework first uses a large-scale encoder and k-means clustering to partition the dataset into distinct, behaviorally consistent clusters. These clusters represent the fundamental ‘ingredients’ available for training. We then employ the Optuna optimization framework to search the high-dimensional space of possible data mixtures. For each trial, Optuna proposes a specific sampling ratio for each cluster, and a new training set is constructed based on this recipe. A smaller target model is then trained and evaluated. Our experiments reveal that this data-centric search consistently discovers data mixtures that yield models with significantly higher performance compared to baselines trained on the entire dataset. Specifically - evaluated on PMSM dataset - our method improved performance from a baseline MSE of 1.70 to 1.37, a 19.41% improvement.

35. Graph Embedding with Mel-spectrograms for Underwater Acoustic Target Recognition

Authors: Sheng Feng , Shuqing Ma , Xiaoqian Zhu
URL: https://arxiv.org/abs/2512.11545
Abstract:

Underwater acoustic target recognition (UATR) is extremely challenging due to the complexity of ship-radiated noise and the variability of ocean environments. Although deep learning (DL) approaches have achieved promising results, most existing models implicitly assume that underwater acoustic data lie in a Euclidean space. This assumption, however, is unsuitable for the inherently complex topology of underwater acoustic signals, which exhibit non-stationary, non-Gaussian, and nonlinear characteristics. To overcome this limitation, this paper proposes the UATR-GTransformer, a non-Euclidean DL model that integrates Transformer architectures with graph neural networks (GNNs). The model comprises three key components: a Mel patchify block, a GTransformer block, and a classification head. The Mel patchify block partitions the Mel-spectrogram into overlapping patches, while the GTransformer block employs a Transformer Encoder to capture mutual information between split patches to generate Mel-graph embeddings. Subsequently, a GNN enhances these embeddings by modeling local neighborhood relationships, and a feed-forward network (FFN) further performs feature transformation. Experiments results based on two widely used benchmark datasets demonstrate that the UATR-GTransformer achieves performance competitive with state-of-the-art methods. In addition, interpretability analysis reveals that the proposed model effectively extracts rich frequency-domain information, highlighting its potential for applications in ocean engineering.

36. Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems

Authors: Chong Tang , Hao Dai , Jagmohan Chauhan
URL: https://arxiv.org/abs/2512.11532
Abstract:

The growing demand for real-time DNN applications on edge devices necessitates faster inference of increasingly complex models. Although many devices include specialized accelerators (e.g., mobile GPUs), dynamic control-flow operators and unsupported kernels often fall back to CPU execution. Existing frameworks handle these fallbacks poorly, leaving CPU cores idle and causing high latency and memory spikes. We introduce Parallax, a framework that accelerates mobile DNN inference without model refactoring or custom operator implementations. Parallax first partitions the computation DAG to expose parallelism, then employs branch-aware memory management with dedicated arenas and buffer reuse to reduce runtime footprint. An adaptive scheduler executes branches according to device memory constraints, meanwhile, fine-grained subgraph control enables heterogeneous inference of dynamic models. By evaluating on five representative DNNs across three different mobile devices, Parallax achieves up to 46% latency reduction, maintains controlled memory overhead (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks, offering improvements aligned with the responsiveness demands of real-time mobile inference.

37. Contrastive Time Series Forecasting with Anomalies

Authors: Joel Ekstrand , Zahra Taghiyarrenani , Slawomir Nowaczyk
URL: https://arxiv.org/abs/2512.11526
Abstract:

Time series forecasting predicts future values from past data. In real-world settings, some anomalous events have lasting effects and influence the forecast, while others are short-lived and should be ignored. Standard forecasting models fail to make this distinction, often either overreacting to noise or missing persistent shifts. We propose Co-TSFA (Contrastive Time Series Forecasting with Anomalies), a regularization framework that learns when to ignore anomalies and when to respond. Co-TSFA generates input-only and input-output augmentations to model forecast-irrelevant and forecast-relevant anomalies, and introduces a latent-output alignment loss that ties representation changes to forecast changes. This encourages invariance to irrelevant perturbations while preserving sensitivity to meaningful distributional shifts. Experiments on the Traffic and Electricity benchmarks, as well as on a real-world cash-demand dataset, demonstrate that Co-TSFA improves performance under anomalous conditions while maintaining accuracy on normal data. An anonymized GitHub repository with the implementation of Co-TSFA is provided and will be made public upon acceptance.

38. NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics

Authors: Hao Wu , Yuan Gao , Fan Xu , Fan Zhang , Guangliang Liu , Yuxuan Liang , Xiaomeng Huang
URL: https://arxiv.org/abs/2512.11525
Abstract:

High-precision scientific simulation faces a long-standing trade-off between computational efficiency and physical fidelity. To address this challenge, we propose NeuralOGCM, an ocean modeling framework that fuses differentiable programming with deep learning. At the core of NeuralOGCM is a fully differentiable dynamical solver, which leverages physics knowledge as its core inductive bias. The learnable physics integration captures large-scale, deterministic physical evolution, and transforms key physical parameters (e.g., diffusion coefficients) into learnable parameters, enabling the model to autonomously optimize its physical core via end-to-end training. Concurrently, a deep neural network learns to correct for subgrid-scale processes and discretization errors not captured by the physics model. Both components work in synergy, with their outputs integrated by a unified ODE solver. Experiments demonstrate that NeuralOGCM maintains long-term stability and physical consistency, significantly outperforming traditional numerical models in speed and pure AI baselines in accuracy. Our work paves a new path for building fast, stable, and physically-plausible models for scientific computing.

39. Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Authors: Mohor Banerjee , Nadya Yuki Wangsajaya , Syed Ali Redha Alsagoff , Min Sen Tan , Zachary Choy Kit Chun , Alvin Chan Guo Wei
URL: https://arxiv.org/abs/2512.11509
Abstract:

Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.

40. Towards Privacy-Preserving Code Generation: Differentially Private Code Language Models

Authors: Melih Catal , Pooja Rani , Harald C. Gall
URL: https://arxiv.org/abs/2512.11482
Abstract:

Large language models specialized for code (CodeLLMs) have demonstrated remarkable capabilities in generating code snippets, documentation, and test cases. However, despite their promising capabilities, CodeLLMs can inadvertently memorize and reproduce snippets from their training data, which poses risks of privacy breaches and intellectual property violations. These risks restrict the deployment of CodeLLMs in sensitive domains and limit their training datasets to publicly available sources. To mitigate the memorization risk without compromising their task performance, we apply Differential Privacy (DP) to CodeLLMs. To the best of our knowledge, this is the first comprehensive study that systematically evaluates the effectiveness of DP in CodeLLMs. DP adds calibrated noise to the training process to protect individual data points while still allowing the model to learn useful patterns. To this end, we first identify and understand the driving reasons of the memorization behaviour of the CodeLLMs during their fine-tuning. Then, to address this issue, we empirically evaluate the effect of DP on mitigating memorization while preserving code generation capabilities. Our findings show that DP substantially reduces memorization in CodeLLMs across all the tested snippet types. The snippet types most prone to memorization are also the most effectively mitigated by DP. Furthermore, we observe that DP slightly increases perplexity but preserves, and can even enhance, the code generation capabilities of CodeLLMs, which makes it feasible to apply DP in practice without significantly compromising model utility. Finally, we analyze the impact of DP on training efficiency and energy consumption, finding that DP does not significantly affect training time or energy usage, making it a practical choice for privacy-preserving CodeLLMs training.

41. Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Authors: Han Lin , Xichen Pan , Ziqi Huang , Ji Hou , Jialiang Wang , Weifeng Chen , Zecheng He , Felix Juefei-Xu , Junzhe Sun , Zhipeng Fan , Ali Thabet , Mohit Bansal , Chu Wang
URL: https://arxiv.org/abs/2512.11464
Abstract:

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.

42. Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation

Authors: Jingmin Zhu , Anqi Zhu , Hossein Rahmani , Jun Liu , Mohammed Bennamoun , Qiuhong Ke
URL: https://arxiv.org/abs/2512.11458
Abstract:

We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at this https URL .

43. Flowception: Temporally Expansive Flow Matching for Video Generation

Authors: Tariq Berrada Ifriqi , John Nguyen , Karteek Alahari , Jakob Verbeek , Ricky T. Q. Chen
URL: https://arxiv.org/abs/2512.11438
Abstract:

We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.

44. Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models

Authors: Kwun Sy Lee , Jiawei Chen , Fuk Sheng Ford Chung , Tianyu Zhao , Zhenyuan Chen , Debby D. Wang
URL: https://arxiv.org/abs/2512.11412
Abstract:

Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model’s decision-making process.

45. REMODEL-LLM: Transforming C code to Java using LLMs

Authors: Aryan Gupta , Y. Raghu Reddy
URL: https://arxiv.org/abs/2512.11402
Abstract:

The automated translation of C code to Java code is a notoriously difficult task, fraught with challenges stemming from fundamental paradigm shifts (procedural vs. Object Oriented), memory models (manual pointers vs. Garbage Collection), and incompatible data types. This paper investigates the efficacy of 19 small, quantized LLMs (under 20 billion parameters) for the C to Java translation task. We use a novel, hybrid pipeline that leverages Abstract Syntax Trees (ASTs) for semantic decomposition and employs a highly constrained, rule based prompting strategy. The results are stark: a clear multi tiered performance divide emerged. The vast majority of models (Tier 3, e.g., llama3.1, gemma3, starcoder2) failed 100\% of the tests, proving incapable of generating even basic, runnable Java boilerplate. A small middle tier (Tier 2, e.g., mistral-nemo and mistral) produced runnable code but was plagued by dangerous semantic failures and wrong translations. Only three models (Tier 1: phi4, deepseek-coder-v2, codeqwen) proved viable, passing over 50\% of the test suite. Even these top models failed on the most complex C concepts, such as function pointers, sizeof, and enum logic, revealing a hard ceiling for the reasoning capabilities of current quantized models.

46. Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture

Authors: Tanu Singh , Pranamesh Chakraborty , Long T. Truong
URL: https://arxiv.org/abs/2512.11350
Abstract:

Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.

47. MLLM Machine Unlearning via Visual Knowledge Distillation

Authors: Yuhang Wang , Zhenxing Niu , Haoxuan Ji , Guangyu He , Haichang Gao , Gang Hua
URL: https://arxiv.org/abs/2512.11325
Abstract:

Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.

48. Condensation-Concatenation Framework for Dynamic Graph Continual Learning

Authors: Tingxu Yan , Ye Yuan
URL: https://arxiv.org/abs/2512.11317
Abstract:

Dynamic graphs are prevalent in real-world scenarios, where continuous structural changes induce catastrophic forgetting in graph neural networks (GNNs). While continual learning has been extended to dynamic graphs, existing methods overlook the effects of topological changes on existing nodes. To address it, we propose a novel framework for continual learning on dynamic graphs, named Condensation-Concatenation-based Continual Learning (CCC). Specifically, CCC first condenses historical graph snapshots into compact semantic representations while aiming to preserve the original label distribution and topological properties. Then it concatenates these historical embeddings with current graph representations selectively. Moreover, we refine the forgetting measure (FM) to better adapt to dynamic graph scenarios by quantifying the predictive performance degradation of existing nodes caused by structural updates. CCC demonstrates superior performance over state-of-the-art baselines across four real-world datasets in extensive experiments.

49. Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining

Authors: Yasaman Hashem Pour , Nazanin Mahjourian , Vinh Nguyen
URL: https://arxiv.org/abs/2512.11296
Abstract:

Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large-Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated in comparison to a zero-shot VLM through multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The VLM showed that few-shot prompting led to overall enhancement of detecting HMI errors and discrepancies with the G-code for more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training.

50. AI Autonomy or Human Dependency? Defining the Boundary in Responsible AI with the $α$-Coefficient

Authors: Nattaya Mairittha , Gabriel Phorncharoenmusikul , Sorawit Worapradidth
URL: https://arxiv.org/abs/2512.11295
Abstract:

The integrity of contemporary AI systems is undermined by a critical design flaw: the misappropriation of Human-in-the-Loop (HITL) models to mask systems that are fundamentally reliant on human labor. We term this structural reliance Human-Instead-of-AI (HISOAI). HISOAI systems represent an ethical failure and an unsustainable economic dependency, where human workers function as hidden operational fallbacks rather than strategic collaborators. To rectify this, we propose the AI-First, Human-Empowered (AFHE) paradigm. AFHE mandates a technological design where the AI component must achieve a minimum, quantifiable level of functional independence prior to deployment. This standard is formalized through the AI Autonomy Coefficient (alpha), a metric that determines the proportion of tasks that the AI successfully processes without mandatory human substitution. We introduce the AFHE Deployment Algorithm, an algorithmic gate that requires the system to meet a specified alpha threshold across both offline and shadow testing. By enforcing this structural separation, the AFHE framework redefines the human’s role to focus exclusively on high-value tasks, including ethical oversight, boundary pushing, and strategic model tuning, thereby ensuring true system transparency and operational independence. This work advocates for a critical shift toward metric-driven, structurally sound AI architecture, moving the industry beyond deceptive human dependency toward verifiable autonomy.

51. CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise

Authors: Qingsen Ma , Dianyun Wang , Ran Jing , Yujun Sun , Zhenbo Xu
URL: https://arxiv.org/abs/2512.11282
Abstract:

Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving 2.6 points improvement in Attributable Rate, 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API level profiling further shows that CIP accelerates contextual understanding and reduces end to end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.

52. Words to Describe What I’m Feeling: Exploring the Potential of AI Agents for High Subjectivity Decisions in Advance Care Planning

Authors: Kellie Yu Hui Sim , Pin Sym Foong , Chenyu Zhao , Melanie Yi Ning Quek , Swarangi Subodh Mehta , Kenny Tsu Wei Choo
URL: https://arxiv.org/abs/2512.11276
Abstract:

Serious illness can deprive patients of the capacity to speak for themselves. As populations age and caregiver networks shrink, the need for reliable support in Advance Care Planning (ACP) grows. To probe this fraught design space of using proxy agents for high-risk, high-subjectivity decisions, we built an experience prototype (\acpagent{}) and asked 15 participants in 4 workshops to train it to be their personal proxy in ACP decisions. We analysed their coping strategies and feature requests and mapped the results onto axes of agent autonomy and human control. Our findings argue for a potential new role of AI in ACP where agents act as personal advocates for individuals, building mutual intelligibility over time. We conclude with design recommendations to balance the risks and benefits of such an agent.

53. A Scalable Multi-GPU Framework for Encrypted Large-Model Inference

Authors: Siddharth Jayashankar , Joshua Kim , Michael B. Sullivan , Wenting Zheng , Dimitrios Skarlatos
URL: https://arxiv.org/abs/2512.11269
Abstract:

Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees; but its slow performance has limited practical deployment. Recent works proposed ASICs to accelerate FHE, but require expensive advanced manufacturing processes that constrain their accessibility. GPUs are a far more accessible platform, but achieving ASIC-level performance using GPUs has remained elusive. Furthermore, state-of-the-art approaches primarily focus on small models that fit comfortably within a single device. Supporting large models such as LLMs in FHE introduces a dramatic increase in computational complexity that requires optimized GPU kernels, along with managing terabyte-scale memory footprints that far exceed the capacity of a single GPU. This paper presents Cerium, a multi-GPU framework for FHE inference on large models. Cerium integrates a domain-specific language, an optimizing compiler, and a runtime system to automatically generate high-performance GPU kernels, manage terabyte-scale memory footprints, and parallelize computation across multiple GPUs. It introduces new IR constructs, compiler passes, sparse polynomial representations, memory-efficient data layouts, and communication-aware parallelization techniques that together enable encrypted inference for models ranging from small CNNs to Llama3-8B. We build Cerium on NVIDIA GPUs and demonstrate significant performance gains. For small models, Cerium outperforms expert-written hand-optimized GPU libraries by up to 2.25 times. Cerium achieves performance competitive with state-of-the-art FHE ASICs, outright matching prior FHE ASIC CraterLake. It is the first GPU system to execute bootstrapping in under 10 milliseconds, achieving 7.5 milliseconds, and is the first to demonstrate encrypted inference for BERT-Base and Llama3-8B in 8 seconds and 134 seconds, respectively.

54. Multi-Intent Spoken Language Understanding: Methods, Trends, and Challenges

Authors: Di Wu , Ruiyu Fang , Liting Jiang , Shuangyong Song , Xiaomeng Huang , Shiquan Wang , Zhongqiu Li , Lingling Shi , Mengjiao Bao , Yongxiang Li , Hao Huang
URL: https://arxiv.org/abs/2512.11258
Abstract:

Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU.

55. A Simple Generalisation of the Implicit Dynamics of In-Context Learning

Authors: Francesco Innocenti , El Mehdi Achour
URL: https://arxiv.org/abs/2512.11255
Abstract:

In-context learning (ICL) refers to the ability of a model to learn new tasks from examples in its input without any parameter updates. In contrast to previous theories of ICL relying on toy models and data settings, recently it has been shown that an abstraction of a transformer block can be seen as implicitly updating the weights of its feedforward network according to the context (Dherin et al., 2025). Here, we provide a simple generalisation of this result for (i) all sequence positions beyond the last, (ii) any transformer block beyond the first, and (iii) more realistic residual blocks including layer normalisation. We empirically verify our theory on simple in-context linear regression tasks and investigate the relationship between the implicit updates related to different tokens within and between blocks. These results help to bring the theory of Dherin et al. (2025) even closer to practice, with potential for validation on large-scale models.

56. VFMF: World Modeling by Forecasting Vision Foundation Model Features

Authors: Gabrijel Boduljak , Yushi Lan , Christian Rupprecht , Andrea Vedaldi
URL: https://arxiv.org/abs/2512.11225
Abstract:

Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

57. Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

Authors: Adilet Metinov , Gulida M. Kudakeeva , Bolotbek uulu Nursultan , Gulnara D. Kabaeva
URL: https://arxiv.org/abs/2512.11221
Abstract:

We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.

58. amc: The Automated Mission Classifier for Telescope Bibliographies

Authors: John F. Wu , Joshua E. G. Peek , Sophie J. Miller , Jenny Novacescu , Achu J. Usha , Christopher A. Wilkinson
URL: https://arxiv.org/abs/2512.11202
Abstract:

Telescope bibliographies record the pulse of astronomy research by capturing publication statistics and citation metrics for telescope facilities. Robust and scalable bibliographies ensure that we can measure the scientific impact of our facilities and archives. However, the growing rate of publications threatens to outpace our ability to manually label astronomical literature. We therefore present the Automated Mission Classifier (amc), a tool that uses large language models (LLMs) to identify and categorize telescope references by processing large quantities of paper text. A modified version of amc performs well on the TRACS Kaggle challenge, achieving a macro $F_1$ score of 0.84 on the held-out test set. amc is valuable for other telescopes beyond TRACS; we developed the initial software for identifying papers that featured scientific results by NASA missions. Additionally, we investigate how amc can also be used to interrogate historical datasets and surface potential label errors. Our work demonstrates that LLM-based applications offer powerful and scalable assistance for library sciences.

59. Fast EXP3 Algorithms

Authors: Ryoma Sato , Shinji Ito
URL: https://arxiv.org/abs/2512.11201
Abstract:

We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms.

60. Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context

Authors: Anatole Jacquin de Margerie , Alexis Roger , Irina Rish
URL: https://arxiv.org/abs/2512.11167
Abstract:

Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work further, by investigating the effect of the inclusion of the global context, which provide practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.

61. MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents

Authors: Jinhao Zhu , Kevin Tseng , Gil Vernik , Xiao Huang , Shishir G. Patil , Vivian Fang , Raluca Ada Popa
URL: https://arxiv.org/abs/2512.11147
Abstract:

Tool calling agents are an emerging paradigm in LLM deployment, with major platforms such as ChatGPT, Claude, and Gemini adding connectors and autonomous capabilities. However, the inherent unreliability of LLMs introduces fundamental security risks when these agents operate over sensitive user services. Prior approaches either rely on manually written policies that require security expertise, or place LLMs in the confinement loop, which lacks rigorous security guarantees. We present MiniScope, a framework that enables tool calling agents to operate on user accounts while confining potential damage from unreliable LLMs. MiniScope introduces a novel way to automatically and rigorously enforce least privilege principles by reconstructing permission hierarchies that reflect relationships among tool calls and combining them with a mobile-style permission model to balance security and ease of use. To evaluate MiniScope, we create a synthetic dataset derived from ten popular real-world applications, capturing the complexity of realistic agentic tasks beyond existing simplified benchmarks. Our evaluation shows that MiniScope incurs only 1-6% latency overhead compared to vanilla tool calling agents, while significantly outperforming the LLM based baseline in minimizing permissions as well as computational and operational costs.

62. Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles

Authors: Lennard Manuel , Hamid Gadirov , Steffen Frey
URL: https://arxiv.org/abs/2512.11145
Abstract:

Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.

63. Fairness-Regularized Online Optimization with Switching Costs

Authors: Pengfei Li , Yuelin Han , Adam Wierman , Shaolei Ren
URL: https://arxiv.org/abs/2512.11131
Abstract:

Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on the entire sequence of actions, we prove that even without switching costs, no online algorithms can possibly achieve a sublinear regret or finite competitive ratio compared to the offline optimal algorithm as the problem episode length $T$ increases. Then, we propose FairOBD (Fairness-regularized Online Balanced Descent), which reconciles the tension between minimizing the hitting cost, switching cost, and fairness cost. Concretely, FairOBD decomposes the long-term fairness cost into a sequence of online costs by introducing an auxiliary variable and then leverages the auxiliary variable to regularize the online actions for fair outcomes. Based on a new approach to account for switching costs, we prove that FairOBD offers a worst-case asymptotic competitive ratio against a novel benchmark – the optimal offline algorithm with parameterized constraints – by considering $T\to\infty$. Finally, we run trace-driven experiments of dynamic computing resource provisioning for socially responsible AI inference to empirically evaluate FairOBD, showing that FairOBD can effectively reduce the total fairness-regularized cost and better promote fair outcomes compared to existing baseline solutions.

64. In-Context Multi-Objective Optimization

Authors: Xinyu Zhang , Conor Hassan , Julien Martinelli , Daolang Huang , Samuel Kaski
URL: https://arxiv.org/abs/2512.11114
Abstract:

Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.

65. FIBER: A Multilingual Evaluation Resource for Factual Inference Bias

Authors: Evren Ayberk Munis , Deniz Yılmaz , Arianna Muti , Çağrı Toraman
URL: https://arxiv.org/abs/2512.11110
Abstract:

Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model’s generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across different topics such that 31% of the topics exhibit factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages such that Turkish prompts show higher bias compared to Italian in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than the single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.

66. Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Authors: Jonathan Kamp , Roos Bakker , Dominique Blok
URL: https://arxiv.org/abs/2512.11108
Abstract:

Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type score low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.

67. Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerating Neural Network Verification

Authors: Duo Zhou , Jorge Chavez , Hesun Chen , Grani A. Hanasusanto , Huan Zhang
URL: https://arxiv.org/abs/2512.11087
Abstract:

State-of-the-art neural network (NN) verifiers demonstrate that applying the branch-and-bound (BaB) procedure with fast bounding techniques plays a key role in tackling many challenging verification properties. In this work, we introduce the linear constraint-driven clipping framework, a class of scalable and efficient methods designed to enhance the efficacy of NN verifiers. Under this framework, we develop two novel algorithms that efficiently utilize linear constraints to 1) reduce portions of the input space that are either verified or irrelevant to a subproblem in the context of branch-and-bound, and 2) directly improve intermediate bounds throughout the network. The process novelly leverages linear constraints that often arise from bound propagation methods and is general enough to also incorporate constraints from other sources. It efficiently handles linear constraints using a specialized GPU procedure that can scale to large neural networks without the use of expensive external solvers. Our verification procedure, Clip-and-Verify, consistently tightens bounds across multiple benchmarks and can significantly reduce the number of subproblems handled during BaB. We show that our clipping algorithms can be integrated with BaB-based verifiers such as $\alpha,\beta$-CROWN, utilizing either the split constraints in activation-space BaB or the output constraints that denote the unverified input space. We demonstrate the effectiveness of our procedure on a broad range of benchmarks where, in some instances, we witness a 96% reduction in the number of subproblems during branch-and-bound, and also achieve state-of-the-art verified accuracy across multiple benchmarks. Clip-and-Verify is part of the $\alpha,\beta$-CROWN verifier ( this http URL ), the VNN-COMP 2025 winner. Code available at this https URL .

68. A probabilistic foundation model for crystal structure denoising, phase classification, and order parameters

Authors: Hyuna Kwon , Babak Sadigh , Sebastien Hamel , Vincenzo Lordi , John Klepeis , Fei Zhou
URL: https://arxiv.org/abs/2512.11077
Abstract:

Atomistic simulations generate large volumes of noisy structural data, but extracting phase labels, order parameters (OPs), and defect information in a way that is universal, robust, and interpretable remains challenging. Existing tools such as PTM and CNA are restricted to a small set of hand-crafted lattices (e.g.\ FCC/BCC/HCP), degrade under strong thermal disorder or defects, and produce hard, template-based labels without per-atom probability or confidence scores. Here we introduce a log-probability foundation model that unifies denoising, phase classification, and OP extraction within a single probabilistic framework. We reuse the MACE-MP foundation interatomic potential on crystal structures mapped to AFLOW prototypes, training it to predict per-atom, per-phase logits $l$ and to aggregate them into a global log-density $\log \hat{P}\theta(\boldsymbol{r})$ whose gradient defines a conservative score field. Denoising corresponds to gradient ascent on this learned log-density, phase labels follow from $\arg\max_c l{ac}$, and the $l$ values act as continuous, defect-sensitive and interpretable OPs quantifying the Euclidean distance to ideal phases. We demonstrate universality across hundreds of prototypes, robustness under strong thermal and defect-induced disorder, and accurate treatment of complex systems such as ice polymorphs, ice–water interfaces, and shock-compressed Ti.

69. Fast, accurate measurement of the worker populations of honey bee colonies using deep learning

Authors: Junmin Zhong , Jon F. Harrison , Jennie Si , Jun Chen
URL: https://arxiv.org/abs/2512.11075
Abstract:

Honey bees play a crucial role in pollination, contributing significantly to global agriculture and ecosystems. Accurately estimating hive populations is essential for understanding the effects of environmental factors on bee colonies, yet traditional methods of counting bees are time-consuming, labor-intensive, and prone to human error, particularly in large-scale studies. In this paper, we present a deep learning-based solution for automating bee population counting using CSRNet and introduce ASUBEE, the FIRST high-resolution dataset specifically designed for this task. Our method employs density map estimation to predict bee populations, effectively addressing challenges such as occlusion and overlapping bees that are common in hive monitoring. We demonstrate that CSRNet achieves superior performance in terms of time efficiency, with a computation time of just 1 second per image, while delivering accurate counts even in complex and densely populated hive scenarios. Our findings show that deep learning approaches like CSRNet can dramatically enhance the efficiency of hive population assessments, providing a valuable tool for researchers and beekeepers alike. This work marks a significant advancement in applying AI technologies to ecological research, offering scalable and precise monitoring solutions for honey bee populations.

70. MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data

Authors: Christopher Driggers-Ellis , Detravious Brinkley , Ray Chen , Aashish Dhawan , Daisy Zhe Wang , Christan Grant
URL: https://arxiv.org/abs/2512.11074
Abstract:

Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over (30000) sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh_Hans and Zh_Hant. Similarity analysis shows that Multi30k extension consistently achieves greater than (0.8) cosine similarity and symmetric KL divergence less than (0.000251) for all languages supported except Zh_Hant which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores $6.4\%$ greater than MultiScript30k-Uk per split.

71. KathDB: Explainable Multimodal Database Management System with Human-AI Collaboration

Authors: Guorui Xiao , Enhao Zhang , Nicole Sullivan , Will Hansen , Magdalena Balazinska
URL: https://arxiv.org/abs/2512.11067
Abstract:

Traditional DBMSs execute user- or application-provided SQL queries over relational data with strong semantic guarantees and advanced query optimization, but writing complex SQL is hard and focuses only on structured tables. Contemporary multimodal systems (which operate over relations but also text, images, and even videos) either expose low-level controls that force users to use (and possibly create) machine learning UDFs manually within SQL or offload execution entirely to black-box LLMs, sacrificing usability or explainability. We propose KathDB, a new system that combines relational semantics with the reasoning power of foundation models over multimodal data. Furthermore, KathDB includes human-AI interaction channels during query parsing, execution, and result explanation, such that users can iteratively obtain explainable answers across data modalities.

72. WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control

Authors: Haoran Jiang , Jin Chen , Qingwen Bu , Li Chen , Modi Shi , Yanjie Zhang , Delong Li , Chuanzhe Suo , Chuang Wang , Zhihui Peng , Hongyang Li
URL: https://arxiv.org/abs/2512.11047
Abstract:

Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To execute the desired locomotion commands more precisely, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.

73. SoccerMaster: A Vision Foundation Model for Soccer Understanding

Authors: Haolin Yang , Jiayuan Rao , Haoning Wu , Weidi Xie
URL: https://arxiv.org/abs/2512.11016
Abstract:

Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.

74. Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification

Authors: Anoop Krishnan
URL: https://arxiv.org/abs/2512.11015
Abstract:

In the quest for fairness in artificial intelligence, novel approaches to enhance it in facial image based gender classification algorithms using text guided methodologies are presented. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Exensive experiments conducted on benchmark datasets demonstrate these approaches effectively mitigate bias and improve accuracy across gender racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic.

75. Beyond Memristor: Neuromorphic Computing Using Meminductor

Authors: Frank Zhigang Wang
URL: https://arxiv.org/abs/2512.11002
Abstract:

Memristor (resistor with memory), inductor with memory (meminductor) and capacitor with memory (memcapacitor) have different roles to play in novel computing architectures. We found that a coil with a magnetic core is an inductor with memory (meminductor) in terms of its inductance L(q) being a function of the charge q. The history of the current passing through the coil is remembered by the magnetization inside the magnetic core. Such a meminductor can play a unique role (that cannot be played by a memristor) in neuromorphic computing, deep learning and brain inspired since the time constant of a neuromorphic RLC circuit is jointly determined by the inductance and capacitance, rather than the resistance. As an experimental verification, this newly invented meminductor was used to reproduce the observed biological behaviour of amoebae (the memorizing, timing and anticipating mechanisms). In conclusion, a beyond memristor computing paradigm is theoretically sensible and experimentally practical.

76. Unambiguous Representations in Neural Networks: An Information-Theoretic Approach to Intentionality

Authors: Francesco Lässig
URL: https://arxiv.org/abs/2512.11000

Abstract:

Representations pervade our daily experience, from letters representing sounds to bit strings encoding digital files. While such representations require externally defined decoders to convey meaning, conscious experience appears fundamentally different: a neural state corresponding to perceiving a red square cannot alternatively encode the experience of a green square. This intrinsic property of consciousness suggests that conscious representations must be unambiguous in a way that conventional representations are not. We formalize this intuition using information theory, defining representational ambiguity as the conditional entropy H(I R) over possible interpretations I given a representation R. Through experiments on neural networks trained to classify MNIST digits, we demonstrate that relational structures in network connectivity can unambiguously encode representational content. Using both learned decoders and direct geometric matching, we achieve perfect (100%) accuracy for dropout-trained networks and 38% for standard backpropagation in identifying output neuron class identity, despite identical task performance, demonstrating that representational ambiguity can arise orthogonally to behavioral accuracy. We further show that spatial position information of input neurons can be decoded from network connectivity with R2 up to 0.844. These results provide a quantitative method for measuring representational ambiguity in neural systems and demonstrate that neural networks can exhibit the low-ambiguity representations posited as necessary (though not sufficient) by theoretical accounts of consciousness.

77. MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA

Authors: Seonok Kim
URL: https://arxiv.org/abs/2512.10996
Abstract:

Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications.

78. MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax

Authors: Zhanpeng Chen , Weihao Gao , Shunyu Wang , Yanan Zhu , Hong Meng , Yuexian Zou
URL: https://arxiv.org/abs/2512.10991
Abstract:

Generating precise 3D molecular geometries is crucial for drug discovery and material science. While prior efforts leverage 1D representations like SELFIES to ensure molecular validity, they fail to fully exploit the rich chemical knowledge entangled within 1D models, leading to a disconnect between 1D syntactic generation and 3D geometric realization. To bridge this gap, we propose MolSculpt, a novel framework that “sculpts” 3D molecular geometries from chemical syntax. MolSculpt is built upon a frozen 1D molecular foundation model and a 3D molecular diffusion model. We introduce a set of learnable queries to extract inherent chemical knowledge from the foundation model, and a trainable projector then injects this cross-modal information into the conditioning space of the diffusion model to guide the 3D geometry generation. In this way, our model deeply integrates 1D latent chemical knowledge into the 3D generation process through end-to-end optimization. Experiments demonstrate that MolSculpt achieves state-of-the-art (SOTA) performance in \textit{de novo} 3D molecule generation and conditional 3D molecule generation, showing superior 3D fidelity and stability on both the GEOM-DRUGS and QM9 datasets. Code is available at this https URL .

79. Dora: QoE-Aware Hybrid Parallelism for Distributed Edge AI

Authors: Jianli Jin , Ziyang Lin , Qianli Dong , Yi Chen , Jayanth Srinivasa , Myungjin Lee , Zhaowei Tan , Fan Lai
URL: https://arxiv.org/abs/2512.10990
Abstract:

With the proliferation of edge AI applications, satisfying user quality of experience (QoE) requirements, such as model inference latency, has become a first class objective, as these models operate in resource constrained settings and directly interact with users. Yet, modern AI models routinely exceed the resource capacity of individual devices, necessitating distributed execution across heterogeneous devices over variable and contention prone networks. Existing planners for hybrid (e.g., data and pipeline) parallelism largely optimize for throughput or device utilization, overlooking QoE, leading to severe resource inefficiency (e.g., unnecessary energy drain) or QoE violations under runtime dynamics. We present Dora, a framework for QoE aware hybrid parallelism in distributed edge AI training and inference. Dora jointly optimizes heterogeneous computation, contention prone networks, and multi dimensional QoE objectives via three key mechanisms: (i) a heterogeneity aware model partitioner that determines and assigns model partitions across devices, forming a compact set of QoE compliant plans; (ii) a contention aware network scheduler that further refines these candidate plans by maximizing compute communication overlap; and (iii) a runtime adapter that adaptively composes multiple plans to maximize global efficiency while respecting overall QoEs. Across representative edge deployments, including smart homes, traffic analytics, and small edge clusters, Dora achieves 1.1–6.3 times faster execution and, alternatively, reduces energy consumption by 21–82 percent, all while maintaining QoE under runtime dynamics.

80. Mathematics of natural intelligence

Authors: Evgenii Vityaev
URL: https://arxiv.org/abs/2512.10988
Abstract:

In the process of evolution, the brain has achieved such perfection that artificial intelligence systems do not have and which needs its own mathematics. The concept of cognitome, introduced by the academician K.V. Anokhin, as the cognitive structure of the mind – a high-order structure of the brain and a neural hypernetwork, is considered as the basis for modeling. Consciousness then is a special form of dynamics in this hypernetwork – a large-scale integration of its cognitive elements. The cognitome, in turn, consists of interconnected COGs (cognitive groups of neurons) of two types – functional systems and cellular ensembles. K.V. Anokhin sees the task of the fundamental theory of the brain and mind in describing these structures, their origin, functions and processes in them. The paper presents mathematical models of these structures based on new mathematical results, as well as models of different cognitive processes in terms of these models. In addition, it is shown that these models can be derived based on a fairly general principle of the brain works: \textit{the brain discovers all possible causal relationships in the external world and draws all possible conclusions from them}. Based on these results, the paper presents models of: ``natural” classification; theory of functional brain systems by P.K. Anokhin; prototypical theory of categorization by E. Roche; theory of causal models by Bob Rehter; theory of consciousness as integrated information by G. Tononi.

81. Marti-5: A Mathematical Model of “Self in the World” as a First Step Toward Self-Awareness

Authors: Igor Pivovarov , Sergey Shumsky
URL: https://arxiv.org/abs/2512.10985
Abstract:

The existence of ‘what’ and ‘where’ pathways of information processing in the brain was proposed almost 30 years ago, but there is still a lack of a clear mathematical model that could show how these pathways work together. We propose a biologically inspired mathematical model that uses this idea to identify and separate the self from the environment and then build and use a self-model for better predictions. This is a model of neocortical columns governed by the basal ganglia to make predictions and choose the next action, where some columns act as ‘what’ columns and others act as ‘where’ columns. Based on this model, we present a reinforcement learning agent that learns purposeful behavior in a virtual environment. We evaluate the agent on the Atari games Pong and Breakout, where it successfully learns to play. We conclude that the ability to separate the self from the environment gives advantages to the agent and therefore such a model could appear in living organisms during evolution. We propose Self-Awareness Principle 1: the ability to separate the self from the world is a necessary but insufficient condition for self-awareness.

82. Developmental Symmetry-Loss: A Free-Energy Perspective on Brain-Inspired Invariance Learning

Authors: Arif Dönmez
URL: https://arxiv.org/abs/2512.10984
Abstract:

We propose Symmetry-Loss, a brain-inspired algorithmic principle that enforces invariance and equivariance through a differentiable constraint derived from environmental symmetries. The framework models learning as the iterative refinement of an effective symmetry group, paralleling developmental processes in which cortical representations align with the world’s structure. By minimizing structural surprise, i.e. deviations from symmetry consistency, Symmetry-Loss operationalizes a Free-Energy–like objective for representation learning. This formulation bridges predictive-coding and group-theoretic perspectives, showing how efficient, stable, and compositional representations can emerge from symmetry-based self-organization. The result is a general computational mechanism linking developmental learning in the brain with principled representation learning in artificial systems.

83. Reducing Fragmentation and Starvation in GPU Clusters through Dynamic Multi-Objective Scheduling

Authors: Akhmadillo Mamirov
URL: https://arxiv.org/abs/2512.10980
Abstract:

GPU clusters have become essential for training and deploying modern AI systems, yet real deployments continue to report average utilization near 50%. This inefficiency is largely caused by fragmentation, heterogeneous workloads, and the limitations of static scheduling policies. This work presents a systematic evaluation of these issues and introduces three specialized dynamic schedulers: Hybrid Priority (HPS), Predictive Backfill (PBS), and Smart Batch (SBS). These schedulers are designed to improve utilization, fairness, and overall throughput in multi-tenant GPU clusters. We evaluate all schedulers using a controlled simulation of 1,000 AI jobs on a 64-GPU, 8-node cluster that includes a realistic mix of training, inference, and research workloads. Static baselines (FIFO, SJF, Shortest, Shortest-GPU) achieve 45 to 67% GPU utilization and 12.5 to 18.3 jobs per hour and experience severe starvation, with as many as 156 jobs waiting longer than 30 minutes. The dynamic schedulers significantly outperform these policies. HPS achieves the highest utilization (78.2%), highest throughput (25.8 jobs per hour), and the lowest fairness variance among dynamic methods (457), reducing starvation to 12 jobs. PBS improves fragmentation handling and reaches 76.1% utilization, while SBS increases efficiency for structurally similar jobs and reaches 74.6% utilization. Across all key metrics, including throughput, job wait times, fairness variance, and starvation, dynamic multi-objective schedulers consistently outperform single-objective heuristics. These results show that targeted and transparent scheduling strategies can meaningfully increase GPU efficiency in heterogeneous AI clusters and provide a practical foundation for future production scheduling frameworks.

84. Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning

Authors: Xueqi Ma , Jun Wang , Yanbei Jiang , Sarah Monazam Erfani , Tongliang Liu , James Bailey
URL: https://arxiv.org/abs/2512.10978
Abstract:

Large language models (LLMs) have achieved state-of-the-art performance in a variety of tasks, but remain largely opaque in terms of their internal mechanisms. Understanding these mechanisms is crucial to improve their reasoning abilities. Drawing inspiration from the interplay between neural processes and human cognition, we propose a novel interpretability framework to systematically analyze the roles and behaviors of attention heads, which are key components of LLMs. We introduce CogQA, a dataset that decomposes complex questions into step-by-step subquestions with a chain-of-thought design, each associated with specific cognitive functions such as retrieval or logical reasoning. By applying a multi-class probing method, we identify the attention heads responsible for these functions. Our analysis across multiple LLM families reveals that attention heads exhibit functional specialization, characterized as cognitive heads. These cognitive heads exhibit several key properties: they are universally sparse, vary in number and distribution across different cognitive functions, and display interactive and hierarchical structures. We further show that cognitive heads play a vital role in reasoning tasks - removing them leads to performance degradation, while augmenting them enhances reasoning accuracy. These insights offer a deeper understanding of LLM reasoning and suggest important implications for model design, training, and fine-tuning strategies.

85. Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems

Authors: Matvey Nepomnyaschiy , Oleg Pereziabov , Anvar Tliamov , Stanislav Mikhailov , Ilya Afanasyev
URL: https://arxiv.org/abs/2512.10975
Abstract:

Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models - leveraging facial expressions, speech, and textual cues - offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios.

86. Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer’s Disease Diagnosis

Authors: Farica Zhuang , Dinara Aliyeva , Shu Yang , Zixuan Wen , Duy Duong-Tran , Christos Davatzikos , Tianlong Chen , Song Wang , Li Shen
URL: https://arxiv.org/abs/2512.10966
Abstract:

Accurate and early diagnosis of Alzheimer’s disease (AD) can benefit from integrating complementary information from multiple modalities, mirroring clinical practice. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models meso-scale brain regions in each modality as an independent expert and employs two-level gating networks to learn subject-specific fusion weights. Beyond improving diagnostic performance, MREF-AD provides modality- and region-level insight into how structural and molecular imaging jointly contribute to disease diagnosis. Using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, underscoring its utility as a general framework for adaptive and interpretable multimodal fusion in neuroimaging.

Authors: Zheqi Hu , Xuanjing Chen , Jinlin Hu
URL: https://arxiv.org/abs/2512.10963
Abstract:

With the rapid growth of AI-generated content (AIGC) across domains such as music, video, and literature, the demand for emotionally aware recommendation systems has become increasingly important. Traditional recommender systems primarily rely on user behavioral data such as clicks, views, or ratings, while neglecting users’ real-time emotional and intentional states during content interaction. To address this limitation, this study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion, integrated into a cloud-native personalized AIGC recommendation framework. The proposed system jointly processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through pretrained encoders ViT, Wav2Vec2, and BERT, followed by an attention-based fusion module to learn emotion-intent representations. These embeddings are then used to drive personalized content recommendations through a contextual matching layer. Experiments conducted on benchmark emotion datasets (AIGC-INT, MELD, and CMU-MOSEI) and an AIGC interaction dataset demonstrate that the proposed MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline. Furthermore, user-level online evaluations reveal that emotion-driven recommendations increase engagement time by 15.2% and enhance satisfaction scores by 11.8%, confirming the model’s effectiveness in aligning AI-generated content with users’ affective and intentional states. This work highlights the potential of cross-modal emotional intelligence for next-generation AIGC ecosystems, enabling adaptive, empathetic, and context-aware recommendation experiences.

88. Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering

Authors: Yifei He , Pranit Chawla , Yaser Souri , Subhojit Som , Xia Song
URL: https://arxiv.org/abs/2512.10962
Abstract:

Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions consisting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI’s computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets (WebSTAR, WebSCORE) and a lightweight reward model (StepRM) as practical tools to advance robust and efficient CUAs.

89. AI as Cognitive Amplifier: Rethinking Human Judgment in the Age of Generative AI

Authors: Tao An
URL: https://arxiv.org/abs/2512.10961
Abstract:

Through extensive experience training professionals and individual users in AI tool adoption since the GPT-3 era, I have observed a consistent pattern: the same AI tool produces dramatically different results depending on who uses it. While some frame AI as a replacement for human intelligence, and others warn of cognitive decline, this position paper argues for a third perspective grounded in practical observation: AI as a cognitive amplifier that magnifies existing human capabilities rather than substituting for them. Drawing on research in human-computer interaction, cognitive augmentation theory, and educational technology, alongside field observations from corporate training across writing, software development, and data analysis domains, I present a framework positioning AI tools as intelligence amplification systems where output quality depends fundamentally on user expertise and judgment. Through analysis of empirical studies on expert-novice differences and systematic observations from professional training contexts, I demonstrate that domain knowledge, quality judgment, and iterative refinement capabilities create substantial performance gaps between users. I propose a three-level model of AI engagement – from passive acceptance through iterative collaboration to cognitive direction – and argue that the transition between levels requires not technical training but development of domain expertise and metacognitive skills. This position has critical implications for workforce development and AI system design. Rather than focusing solely on AI literacy or technical prompt engineering, I advocate for integrated approaches that strengthen domain expertise, evaluative judgment, and reflective practice.

90. Measuring skill-based uplift from AI in a real biological laboratory

Authors: Ethan Obie Romero-Severson , Tara Harvey , Nick Generous , Phillip M. Mach
URL: https://arxiv.org/abs/2512.10960
Abstract:

Understanding how AI systems are used by people in real situations that mirror aspects of both legitimate and illegitimate use is key to predicting the risks and benefits of AI systems. This is especially true in biological applications, where skill rather than knowledge is often the primary barrier for an untrained person. The challenge is that these studies are difficult to execute well and can take months to plan and run. Here we report the results of a pilot study that attempted to empirically measure the magnitude of \emph{skills-based uplift} caused by access to an AI reasoning model, compared with a control group that had only internet access. Participants – drawn from a diverse pool of Los Alamos National Laboratory employees with no prior wet-lab experience – were asked to transform \ecoli{} with a provided expression construct, induce expression of a reporter peptide, and have expression confirmed by mass spectrometry. We recorded quantitative outcomes (e.g., successful completion of experimental segments) and qualitative observations about how participants interacted with the AI system, the internet, laboratory equipment, and one another. We present the results of the study and lessons learned in designing and executing this type of study, and we discuss these results in the context of future studies of the evolving relationship between AI and global biosecurity.