전체 AI 논문 - 2025-12-19

1. Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

Authors: Vincent Huang , Dami Choi , Daniel D. Johnson , Sarah Schwettmann , Jacob Steinhardt
URL: https://arxiv.org/abs/2512.15712
Abstract:

Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.

2. Artism: AI-Driven Dual-Engine System for Art Generation and Critique

Authors: Shuai Liu , Yiqing Tian , Yang Chen , Mar Canet Sola
URL: https://arxiv.org/abs/2512.15710
Abstract:

This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.

3. Explaining the Reasoning of Large Language Models Using Attribution Graphs

Authors: Chase Walker , Rickard Ewetz
URL: https://arxiv.org/abs/2512.15663
Abstract:

Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties-causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.

4. Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning

Authors: Jiaqi Xu , Cuiling Lan , Xuejin Chen , Yan LU
URL: https://arxiv.org/abs/2512.15662
Abstract:

Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critic-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.

5. A Decision-Theoretic Approach for Managing Misalignment

Authors: Daniel A. Herrmann , Abinav Chari , Isabelle Qian , Sree Sharvesh , B. A. Levinstein
URL: https://arxiv.org/abs/2512.15584
Abstract:

When should we delegate decisions to AI systems? While the value alignment literature has developed techniques for shaping AI values, less attention has been paid to how to determine, under uncertainty, when imperfect alignment is good enough to justify delegation. We argue that rational delegation requires balancing an agent’s value (mis)alignment with its epistemic accuracy and its reach (the acts it has available). This paper introduces a formal, decision-theoretic framework to analyze this tradeoff precisely accounting for a principal’s uncertainty about these factors. Our analysis reveals a sharp distinction between two delegation scenarios. First, universal delegation (trusting an agent with any problem) demands near-perfect value alignment and total epistemic trust, conditions rarely met in practice. Second, we show that context-specific delegation can be optimal even with significant misalignment. An agent’s superior accuracy or expanded reach may grant access to better overall decision problems, making delegation rational in expectation. We develop a novel scoring framework to quantify this ex ante decision. Ultimately, our work provides a principled method for determining when an AI is aligned enough for a given context, shifting the focus from achieving perfect alignment to managing the risks and rewards of delegation under uncertainty.

6. Evaluating Large Language Models in Scientific Discovery

Authors: Zhangde Song , Jieyu Lu , Yuanqi Du , Botao Yu , Thomas M. Pruyn , Yue Huang , Kehan Guo , Xiuzhe Luo , Yuanhao Qu , Yi Qu , Yinkai Wang , Haorui Wang , Jeff Guo , Jingru Gan , Parshin Shojaee , Di Luo , Andres M Bran , Gen Li , Qiyuan Zhao , Shao-Xiong Lennon Luo , Yuxuan Zhang , Xiang Zou , Wanru Zhao , Yifan F. Zhang , Wucheng Zhang , Shunan Zheng , Saiyang Zhang , Sartaaj Takrim Khan , Mahyar Rajabi-Kochi , Samantha Paradi-Maropakis , Tony Baltoiu , Fengyu Xie , Tianyang Chen , Kexin Huang , Weiliang Luo , Meijing Fang , Xin Yang , Lixue Cheng , Jiajun He , Soha Hassoun , Xiangliang Zhang , Wei Wang , Chandan K. Reddy , Chao Zhang , Zhiling Zheng , Mengdi Wang , Le Cong , Carla P. Gomes , Chang-Yu Hsieh , Aditya Nandy , Philippe Schwaller , Heather J. Kulik , Haojun Jia , Huan Sun , Seyed Mohamad Moosavi , Chenru Duan
URL: https://arxiv.org/abs/2512.15567
Abstract:

Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific “superintelligence”. Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.

7. Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

Authors: Wei Du , Shubham Toshniwal , Branislav Kisacanin , Sadegh Mahdavi , Ivan Moshkov , George Armstrong , Stephen Ge , Edgar Minasyan , Feng Chen , Igor Gitman
URL: https://arxiv.org/abs/2512.15489
Abstract:

High-quality mathematical reasoning supervision requires diverse reasoning styles, long-form traces, and effective tool integration, capabilities that existing datasets provide only in limited form. Leveraging the multi-mode generation ability of gpt-oss-120b, we introduce Nemotron-Math, a large-scale mathematical reasoning dataset containing 7.5M solution traces across high, medium, and low reasoning modes, each available both with and without Python tool-integrated reasoning (TIR). The dataset integrates 85K curated AoPS problems with 262K community-sourced StackExchange-Math problems, combining structured competition tasks with diverse real-world mathematical queries. We conduct controlled evaluations to assess the dataset quality. Nemotron-Math consistently outperforms the original OpenMathReasoning on matched AoPS problems. Incorporating StackExchange-Math substantially improves robustness and generalization, especially on HLE-Math, while preserving accuracy on math competition benchmarks. To support efficient long-context training, we develop a sequential bucketed strategy that accelerates 128K context-length fine-tuning by 2–3$\times$ without significant accuracy loss. Overall, Nemotron-Math enables state-of-the-art performance, including 100\% maj@16 accuracy on AIME 2024 and 2025 with Python TIR.

8. Intent-Driven UAM Rescheduling

Authors: Jeongseok Kim , Kangjin Kim
URL: https://arxiv.org/abs/2512.15462
Abstract:

Due to the restricted resources, efficient scheduling in vertiports has received much more attention in the field of Urban Air Mobility (UAM). For the scheduling problem, we utilize a Mixed Integer Linear Programming (MILP), which is often formulated in a resource-restricted project scheduling problem (RCPSP). In this paper, we show our approach to handle both dynamic operation requirements and vague rescheduling requests from humans. Particularly, we utilize a three-valued logic for interpreting ambiguous user intents and a decision tree, proposing a newly integrated system that combines Answer Set Programming (ASP) and MILP. This integrated framework optimizes schedules and supports human inputs transparently. With this system, we provide a robust structure for explainable, adaptive UAM scheduling.

9. Outer-Learning Framework for Playing Multi-Player Trick-Taking Card Games: A Case Study in Skat

Authors: Stefan Edelkamp
URL: https://arxiv.org/abs/2512.15435
Abstract:

In multi-player card games such as Skat or Bridge, the early stages of the game, such as bidding, game selection, and initial card selection, are often more critical to the success of the play than refined middle- and end-game play. At the current limits of computation, such early decision-making resorts to using statistical information derived from a large corpus of human expert games. In this paper, we derive and evaluate a general bootstrapping outer-learning framework that improves prediction accuracy by expanding the database of human games with millions of self-playing AI games to generate and merge statistics. We implement perfect feature hash functions to address compacted tables, producing a self-improving card game engine, where newly inferred knowledge is continuously improved during self-learning. The case study in Skat shows that the automated approach can be used to support various decisions in the game.

10. Bilateral Spatial Reasoning about Street Networks: Graph-based RAG with Qualitative Spatial Representations

Authors: Reinhard Moratz , Niklas Daute , James Ondieki , Markus Kattenbeck , Mario Krajina , Ioannis Giannopoulos
URL: https://arxiv.org/abs/2512.15388
Abstract:

This paper deals with improving the capabilities of Large Language Models (LLM) to provide route instructions for pedestrian wayfinders by means of qualitative spatial relations.

11. SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Authors: Zehua Pei , Hui-Ling Zhen , Shixiong Kai , Sinno Jialin Pan , Yunhe Wang , Mingxuan Yuan , Bei Yu
URL: https://arxiv.org/abs/2512.15374
Abstract:

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce \textbf{SCOPE} (Self-evolving Context Optimization via Prompt Evolution). SCOPE frames context management as an \textit{online optimization} problem, synthesizing guidelines from execution traces to automatically evolve the agent’s prompt. We propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles). Furthermore, we introduce Perspective-Driven Exploration to maximize strategy coverage, increasing the likelihood that the agent has the correct strategy for any given task. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23\% to 38.64\% without human intervention. We make our code publicly available at this https URL .

12. ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I

Authors: Seok-Hyun Ga , Chun-Yen Chang
URL: https://arxiv.org/abs/2512.15298
Abstract:

The rapid development of Generative AI is bringing innovative changes to education and assessment. As the prevalence of students utilizing AI for assignments increases, concerns regarding academic integrity and the validity of assessments are growing. This study utilizes the Earth Science I section of the 2025 Korean College Scholastic Ability Test (CSAT) to deeply analyze the multimodal scientific reasoning capabilities and cognitive limitations of state-of-the-art Large Language Models (LLMs), including GPT-4o, Gemini 2.5 Flash, and Gemini 2.5 Pro. Three experimental conditions (full-page input, individual item input, and optimized multimodal input) were designed to evaluate model performance across different data structures. Quantitative results indicated that unstructured inputs led to significant performance degradation due to segmentation and Optical Character Recognition (OCR) failures. Even under optimized conditions, models exhibited fundamental reasoning flaws. Qualitative analysis revealed that “Perception Errors” were dominant, highlighting a “Perception-Cognition Gap” where models failed to interpret symbolic meanings in schematic diagrams despite recognizing visual data. Furthermore, models demonstrated a “Calculation-Conceptualization Discrepancy,” successfully performing calculations while failing to apply the underlying scientific concepts, and “Process Hallucination,” where models skipped visual verification in favor of plausible but unfounded background knowledge. Addressing the challenge of unauthorized AI use in coursework, this study provides actionable cues for designing “AI-resistant questions” that target these specific cognitive vulnerabilities. By exploiting AI’s weaknesses, such as the gap between perception and cognition, educators can distinguish genuine student competency from AI-generated responses, thereby ensuring assessment fairness.

13. Graph Contextual Reinforcement Learning for Efficient Directed Controller Synthesis

Authors: Toshihide Ubukata , Enhong Mu , Takuto Yamauchi , Mingyue Zhang , Jialong Li , Kenji Tei
URL: https://arxiv.org/abs/2512.15295
Abstract:

Controller synthesis is a formal method approach for automatically generating Labeled Transition System (LTS) controllers that satisfy specified properties. The efficiency of the synthesis process, however, is critically dependent on exploration policies. These policies often rely on fixed rules or strategies learned through reinforcement learning (RL) that consider only a limited set of current features. To address this limitation, this paper introduces GCRL, an approach that enhances RL-based methods by integrating Graph Neural Networks (GNNs). GCRL encodes the history of LTS exploration into a graph structure, allowing it to capture a broader, non-current-based context. In a comparative experiment against state-of-the-art methods, GCRL exhibited superior learning efficiency and generalization across four out of five benchmark domains, except one particular domain characterized by high symmetry and strictly local interactions.

14. CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

Authors: Zhengchao Chen , Haoran Wang , Jing Yao , Pedram Ghamisi , Jun Zhou , Peter M. Atkinson , Bing Zhang
URL: https://arxiv.org/abs/2512.15231
Abstract:

The automated and intelligent processing of massive remote sensing (RS) datasets is critical in Earth observation (EO). Existing automated systems are normally task-specific, lacking a unified framework to manage diverse, end-to-end workflows–from data preprocessing to advanced interpretation–across diverse RS applications. To address this gap, this paper introduces CangLing-KnowFlow, a unified intelligent agent framework that integrates a Procedural Knowledge Base (PKB), Dynamic Workflow Adjustment, and an Evolutionary Memory Module. The PKB, comprising 1,008 expert-validated workflow cases across 162 practical RS tasks, guides planning and substantially reduces hallucinations common in general-purpose agents. During runtime failures, the Dynamic Workflow Adjustment autonomously diagnoses and replans recovery strategies, while the Evolutionary Memory Module continuously learns from these events, iteratively enhancing the agent’s knowledge and performance. This synergy enables CangLing-KnowFlow to adapt, learn, and operate reliably across diverse, complex tasks. We evaluated CangLing-KnowFlow on the KnowFlow-Bench, a novel benchmark of 324 workflows inspired by real-world applications, testing its performance across 13 top Large Language Model (LLM) backbones, from open-source to commercial. Across all complex tasks, CangLing-KnowFlow surpassed the Reflexion baseline by at least 4% in Task Success Rate. As the first most comprehensive validation along this emerging field, this research demonstrates the great potential of CangLing-KnowFlow as a robust, efficient, and scalable automated solution for complex EO challenges by leveraging expert knowledge (Knowledge) into adaptive and verifiable procedures (Flow).

15. A Clustering-Based Variable Ordering Framework for Relaxed Decision Diagrams for Maximum Weighted Independent Set Problem

Authors: Mohsen Nafar , Michael Römer , Lin Xie
URL: https://arxiv.org/abs/2512.15198
Abstract:

Efficient exact algorithms for Discrete Optimization (DO) rely heavily on strong primal and dual bounds. Relaxed Decision Diagrams (DDs) provide a versatile mechanism for deriving such dual bounds by compactly over-approximating the solution space through node merging. However, the quality of these relaxed diagrams, i.e. the tightness of the resulting dual bounds, depends critically on the variable ordering and the merging decisions executed during compilation. While dynamic variable ordering heuristics effectively tighten bounds, they often incur computational overhead when evaluated globally across the entire variable set. To mitigate this trade-off, this work introduces a novel clustering-based framework for variable ordering. Instead of applying dynamic ordering heuristics to the full set of unfixed variables, we first partition variables into clusters. We then leverage this structural decomposition to guide the ordering process, significantly reducing the heuristic’s search space. Within this framework, we investigate two distinct strategies: Cluster-to-Cluster, which processes clusters sequentially using problem-specific aggregate criteria (such as cumulative vertex weights in the Maximum Weighted Independent Set Problem (MWISP)), and Pick-and-Sort, which iteratively selects and sorts representative variables from each cluster to balance local diversity with heuristic guidance. Later on, developing some theoretical results on the growth of the size of DDs for MWISP we propose two different policies for setting the number of clusters within the proposed framework. We embed these strategies into a DD-based branch-and-bound algorithm and evaluate them on the MWISP. Across benchmark instances, the proposed methodology consistently reduces computational costs compared to standard dynamic variable ordering baseline.

16. Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models

Authors: Jinwu Hu , Dongjin Yang , Langyu Bian , Zhiquan Wen , Yufeng Wang , Yaofo Chen , Bin Xiao , Yuanqing Li , Mingkui Tan
URL: https://arxiv.org/abs/2512.15089
Abstract:

Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, existing LLM reasoning strategies mainly rely on the LLM itself with fast or slow mode (like o1 thinking) and thus struggle to balance reasoning efficiency and accuracy across queries of varying difficulties. In this paper, we propose Cognitive-Inspired Elastic Reasoning (CogER), a framework inspired by human hierarchical reasoning that dynamically selects the most suitable reasoning strategy for each query. Specifically, CogER first assesses the complexity of incoming queries and assigns them to one of several predefined levels, each corresponding to a tailored processing strategy, thereby addressing the challenge of unobservable query difficulty. To achieve automatic strategy selection, we model the process as a Markov Decision Process and train a CogER-Agent using reinforcement learning. The agent is guided by a reward function that balances solution quality and computational cost, ensuring resource-efficient reasoning. Moreover, for queries requiring external tools, we introduce Cognitive Tool-Assisted Reasoning, which enables the LLM to autonomously invoke external tools within its chain-of-thought. Extensive experiments demonstrate that CogER outperforms state-of-the-art Test-Time scaling methods, achieving at least a 13% relative improvement in average exact match on In-Domain tasks and an 8% relative gain on Out-of-Domain tasks.

17. Agentic AI for Integrated Sensing and Communication: Analysis, Framework, and Case Study

Authors: Wenwen Xie , Geng Sun , Ruichen Zhang , Xuejie Liu , Yinqiu Liu , Jiacheng Wang , Dusit Niyato , Ping Zhang
URL: https://arxiv.org/abs/2512.15044
Abstract:

Integrated sensing and communication (ISAC) has emerged as a key development direction in the sixth-generation (6G) era, which provides essential support for the collaborative sensing and communication of future intelligent networks. However, as wireless environments become increasingly dynamic and complex, ISAC systems require more intelligent processing and more autonomous operation to maintain efficiency and adaptability. Meanwhile, agentic artificial intelligence (AI) offers a feasible solution to address these challenges by enabling continuous perception-reasoning-action loops in dynamic environments to support intelligent, autonomous, and efficient operation for ISAC systems. As such, we delve into the application value and prospects of agentic AI in ISAC systems in this work. Firstly, we provide a comprehensive review of agentic AI and ISAC systems to demonstrate their key characteristics. Secondly, we show several common optimization approaches for ISAC systems and highlight the significant advantages of generative artificial intelligence (GenAI)-based agentic AI. Thirdly, we propose a novel agentic ISAC framework and prensent a case study to verify its superiority in optimizing ISAC performance. Finally, we clarify future research directions for agentic AI-based ISAC systems.

18. LADY: Linear Attention for Autonomous Driving Efficiency without Transformers

Authors: Jihao Huang , Xi Xia , Zhiyuan Li , Tianle Liu , Jingke Wang , Junbo Chen , Tengju Ye
URL: https://arxiv.org/abs/2512.15038
Abstract:

End-to-end paradigms have demonstrated great potential for autonomous driving. Additionally, most existing methods are built upon Transformer architectures. However, transformers incur a quadratic attention cost, limiting their ability to model long spatial and temporal sequences-particularly on resource-constrained edge platforms. As autonomous driving inherently demands efficient temporal modeling, this challenge severely limits their deployment and real-time performance. Recently, linear attention mechanisms have gained increasing attention due to their superior spatiotemporal complexity. However, existing linear attention architectures are limited to self-attention, lacking support for cross-modal and cross-temporal interactions-both crucial for autonomous driving. In this work, we propose LADY, the first fully linear attention-based generative model for end-to-end autonomous driving. LADY enables fusion of long-range temporal context at inference with constant computational and memory costs, regardless of the history length of camera and LiDAR features. Additionally, we introduce a lightweight linear cross-attention mechanism that enables effective cross-modal information exchange. Experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that LADY achieves state-of-the-art performance with constant-time and memory complexity, offering improved planning performance and significantly reduced computational cost. Additionally, the model has been deployed and validated on edge devices, demonstrating its practicality in resource-limited scenarios.

19. Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation

Authors: Xidan Song , Weiqi Wang , Ruifeng Cao , Qingya Hu
URL: https://arxiv.org/abs/2512.15033
Abstract:

The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework, a novel evaluation methodology that rigorously tests model consistency under invariant transformations-including board rotation, mirror symmetry, color inversion, and format conversion. We applied this framework to a comparative analysis of six state-of-the-art LLMs including GPT-5.1, Claude Sonnet 4.5, and Kimi K2 Turbo, utilizing a dataset of approximately 3,000 positions. Our results reveal a significant Accuracy-Stability Paradox. While models such as GPT-5.1 achieve near-optimal accuracy on standard positions, they exhibit catastrophic degradation under geometric perturbation, specifically in rotation tasks where error rates surge by over 600%. This disparity suggests a reliance on pattern matching over abstract spatial logic. Conversely, Claude Sonnet 4.5 and Kimi K2 Turbo demonstrate superior dual robustness, maintaining high consistency across all transformation axes. Furthermore, we analyze the trade-off between helpfulness and safety, identifying Gemini 2.5 Flash as the leader in illegal state rejection (96.0%). We conclude that geometric stability provides an orthogonal and essential metric for AI evaluation, offering a necessary proxy for disentangling reasoning capabilities from data contamination and overfitting in large-scale models.

20. AgroAskAI: A Multi-Agentic AI Framework for Supporting Smallholder Farmers’ Enquiries Globally

Authors: Nadine Angela Cantonjos , Arpita Biswas
URL: https://arxiv.org/abs/2512.14910
Abstract:

Agricultural regions in rural areas face damage from climate-related risks, including droughts, heavy rainfall, and shifting weather patterns. Prior research calls for adaptive risk-management solutions and decision-making strategies. To this end, artificial intelligence (AI), particularly agentic AI, offers a promising path forward. Agentic AI systems consist of autonomous, specialized agents capable of solving complex, dynamic tasks. While past systems have relied on single-agent models or have used multi-agent frameworks only for static functions, there is a growing need for architectures that support dynamic collaborative reasoning and context-aware outputs. To bridge this gap, we present AgroAskAI, a multi-agent reasoning system for climate adaptation decision support in agriculture, with a focus on vulnerable rural communities. AgroAskAI features a modular, role-specialized architecture that uses a chain-of-responsibility approach to coordinate autonomous agents, integrating real-time tools and datasets. The system has built-in governance mechanisms that mitigate hallucination and enable internal feedback for coherent, locally relevant strategies. The system also supports multilingual interactions, making it accessible to non-English-speaking farmers. Experiments on common agricultural queries related to climate adaptation show that, with additional tools and prompt refinement, AgroAskAI delivers more actionable, grounded, and inclusive outputs. Our experimental results highlight the potential of agentic AI for sustainable and accountable decision support in climate adaptation for agriculture.

21. IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection

Authors: Roman Nekrasov , Stefano Fossati , Indika Kumara , Damian Andrew Tamburri , Willem-Jan van den Heuvel
URL: https://arxiv.org/abs/2512.14792
Abstract:

Large Language Models (LLMs) currently exhibit low success rates in generating correct and intent-aligned Infrastructure as Code (IaC). This research investigated methods to improve LLM-based IaC generation, specifically for Terraform, by systematically injecting structured configuration knowledge. To facilitate this, an existing IaC-Eval benchmark was significantly enhanced with cloud emulation and automated error analysis. Additionally, a novel error taxonomy for LLM-assisted IaC code generation was developed. A series of knowledge injection techniques was implemented and evaluated, progressing from Naive Retrieval-Augmented Generation (RAG) to more sophisticated Graph RAG approaches. These included semantic enrichment of graph components and modeling inter-resource dependencies. Experimental results demonstrated that while baseline LLM performance was poor (27.1% overall success), injecting structured configuration knowledge increased technical validation success to 75.3% and overall success to 62.6%. Despite these gains in technical correctness, intent alignment plateaued, revealing a “Correctness-Congruence Gap” where LLMs can become proficient “coders” but remain limited “architects” in fulfilling nuanced user intent.

22. GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge

Authors: Dongzhuoran Zhou , Yuqicheng Zhu , Xiaxia Wang , Hongkuan Zhou , Jiaoyan Chen , Steffen Staab , Yuan He , Evgeny Kharlamov
URL: https://arxiv.org/abs/2512.14766
Abstract:

Large language models (LLMs) achieve strong results on knowledge graph question answering (KGQA), but most benchmarks assume complete knowledge graphs (KGs) where direct supporting triples exist. This reduces evaluation to shallow retrieval and overlooks the reality of incomplete KGs, where many facts are missing and answers must be inferred from existing facts. We bridge this gap by proposing a methodology for constructing benchmarks under KG incompleteness, which removes direct supporting triples while ensuring that alternative reasoning paths required to infer the answer remain. Experiments on benchmarks constructed using our methodology show that existing methods suffer consistent performance degradation under incompleteness, highlighting their limited reasoning ability. To overcome this limitation, we present the Adaptive Graph Reasoning Agent (GR-Agent). It first constructs an interactive environment from the KG, and then formalizes KGQA as agent environment interaction within this environment. GR-Agent operates over an action space comprising graph reasoning tools and maintains a memory of potential supporting reasoning evidence, including relevant relations and reasoning paths. Extensive experiments demonstrate that GR-Agent outperforms non-training baselines and performs comparably to training-based methods under both complete and incomplete settings.

23. Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning

Authors: Sahil Rajesh Dhayalkar
URL: https://arxiv.org/abs/2512.14709
Abstract:

Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention and residual streams as implementing an approximate Vector Symbolic Architecture (VSA). In this view, queries and keys define role spaces, values encode fillers, attention weights perform soft unbinding, and residual connections realize superposition of many bound structures. We use this algebraic lens to relate transformer internals to chain-of-thought traces, program-based reasoning, and memory-augmented tool use, and to explain characteristic failure modes such as variable confusion and inconsistency across logically related prompts. Building on this perspective, we propose VSA-inspired architectural biases, including explicit binding/unbinding heads and hyperdimensional memory layers, and training objectives that promote role-filler separation and robust superposition. Finally, we outline metrics for measuring “VSA-likeness” and logical compositionality, and pose theoretical and architectural open problems. Overall, the paper argues that viewing attention as soft vector-symbolic computation offers a principled route toward more interpretable and logically reliable reasoning systems.

24. Spatia: Video Generation with Updatable Spatial Memory

Authors: Jinjing Zhao , Fangyun Wei , Zhening Liu , Hongyang Zhang , Chang Xu , Yan Lu
URL: https://arxiv.org/abs/2512.15716
Abstract:

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model’s ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.

25. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Authors: Jonas Pai , Liam Achenbach , Victoriano Montesinos , Benedek Forrai , Oier Mees , Elvis Nava
URL: https://arxiv.org/abs/2512.15692
Abstract:

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce \model, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.

26. BashArena: A Control Setting for Highly Privileged AI Agents

Authors: Adam Kaufman , James Lucassen , Tyler Tracy , Cody Rushing , Aryan Bhatt
URL: https://arxiv.org/abs/2512.15688
Abstract:

Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.

27. Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

Authors: Zhenwen Liang , Sidi Lu , Wenhao Yu , Kishan Panaganti , Yujun Zhou , Haitao Mi , Dong Yu
URL: https://arxiv.org/abs/2512.15687
Abstract:

Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model own first order update geometry. For each response, G2RL constructs a sequence level feature from the model final layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off manifold updates are deemphasized, yielding a self referential exploration signal that is naturally aligned with PPO style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.

28. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Authors: Adam Karvonen , James Chua , Clément Dumas , Kit Fraser-Taliente , Subhash Kantamneni , Julian Minder , Euan Ong , Arnab Sen Sharma , Daniel Wen , Owain Evans , Samuel Marks
URL: https://arxiv.org/abs/2512.15674
Abstract:

Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.

29. PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning

Authors: Xiaodi Li , Dingcheng Li , Rujun Gao , Mahmoud Zamani , Feng Mi , Latifur Khan
URL: https://arxiv.org/abs/2512.15658
Abstract:

Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model’s ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.

30. VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Authors: Hongbo Zhao , Meng Wang , Fei Zhu , Wenzhuo Liu , Bolin Ni , Fanhu Zeng , Gaofeng Meng , Zhaoxiang Zhang
URL: https://arxiv.org/abs/2512.15649
Abstract:

The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input this http URL comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the this http URL study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.

31. IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

Authors: Yuanhang Li , Yiren Song , Junzhe Bai , Xinran Liang , Hu Yang , Libiao Jin , Qi Mao
URL: https://arxiv.org/abs/2512.15635
Abstract:

We propose \textbf{IC-Effect}, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (\eg flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning $15$ high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.

32. How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness

Authors: Darshita Rathore , Vineet Kumar , Chetna Bansal , Anindya Moitra
URL: https://arxiv.org/abs/2512.15634
Abstract:

Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.

33. Evaluating Metrics for Safety with LLM-as-Judges

Authors: Kester Clegg , Richard Hawkins , Ibrahim Habli , Tom Lawton
URL: https://arxiv.org/abs/2512.15617
Abstract:

LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.

34. How Smoothing is N-simplicial Attention?

Authors: Alexandre Dussolle , Pietro Liò
URL: https://arxiv.org/abs/2512.15600
Abstract:

Going from pure Multilayer Perceptron (MLP) to a learnable graph message-passing mechanism at each layer has been foundational to state-of-the-art results, despite the computational trade-off (e.g. GATs or Transformers). To go a step further, in this work, we introduce N-simplicial attention, going from pairwise token similarity to higher-order interactions, and adapt it for Rotary Position Embeddings (RoPE). To help manage the increased complexity, we propose a cost-effective simplex selection enabling the model to focus its computation load onto the more task-sensitive interactions. Beyond these core mechanisms, we study how smoothing N-simplicial attention is by deriving a Lipschitz upper-bound and by demonstrating that by itself it also suffers from over-smoothing, despite opening the attention message-passing to higher-order interactions.

35. A Conditioned UNet for Music Source Separation

Authors: Ken O’Hanlon , Basil Woods , Lin Wang , Mark Sandler
URL: https://arxiv.org/abs/2512.15532
Abstract:

In this paper we propose a conditioned UNet for Music Source Separation (MSS). MSS is generally performed by multi-output neural networks, typically UNets, with each output representing a particular stem from a predefined instrument vocabulary. In contrast, conditioned MSS networks accept an audio query related to a stem of interest alongside the signal from which that stem is to be extracted. Thus, a strict vocabulary is not required and this enables more realistic tasks in MSS. The potential of conditioned approaches for such tasks has been somewhat hidden due to a lack of suitable data, an issue recently addressed with the MoisesDb dataset. A recent method, Banquet, employs this dataset with promising results seen on larger vocabularies. Banquet uses Bandsplit RNN rather than a UNet and the authors state that UNets should not be suitable for conditioned MSS. We counter this argument and propose QSCNet, a novel conditioned UNet for MSS that integrates network conditioning elements in the Sparse Compressed Network for MSS. We find QSCNet to outperform Banquet by over 1dB SNR on a couple of MSS tasks, while using less than half the number of parameters.

36. BERT and CNN integrated Neural Collaborative Filtering for Recommender Systems

Authors: Abdullah Al Munem , Sumona Yeasmin , Mohammad Rezwanul Huq
URL: https://arxiv.org/abs/2512.15526
Abstract:

Every day, a significant number of users visit the internet for different needs. The owners of a website generate profits from the user interaction with the contents or items of the website. A robust recommendation system can increase user interaction with a website by recommending items according to the user’s unique preferences. BERT and CNN-integrated neural collaborative filtering (NCF) have been proposed for the recommendation system in this experiment. The proposed model takes inputs from the user and item profile and finds the user’s interest. This model can handle numeric, categorical, and image data to extract the latent features from the inputs. The model is trained and validated on a small sample of the MovieLens dataset for 25 epochs. The same dataset has been used to train and validate a simple NCF and a BERT-based NCF model and compared with the proposed model. The proposed model outperformed those two baseline models. The obtained result for the proposed model is 0.72 recall and 0.486 Hit Ratio @ 10 for 799 users on the MovieLens dataset. This experiment concludes that considering both categorical and image data can improve the performance of a recommendation system.

37. Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection

Authors: Konstantinos Kalogiannis , Ahmed Mohamed Hussain , Hexu Li , Panos Papadimitratos
URL: https://arxiv.org/abs/2512.15503
Abstract:

Vehicular platooning promises transformative improvements in transportation efficiency and safety through the coordination of multi-vehicle formations enabled by Vehicle-to-Everything (V2X) communication. However, the distributed nature of platoon coordination creates security vulnerabilities, allowing authenticated vehicles to inject falsified kinematic data, compromise operational stability, and pose a threat to passenger safety. Traditional misbehaviour detection approaches, which rely on plausibility checks and statistical methods, suffer from high False Positive (FP) rates and cannot capture the complex temporal dependencies inherent in multi-vehicle coordination dynamics. We present Attention In Motion (AIMformer), a transformer-based framework specifically tailored for real-time misbehaviour detection in vehicular platoons with edge deployment capabilities. AIMformer leverages multi-head self-attention mechanisms to simultaneously capture intra-vehicle temporal dynamics and inter-vehicle spatial correlations. It incorporates global positional encoding with vehicle-specific temporal offsets to handle join/exit maneuvers. We propose a Precision-Focused (BCE) loss function that penalizes FPs to meet the requirements of safety-critical vehicular systems. Extensive evaluation across 4 platoon controllers, multiple attack vectors, and diverse mobility scenarios demonstrates superior performance ($\geq$ 0.93) compared to state-of-the-art baseline architectures. A comprehensive deployment analysis utilizing TensorFlow Lite (TFLite), Open Neural Network Exchange (ONNX), and TensorRT achieves sub-millisecond inference latency, making it suitable for real-time operation on resource-constrained edge platforms. Hence, validating AIMformer is viable for both in-vehicle and roadside infrastructure deployment.

38. Soft Geometric Inductive Bias for Object Centric Dynamics

Authors: Hampus Linander , Conor Heins , Alexander Tschantz , Marco Perin , Christopher Buckley
URL: https://arxiv.org/abs/2512.15493
Abstract:

Equivariance is a powerful prior for learning physical dynamics, yet exact group equivariance can degrade performance if the symmetries are broken. We propose object-centric world models built with geometric algebra neural networks, providing a soft geometric inductive bias. Our models are evaluated using simulated environments of 2d rigid body dynamics with static obstacles, where we train for next-step predictions autoregressively. For long-horizon rollouts we show that the soft inductive bias of our models results in better performance in terms of physical fidelity compared to non-equivariant baseline models. The approach complements recent soft-equivariance ideas and aligns with the view that simple, well-chosen priors can yield robust generalization. These results suggest that geometric algebra offers an effective middle ground between hand-crafted physics and unstructured deep nets, delivering sample-efficient dynamics models for multi-object scenes.

39. How Do Semantically Equivalent Code Transformations Impact Membership Inference on LLMs for Code?

Authors: Hua Yang , Alejandro Velasco , Thanh Le-Cong , Md Nazmul Haque , Bowen Xu , Denys Poshyvanyk
URL: https://arxiv.org/abs/2512.15468
Abstract:

The success of large language models for code relies on vast amounts of code data, including public open-source repositories, such as GitHub, and private, confidential code from companies. This raises concerns about intellectual property compliance and the potential unauthorized use of license-restricted code. While membership inference (MI) techniques have been proposed to detect such unauthorized usage, their effectiveness can be undermined by semantically equivalent code transformation techniques, which modify code syntax while preserving semantic. In this work, we systematically investigate whether semantically equivalent code transformation rules might be leveraged to evade MI detection. The results reveal that model accuracy drops by only 1.5% in the worst case for each rule, demonstrating that transformed datasets can effectively serve as substitutes for fine-tuning. Additionally, we find that one of the rules (RenameVariable) reduces MI success by 10.19%, highlighting its potential to obscure the presence of restricted code. To validate these findings, we conduct a causal analysis confirming that variable renaming has the strongest causal effect in disrupting MI detection. Notably, we find that combining multiple transformations does not further reduce MI effectiveness. Our results expose a critical loophole in license compliance enforcement for training large language models for code, showing that MI detection can be substantially weakened by transformation-based obfuscation techniques.

40. On Assessing the Relevance of Code Reviews Authored by Generative Models

Authors: Robert Heumüller , Frank Ortmeier
URL: https://arxiv.org/abs/2512.15466
Abstract:

The use of large language models like ChatGPT in code review offers promising efficiency gains but also raises concerns about correctness and safety. Existing evaluation methods for code review generation either rely on automatic comparisons to a single ground truth, which fails to capture the variability of human perspectives, or on subjective assessments of “usefulness”, a highly ambiguous concept. We propose a novel evaluation approach based on what we call multi-subjective ranking. Using a dataset of 280 self-contained code review requests and corresponding comments from CodeReview StackExchange, multiple human judges ranked the quality of ChatGPT-generated comments alongside the top human responses from the platform. Results show that ChatGPT’s comments were ranked significantly better than human ones, even surpassing StackExchange’s accepted answers. Going further, our proposed method motivates and enables more meaningful assessments of generative AI’s performance in code review, while also raising awareness of potential risks of unchecked integration into review processes.

41. Double Horizon Model-Based Policy Optimization

Authors: Akihiro Kubo , Paavo Parmas , Shin Ishii
URL: https://arxiv.org/abs/2512.15439
Abstract:

Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multiple steps, implying another intermediate horizon for stable gradient estimates. However, these two optimal horizons may differ. To resolve this conflict, we propose Double Horizon Model-Based Policy Optimization (DHMBPO), which divides the rollout procedure into a long “distribution rollout” (DR) and a short “training rollout” (TR). The DR generates on-policy state samples for mitigating distribution shift. In contrast, the short TR leverages differentiable transitions to offer accurate value gradient estimation with stable gradient updates, thereby requiring fewer updates and reducing overall runtime. We demonstrate that the double-horizon approach effectively balances distribution shift, model bias, and gradient instability, and surpasses existing MBRL methods on continuous-control benchmarks in terms of both sample efficiency and runtime.

42. FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments

Authors: Quanxi Zhou , Wencan Mao , Manabu Tsukada , John C.S. Lui , Yusheng Ji
URL: https://arxiv.org/abs/2512.15430
Abstract:

Model-based reinforcement learning (MBRL) and model-free reinforcement learning (MFRL) evolve along distinct paths but converge in the design of Dyna-Q [1]. However, modern RL methods still struggle with effective transferability across tasks and scenarios. Motivated by this limitation, we propose a generalized algorithm, Feature Model-Based Enhanced Actor-Critic (FM-EAC), that integrates planning, acting, and learning for multi-task control in dynamic environments. FM-EAC combines the strengths of MBRL and MFRL and improves generalizability through the use of novel feature-based models and an enhanced actor-critic framework. Simulations in both urban and agricultural applications demonstrate that FM-EAC consistently outperforms many state-of-the-art MBRL and MFRL methods. More importantly, different sub-networks can be customized within FM-EAC according to user-specific requirements.

43. SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering

Authors: Liang Peng , Yixuan Ye , Cheng Liu , Hangjun Che , Fei Wang , Zhiwen Yu , Si Wu , Hau-San Wong
URL: https://arxiv.org/abs/2512.15396
Abstract:

Multi-view clustering has been empirically shown to improve learning performance by leveraging the inherent complementary information across multiple views of data. However, in real-world scenarios, collecting strictly aligned views is challenging, and learning from both aligned and unaligned data becomes a more practical solution. Partially View-aligned Clustering aims to learn correspondences between misaligned view samples to better exploit the potential consistency and complementarity across views, including both aligned and unaligned data. However, most existing PVC methods fail to leverage unaligned data to capture the shared semantics among samples from the same cluster. Moreover, the inherent heterogeneity of multi-view data induces distributional shifts in representations, leading to inaccuracies in establishing meaningful correspondences between cross-view latent features and, consequently, impairing learning effectiveness. To address these challenges, we propose a Semantic MAtching contRasTive learning model (SMART) for PVC. The main idea of our approach is to alleviate the influence of cross-view distributional shifts, thereby facilitating semantic matching contrastive learning to fully exploit semantic relationships in both aligned and unaligned data. Extensive experiments on eight benchmark datasets demonstrate that our method consistently outperforms existing approaches on the PVC problem.

44. Emotion Recognition in Signers

Authors: Kotaro Funakoshi , Yaoxiong Zhu
URL: https://arxiv.org/abs/2512.15376
Abstract:

Recognition of signers’ emotions suffers from one theoretical challenge and one practical challenge, namely, the overlap between grammatical and affective facial expressions and the scarcity of data for model training. This paper addresses these two challenges in a cross-lingual setting using our eJSL dataset, a new benchmark dataset for emotion recognition in Japanese Sign Language signers, and BOBSL, a large British Sign Language dataset with subtitles. In eJSL, two signers expressed 78 distinct utterances with each of seven different emotional states, resulting in 1,092 video clips. We empirically demonstrate that 1) textual emotion recognition in spoken language mitigates data scarcity in sign language, 2) temporal segment selection has a significant impact, and 3) incorporating hand motion enhances emotion recognition in signers. Finally we establish a stronger baseline than spoken language LLMs.

45. Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

Authors: Mikel Williams-Lekuona , Georgina Cosma
URL: https://arxiv.org/abs/2512.15372
Abstract:

Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.

46. Adversarial versification in portuguese as a jailbreak operator in LLMs

Authors: Joao Queiroz
URL: https://arxiv.org/abs/2512.15353
Abstract:

Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study ‘Adversarial poetry as a universal single-turn jailbreak mechanism in large language models’ demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18 x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach approximately 62% ASR, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This dissociation between apparent robustness and real vulnerability exposes deep limitations in current alignment regimes. The absence of evaluations in Portuguese, a language with high morphosyntactic complexity, a rich metric-prosodic tradition, and over 250 million speakers, constitutes a critical gap. Experimental protocols must parameterise scansion, metre, and prosodic variation to test vulnerabilities specific to Lusophone patterns, which are currently ignored.

47. Empirical Investigation of the Impact of Phase Information on Fault Diagnosis of Rotating Machinery

Authors: Hiroyoshi Nagahama , Katsufumi Inoue , Masayoshi Todorokihara , Michifumi Yoshioka
URL: https://arxiv.org/abs/2512.15344
Abstract:

Predictive maintenance of rotating machinery increasingly relies on vibration signals, yet most learning-based approaches either discard phase during spectral feature extraction or use raw time-waveforms without explicitly leveraging phase information. This paper introduces two phase-aware preprocessing strategies to address random phase variations in multi-axis vibration data: (1) three-axis independent phase adjustment that aligns each axis individually to zero phase (2) single-axis reference phase adjustment that preserves inter-axis relationships by applying uniform time shifts. Using a newly constructed rotor dataset acquired with a synchronized three-axis sensor, we evaluate six deep learning architectures under a two-stage learning framework. Results demonstrate architecture-independent improvements: the three-axis independent method achieves consistent gains (+2.7\% for Transformer), while the single-axis reference approach delivers superior performance with up to 96.2\% accuracy (+5.4\%) by preserving spatial phase relationships. These findings establish both phase alignment strategies as practical and scalable enhancements for predictive maintenance systems.

48. Exploring User Acceptance and Concerns toward LLM-powered Conversational Agents in Immersive Extended Reality

Authors: Efe Bozkir , Enkelejda Kasneci
URL: https://arxiv.org/abs/2512.15343
Abstract:

The rapid development of generative artificial intelligence (AI) and large language models (LLMs), and the availability of services that make them accessible, have led the general public to begin incorporating them into everyday life. The extended reality (XR) community has also sought to integrate LLMs, particularly in the form of conversational agents, to enhance user experience and task efficiency. When interacting with such conversational agents, users may easily disclose sensitive information due to the naturalistic flow of the conversations, and combining such conversational data with fine-grained sensor data may lead to novel privacy issues. To address these issues, a user-centric understanding of technology acceptance and concerns is essential. Therefore, to this end, we conducted a large-scale crowdsourcing study with 1036 participants, examining user decision-making processes regarding LLM-powered conversational agents in XR, across factors of XR setting type, speech interaction type, and data processing location. We found that while users generally accept these technologies, they express concerns related to security, privacy, social implications, and trust. Our results suggest that familiarity plays a crucial role, as daily generative AI use is associated with greater acceptance. In contrast, previous ownership of XR devices is linked to less acceptance, possibly due to existing familiarity with the settings. We also found that men report higher acceptance with fewer concerns than women. Regarding data type sensitivity, location data elicited the most significant concern, while body temperature and virtual object states were considered least sensitive. Overall, our study highlights the importance of practitioners effectively communicating their measures to users, who may remain distrustful. We conclude with implications and recommendations for LLM-powered XR.

49. Vision-based module for accurately reading linear scales in a laboratory

Authors: Parvesh Saini , Soumyadipta Maiti , Beena Rai
URL: https://arxiv.org/abs/2512.15327
Abstract:

Capabilities and the number of vision-based models are increasing rapidly. And these vision models are now able to do more tasks like object detection, image classification, instance segmentation etc. with great accuracy. But models which can take accurate quantitative measurements form an image, as a human can do by just looking at it, are rare. For a robot to work with complete autonomy in a Laboratory environment, it needs to have some basic skills like navigation, handling objects, preparing samples etc. to match human-like capabilities in an unstructured environment. Another important capability is to read measurements from instruments and apparatus. Here, we tried to mimic a human inspired approach to read measurements from a linear scale. As a test case we have picked reading level from a syringe and a measuring cylinder. For a randomly oriented syringe we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the linear scale containing part of the image. After that, a series of features were extracted like the major makers, the corresponding digits, and the level indicator location, from which the final reading was calculated. Readings obtained using this system were also compared against human read values of the same instances and an accurate correspondence was observed.

50. Managing Ambiguity: A Proof of Concept of Human-AI Symbiotic Sense-making based on Quantum-Inspired Cognitive Mechanism of Rogue Variable Detection

Authors: Agnieszka Bienkowska , Jacek Malecki , Alexander Mathiesen-Ohman , Katarzyna Tworek
URL: https://arxiv.org/abs/2512.15325
Abstract:

Organizations increasingly operate in environments characterized by volatility, uncertainty, complexity, and ambiguity (VUCA), where early indicators of change often emerge as weak, fragmented signals. Although artificial intelligence (AI) is widely used to support managerial decision-making, most AI-based systems remain optimized for prediction and resolution, leading to premature interpretive closure under conditions of high ambiguity. This creates a gap in management science regarding how human-AI systems can responsibly manage ambiguity before it crystallizes into error or crisis. This study addresses this gap by presenting a proof of concept (PoC) of the LAIZA human-AI augmented symbiotic intelligence system and its patented process: Systems and Methods for Quantum-Inspired Rogue Variable Modeling (QRVM), Human-in-the-Loop Decoherence, and Collective Cognitive Inference. The mechanism operationalizes ambiguity as a non-collapsed cognitive state, detects persistent interpretive breakdowns (rogue variables), and activates structured human-in-the-loop clarification when autonomous inference becomes unreliable. Empirically, the article draws on a three-month case study conducted in 2025 within the AI development, involving prolonged ambiguity surrounding employee intentions and intellectual property boundaries. The findings show that preserving interpretive plurality enabled early scenario-based preparation, including proactive patent protection, allowing decisive and disruption-free action once ambiguity collapsed. The study contributes to management theory by reframing ambiguity as a first-class construct and demonstrates the practical value of human-AI symbiosis for organizational resilience in VUCA environments.

51. Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment

Authors: Antony Jerald , Dattesh Shanbhag , Sudhanya Chatterjee
URL: https://arxiv.org/abs/2512.15315
Abstract:

Motion artifacts degrade MRI image quality and increase patient recalls. Existing automated quality assessment methods are largely limited to binary decisions and provide little interpretability. We introduce AutoMAC-MRI, an explainable framework for grading motion artifacts across heterogeneous MR contrasts and orientations. The approach uses supervised contrastive learning to learn a discriminative representation of motion severity. Within this feature space, we compute grade-specific affinity scores that quantify an image’s proximity to each motion grade, thereby making grade assignments transparent and interpretable. We evaluate AutoMAC-MRI on more than 5000 expert-annotated brain MRI slices spanning multiple contrasts and views. Experiments assessing affinity scores against expert labels show that the scores align well with expert judgment, supporting their use as an interpretable measure of motion severity. By coupling accurate grade detection with per-grade affinity scoring, AutoMAC-MRI enables inline MRI quality control, with the potential to reduce unnecessary rescans and improve workflow efficiency.

52. Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies

Authors: Charan Prakash Rathore , Saumi Ray , Dhruv Kumar
URL: https://arxiv.org/abs/2512.15312
Abstract:

Extracting structured information from zeolite synthesis experimental procedures is critical for materials discovery, yet existing methods have not systematically evaluated Large Language Models (LLMs) for this domain-specific task. This work addresses a fundamental question: what is the efficacy of different prompting strategies when applying LLMs to scientific information extraction? We focus on four key subtasks: event type classification (identifying synthesis steps), trigger text identification (locating event mentions), argument role extraction (recognizing parameter types), and argument text extraction (extracting parameter values). We evaluate four prompting strategies - zero-shot, few-shot, event-specific, and reflection-based - across six state-of-the-art LLMs (Gemma-3-12b-it, GPT-5-mini, O4-mini, Claude-Haiku-3.5, DeepSeek reasoning and non-reasoning) using the ZSEE dataset of 1,530 annotated sentences. Results demonstrate strong performance on event type classification (80-90\% F1) but modest performance on fine-grained extraction tasks, particularly argument role and argument text extraction (50-65\% F1). GPT-5-mini exhibits extreme prompt sensitivity with 11-79\% F1 variation. Notably, advanced prompting strategies provide minimal improvements over zero-shot approaches, revealing fundamental architectural limitations. Error analysis identifies systematic hallucination, over-generalization, and inability to capture synthesis-specific nuances. Our findings demonstrate that while LLMs achieve high-level understanding, precise extraction of experimental parameters requires domain-adapted models, providing quantitative benchmarks for scientific information extraction.

53. Graph Pattern-based Association Rules Evaluated Under No-repeated-anything Semantics in the Graph Transactional Setting

Authors: Basil Ell
URL: https://arxiv.org/abs/2512.15308
Abstract:

We introduce graph pattern-based association rules (GPARs) for directed labeled multigraphs such as RDF graphs. GPARs support both generative tasks, where a graph is extended, and evaluative tasks, where the plausibility of a graph is assessed. The framework goes beyond related formalisms such as graph functional dependencies, graph entity dependencies, relational association rules, graph association rules, multi-relation and path association rules, and Horn rules. Given a collection of graphs, we evaluate graph patterns under no-repeated-anything semantics, which allows the topology of a graph to be taken into account more effectively. We define a probability space and derive confidence, lift, leverage, and conviction in a probabilistic setting. We further analyze how these metrics relate to their classical itemset-based counterparts and identify conditions under which their characteristic properties are preserved.

54. Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions

Authors: Siva Sai , Ishika Goyal , Shubham Sharma , Sri Harshita Manuri , Vinay Chamola , Rajkumar Buyya
URL: https://arxiv.org/abs/2512.15286
Abstract:

The increasing number of cyber threats and rapidly evolving tactics, as well as the high volume of data in recent years, have caused classical machine learning, rules, and signature-based defence strategies to fail, rendering them unable to keep up. An alternative, Quantum Machine Learning (QML), has recently emerged, making use of computations based on quantum mechanics. It offers better encoding and processing of high-dimensional structures for certain problems. This survey provides a comprehensive overview of QML techniques relevant to the domain of security, such as Quantum Neural Networks (QNNs), Quantum Support Vector Machines (QSVMs), Variational Quantum Circuits (VQCs), and Quantum Generative Adversarial Networks (QGANs), and discusses the contributions of this paper in relation to existing research in the field and how it improves over them. It also maps these methods across supervised, unsupervised, and generative learning paradigms, and to core cybersecurity tasks, including intrusion and anomaly detection, malware and botnet classification, and encrypted-traffic analytics. It also discusses their application in the domain of cloud computing security, where QML can enhance secure and scalable operations. Many limitations of QML in the domain of cybersecurity have also been discussed, along with the directions for addressing them.

55. Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning

Authors: Yiliu Sun , Zicheng Zhao , Yang Wei , Yanfang Zhang , Chen Gong
URL: https://arxiv.org/abs/2512.15274
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established human thinking theory of Path Dependence, where early-stage thoughts substantially constrain subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning termed Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy can positively influence subsequent reasoning processes, and ultimately improve final results. To improve the learning effectiveness of LLMs on how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for one prefix token sequence, and accumulating their scores as the reward signal. Extensive experimental results on various reasoning tasks demonstrate that our proposed PPPO outperforms representative RLVR methods, with the accuracy improvements of 18.02% on only 26.17% training tokens.

Authors: Yuze Wu , Mo Zhu , Xingxing Li , Yuheng Du , Yuxin Fan , Wenjun Li , Xin Zhou , Fei Gao
URL: https://arxiv.org/abs/2512.15258
Abstract:

This paper proposes VLA-AN, an efficient and onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal navigation with reasoning, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust real-time 8.3x improvement in inference throughput on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1%, and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.

Authors: Youssef Ghallab , Omar Iraqy , Mohamed Kandil , Mohamed Ashraf , Saadeldine Eletter , Morougue Ghazal , Ayman Khalafallah , Nagwa El-Makky
URL: https://arxiv.org/abs/2512.15250
Abstract:

Physiological signals such as electrocardiograms (ECG) and electroencephalograms (EEG) provide complementary insights into human health and cognition, yet multi-modal integration is challenging due to limited multi-modal labeled data, and modality-specific differences . In this work, we adapt the CBraMod encoder for large-scale self-supervised ECG pretraining, introducing a dual-masking strategy to capture intra- and inter-lead dependencies. To overcome the above challenges, we utilize a pre-trained CBraMod encoder for EEG and pre-train a symmetric ECG encoder, equipping each modality with a rich foundational representation. These representations are then fused via simple embedding concatenation, allowing the classification head to learn cross-modal interactions, together enabling effective downstream learning despite limited multi-modal supervision. Evaluated on emotion recognition, our approach achieves near state-of-the-art performance, demonstrating that carefully designed physiological encoders, even with straightforward fusion, substantially improve downstream performance. These results highlight the potential of foundation-model approaches to harness the holistic nature of physiological signals, enabling scalable, label-efficient, and generalizable solutions for healthcare and affective computing.

58. Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

Authors: Yupeng Zhang , Adam G. Dunn , Usman Naseem , Jinman Kim
URL: https://arxiv.org/abs/2512.15249
Abstract:

Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model’s decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $\Delta$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $\Delta$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.

59. Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024

Authors: Yash Bhaskar , Parameswari Krishnamurthy
URL: https://arxiv.org/abs/2512.15226
Abstract:

This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024 (Pakray et al., 2024), focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 (Xue et al., 2020) and IndicBart (Dabre et al., 2021) in both multilingual and monolingual settings, LoRA (Hu et al., 2021) fine-tuning IndicTrans2 (Gala et al., 2023), zero-shot and few-shot prompting (Brown, 2020) with large language models (LLMs) like Llama 3 (Dubey et al., 2024) and Mixtral 8x7b (Jiang et al., 2024), LoRA supervised fine-tuning of Llama 3 (Mecklenburg et al., 2024), and training Transformer models (Vaswani, 2017) from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task test data using SacreBLEU (Post, 2018) and CHRF (Popovic, 2015), highlighting the challenges of low-resource translation and the potential of LLMs for these tasks, particularly with fine-tuning.

60. RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA

Authors: Chao Zhang , Minghan Li , Tianrui Lv , Guodong Zhou
URL: https://arxiv.org/abs/2512.15219
Abstract:

Large language models (LLMs) often generate hallucinations in knowledge-intensive QA due to parametric knowledge limitations. While existing methods like KG-CoT improve reliability by integrating knowledge graph (KG) paths, they suffer from rigid hop-count selection (solely question-driven) and underutilization of reasoning paths (lack of guidance). To address this, we propose RFKG-CoT: First, it replaces the rigid hop-count selector with a relation-driven adaptive hop-count selector that dynamically adjusts reasoning steps by activating KG relations (e.g., 1-hop for direct “brother” relations, 2-hop for indirect “father-son” chains), formalized via a relation mask. Second, it introduces a few-shot in-context learning path guidance mechanism with CoT (think) that constructs examples in a “question-paths-answer” format to enhance LLMs’ ability to understand reasoning paths. Experiments on four KGQA benchmarks show RFKG-CoT improves accuracy by up to 14.7 pp (Llama2-7B on WebQSP) over KG-CoT. Ablations confirm the hop-count selector and the path prompt are complementary, jointly transforming KG evidence into more faithful answers.

61. Governing rapid technological change: Policy Delphi on the future of European AI governance

Authors: Atte Ojanen , Johannes Anttila , Thilo H. K. Thelitz , Anna Bjork
URL: https://arxiv.org/abs/2512.15196
Abstract:

The rapid advancements in artificial intelligence (AI) present unique challenges for policymakers that seek to govern the technology. In this context, the Delphi method has become an established way to identify consensus and disagreement on emerging technological issues among experts in the field of futures studies and foresight. The aim of this article is twofold: first, it examines key tensions experts see in the development of AI governance in Europe, and second, it reflects on the Delphi method’s capacity to inform anticipatory governance of emerging technologies like AI based on these insights. The analysis is based on the results of a two-round Policy Delphi study on the future of AI governance with European policymakers, researchers and NGOs, conducted in mid-2024. The Policy Delphi proved useful in revealing diverse perspectives on European AI governance, drawing out a consensus that future-proof AI regulation will likely depend more on practical implementation and enforcement of legislation than on its technical specifics or scope. Furthermore, the study identified a desirability-probability gap in AI governance: desirable policy directions, like greater citizen participation, were perceived as less probable and feasible. This highlights a tension between desirable regulatory oversight and the practical difficulty for regulation to keep up with technological change.

62. DEER: Draft with Diffusion, Verify with Autoregressive Models

Authors: Zicong Cheng , Guo-Wei Yang , Jia Li , Zhijie Deng , Meng-Hao Guo , Shi-Min Hu
URL: https://arxiv.org/abs/2512.15176
Abstract:

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at this https URL

63. MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Authors: Xuanjun Zong , Zhiqi Shen , Lei Wang , Yunshi Lan , Chao Yang
URL: https://arxiv.org/abs/2512.15163
Abstract:

Large language models (LLMs) are evolving into agentic systems that reason, plan, and operate external tools. The Model Context Protocol (MCP) is a key enabler of this transition, offering a standardized interface for connecting LLMs with heterogeneous tools and services. Yet MCP’s openness and multi-server workflows introduce new safety risks that existing benchmarks fail to capture, as they focus on isolated attacks or lack real-world coverage. We present MCP-SafetyBench, a comprehensive benchmark built on real MCP servers that supports realistic multi-turn evaluation across five domains: browser automation, financial analysis, location navigation, repository management, and web search. It incorporates a unified taxonomy of 20 MCP attack types spanning server, host, and user sides, and includes tasks requiring multi-step reasoning and cross-server coordination under uncertainty. Using MCP-SafetyBench, we systematically evaluate leading open- and closed-source LLMs, revealing large disparities in safety performance and escalating vulnerabilities as task horizons and server interactions grow. Our results highlight the urgent need for stronger defenses and establish MCP-SafetyBench as a foundation for diagnosing and mitigating safety risks in real-world MCP deployments.

64. Offline Multi-Task Multi-Objective Data-Driven Evolutionary Algorithm with Language Surrogate Model and Implicit Q-Learning

Authors: Xian-Rong Zhang , Yue-Jiao Gong , Zeyuan Ma , Jun Zhang
URL: https://arxiv.org/abs/2512.15149
Abstract:

Data-driven evolutionary algorithms has shown surprising results in addressing expensive optimization problems through robust surrogate modeling. Though promising, existing surrogate modeling schemes may encounter limitations in complex optimization problems with many sub-objectives, which rely on repeated and tedious approximation. To address such technical gap, we propose Q-MetaSur as a plug-and-play surrogate modeling scheme capable of providing unified and generalized surrogate learning. Specifically, we consider multi-task-multi-objective optimization~(MTMOO) in offline setting. Several key designs are proposed: 1) we transform objective approximation into sequence-to-sequence modeling where MTMOO problem can be represented by tenxual tokenization. To operate under such auto-regressive modeling, we introduce a Large Language Model-based surrogate model that first encodes a MTMOO instance and then decodes objective values of unseen decision variables. To ensure stability in training the proposed model, we propose a two-stage offline training strategy that operates as a synergy of supervised tuning and RL fine-tuning, which first exploits offline dataset to fit existing knowledge and then leverages RL to enhance model’s generalization performance. Extensive empirical results on the CEC2019 benchmark demonstrate that Q-MetaSur not only outperforms representative surrogate baselines in objective approximation accuracy, but also helps underlying evolutionary algorithms achieve both desired optimization convergence and improved pareto optimality.

65. From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Authors: Aaron Mueller , Andrew Lee , Shruti Joshi , Ekdeep Singh Lubana , Dhanya Sridhar , Patrik Reizinger
URL: https://arxiv.org/abs/2512.15134
Abstract:

A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features: features correspond to no more than one concept, but concepts are distributed across many features. Then, we perform steering experiments, measuring whether each concept is independently manipulable. Even when trained on uniform distributions of concepts, SAE features generally affect many concepts when steered, indicating that they are neither selective nor independent; nonetheless, features affect disjoint subspaces. These results suggest that correlational metrics for measuring disentanglement are generally not sufficient for establishing independence when steering, and that affecting disjoint subspaces is not sufficient for concept selectivity. These results underscore the importance of compositional evaluations in interpretability research.

66. HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Authors: Yi Zhou , Haohao Qu , Yunqing Liu , Shanru Lin , Le Song , Wenqi Fan
URL: https://arxiv.org/abs/2512.15133
Abstract:

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

67. Automatic Reward Shaping from Multi-Objective Human Heuristics

Authors: Yuqing Xie , Jiayu Chen , Wenhao Tang , Ya Zhang , Chao Yu , Yu Wang
URL: https://arxiv.org/abs/2512.15120
Abstract:

Designing effective reward functions remains a central challenge in reinforcement learning, especially in multi-objective environments. In this work, we propose Multi-Objective Reward Shaping with Exploration (MORSE), a general framework that automatically combines multiple human-designed heuristic rewards into a unified reward function. MORSE formulates the shaping process as a bi-level optimization problem: the inner loop trains a policy to maximize the current shaped reward, while the outer loop updates the reward function to optimize task performance. To encourage exploration in the reward space and avoid suboptimal local minima, MORSE introduces stochasticity into the shaping process, injecting noise guided by task performance and the prediction error of a fixed, randomly initialized neural network. Experimental results in MuJoCo and Isaac Sim environments show that MORSE effectively balances multiple objectives across various robotic tasks, achieving task performance comparable to those obtained with manually tuned reward functions.

68. QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management

Authors: Jiayang Wan , Ke He , Yafei Wang , Fan Liu , Wenjin Wang , Shi Jin
URL: https://arxiv.org/abs/2512.15119
Abstract:

Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable threedimensional coverage. Towards that end, the space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q-network (DDQN) algorithm in the top-level, which achieves stable and high-quality policy learning through double Q-value estimation. To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum-entropy mechanism of the soft actor-critic (SAC) and employ a Lagrangian-based constrained SAC (CSAC) algorithm in the lower-level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.

69. I am here for you”: How relational conversational AI appeals to adolescents, especially those who are socially and emotionally vulnerable

Authors: Pilyoung Kim , Yun Xie , Sujin Yang
URL: https://arxiv.org/abs/2512.15117
Abstract:

General-purpose conversational AI chatbots and AI companions increasingly provide young adolescents with emotionally supportive conversations, raising questions about how conversational style shapes anthropomorphism and emotional reliance. In a preregistered online experiment with 284 adolescent-parent dyads, youth aged 11-15 and their parents read two matched transcripts in which a chatbot responded to an everyday social problem using either a relational style (first-person, affiliative, commitment language) or a transparent style (explicit nonhumanness, informational tone). Adolescents more often preferred the relational than the transparent style, whereas parents were more likely to prefer transparent style than adolescents. Adolescents rated the relational chatbot as more human-like, likable, trustworthy and emotionally close, while perceiving both styles as similarly helpful. Adolescents who preferred relational style had lower family and peer relationship quality and higher stress and anxiety than those preferring transparent style or both chatbots. These findings identify conversational style as a key design lever for youth AI safety, showing that relational framing heightens anthropomorphism, trust and emotional closeness and can be especially appealing to socially and emotionally vulnerable adolescents, who may be at increased risk for emotional reliance on conversational AI.

70. FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

Authors: Runze Li , Hanchen Wang , Wenjie Zhang , Binghao Li , Yu Zhang , Xuemin Lin , Ying Zhang
URL: https://arxiv.org/abs/2512.15116
Abstract:

Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at this https URL

71. How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

Authors: Ali Ghodsi
URL: https://arxiv.org/abs/2512.15115
Abstract:

Sequence modeling has produced diverse architectures – from classical recurrent neural networks to modern Transformers and state space models (SSMs) – yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.

72. Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption

Authors: Sunwoo Kim , Soo Yong Lee , Kyungho Kim , Hyunjin Hwang , Jaemin Yoo , Kijung Shin
URL: https://arxiv.org/abs/2512.15112
Abstract:

Unsupervised node representation learning aims to obtain meaningful node embeddings without relying on node labels. To achieve this, graph convolution, which aggregates information from neighboring nodes, is commonly employed to encode node features and graph topology. However, excessive reliance on graph convolution can be suboptimal-especially in non-homophilic graphs-since it may yield unduly similar embeddings for nodes that differ in their features or topological properties. As a result, adjusting the degree of graph convolution usage has been actively explored in supervised learning settings, whereas such approaches remain underexplored in unsupervised scenarios. To tackle this, we propose FUEL, which adaptively learns the adequate degree of graph convolution usage by aiming to enhance intra-class similarity and inter-class separability in the embedding space. Since classes are unknown, FUEL leverages node features to identify node clusters and treats these clusters as proxies for classes. Through extensive experiments using 15 baseline methods and 14 benchmark datasets, we demonstrate the effectiveness of FUEL in downstream tasks, achieving state-of-the-art performance across graphs with diverse levels of homophily.

73. Quantifying Return on Security Controls in LLM Systems

Authors: Richard Helder Moulton , Austin O’Brien , John D. Hastings
URL: https://arxiv.org/abs/2512.15081
Abstract:

Although large language models (LLMs) are increasingly used in security-critical workflows, practitioners lack quantitative guidance on which safeguards are worth deploying. This paper introduces a decision-oriented framework and reproducible methodology that together quantify residual risk, convert adversarial probe outcomes into financial risk estimates and return-on-control (RoC) metrics, and enable monetary comparison of layered defenses for LLM-based systems. A retrieval-augmented generation (RAG) service is instantiated using the DeepSeek-R1 model over a corpus containing synthetic personally identifiable information (PII), and subjected to automated attacks with Garak across five vulnerability classes: PII leakage, latent context injection, prompt injection, adversarial attack generation, and divergence. For each (vulnerability, control) pair, attack success probabilities are estimated via Laplace’s Rule of Succession and combined with loss triangle distributions, calibrated from public breach-cost data, in 10,000-run Monte Carlo simulations to produce loss exceedance curves and expected losses. Three widely used mitigations, attribute-based access control (ABAC); named entity recognition (NER) redaction using Microsoft Presidio; and NeMo Guardrails, are then compared to a baseline RAG configuration. The baseline system exhibits very high attack success rates (>= 0.98 for PII, latent injection, and prompt injection), yielding a total simulated expected loss of $313k per attack scenario. ABAC collapses success probabilities for PII and prompt-related attacks to near zero and reduces the total expected loss by ~94%, achieving an RoC of 9.83. NER redaction likewise eliminates PII leakage and attains an RoC of 5.97, while NeMo Guardrails provides only marginal benefit (RoC of 0.05).

Authors: Ziyu Shang , Haoran Liu , Rongchao Zhang , Zhiqian Wei , Tongtong Feng
URL: https://arxiv.org/abs/2512.15069
Abstract:

Generating consistent human images with controllable pose and appearance is essential for applications in virtual try on, image editing, and digital human creation. Current methods often suffer from occlusions, garment style drift, and pose misalignment. We propose Pose-guided Multi-view Multimodal Diffusion (PMMD), a diffusion framework that synthesizes photorealistic person images conditioned on multi-view references, pose maps, and text prompts. A multimodal encoder jointly models visual views, pose features, and semantic descriptions, which reduces cross modal discrepancy and improves identity fidelity. We further design a ResCVA module to enhance local detail while preserving global structure, and a cross modal fusion module that integrates image semantics with text throughout the denoising pipeline. Experiments on the DeepFashion MultiModal dataset show that PMMD outperforms representative baselines in consistency, detail preservation, and controllability. Project page and code are available at this https URL .

75. The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems

Authors: Debu Sinha
URL: https://arxiv.org/abs/2512.15068
Abstract:

Retrieval-Augmented Generation (RAG) systems remain susceptible to hallucinations despite grounding in retrieved evidence. Current detection methods rely on semantic similarity and natural language inference (NLI), but their fundamental limitations have not been rigorously characterized. We apply conformal prediction to hallucination detection, providing finite-sample coverage guarantees that enable precise quantification of detection capabilities. Using calibration sets of approximately 600 examples, we achieve 94% coverage with 0% false positive rate on synthetic hallucinations (Natural Questions). However, on three real hallucination benchmarks spanning multiple LLMs (GPT-4, ChatGPT, GPT-3, Llama-2, Mistral), embedding-based methods - including state-of-the-art OpenAI text-embedding-3-large and cross-encoder models - exhibit unacceptable false positive rates: 100% on HaluEval, 88% on RAGTruth, and 50% on WikiBio. Crucially, GPT-4 as an LLM judge achieves only 7% FPR (95% CI: [3.4%, 13.7%]) on the same data, proving the task is solvable through reasoning. We term this the “semantic illusion”: semantically plausible hallucinations preserve similarity to source documents while introducing factual errors invisible to embeddings. This limitation persists across embedding architectures, LLM generators, and task types, suggesting embedding-based detection is insufficient for production RAG deployment.

76. EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks

Authors: Zijiang Yan , Yixiang Huang , Jianhua Pei , Hina Tabassum , Luca Chiaraviglio
URL: https://arxiv.org/abs/2512.15067
Abstract:

The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. The EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS), 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.

77. Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank

Authors: Chenxiao Zhang , Runshi Zhang , Junchen Wang
URL: https://arxiv.org/abs/2512.15066
Abstract:

Medical ultrasound videos are widely used for medical inspections, disease diagnosis and surgical planning. High-fidelity lesion area and target organ segmentation constitutes a key component of the computer-assisted surgery workflow. The low contrast levels and noisy backgrounds of ultrasound videos cause missegmentation of organ boundary, which may lead to small object losses and increase boundary segmentation errors. Object tracking in long videos also remains a significant research challenge. To overcome these challenges, we propose a memory bank-based wavelet filtering and fusion network, which adopts an encoder-decoder structure to effectively extract fine-grained detailed spatial features and integrate high-frequency (HF) information. Specifically, memory-based wavelet convolution is presented to simultaneously capture category, detailed information and utilize adjacent information in the encoder. Cascaded wavelet compression is used to fuse multiscale frequency-domain features and expand the receptive field within each convolutional layer. A long short-term memory bank using cross-attention and memory compression mechanisms is designed to track objects in long video. To fully utilize the boundary-sensitive HF details of feature maps, an HF-aware feature fusion module is designed via adaptive wavelet filters in the decoder. In extensive benchmark tests conducted on four ultrasound video datasets (two thyroid nodule, the thyroid gland, the heart datasets) compared with the state-of-the-art methods, our method demonstrates marked improvements in segmentation metrics. In particular, our method can more accurately segment small thyroid nodules, demonstrating its effectiveness for cases involving small ultrasound objects in long video. The code is available at this https URL .

78. Meta-learners for few-shot weakly-supervised optic disc and cup segmentation on fundus images

Authors: Pandega Abyan Zumarsyah , Igi Ardiyanto , Hanung Adi Nugroho
URL: https://arxiv.org/abs/2512.15061
Abstract:

This study develops meta-learners for few-shot weakly-supervised segmentation (FWS) to address the challenge of optic disc (OD) and optic cup (OC) segmentation for glaucoma diagnosis with limited labeled fundus images. We significantly improve existing meta-learners by introducing Omni meta-training which balances data usage and diversifies the number of shots. We also develop their efficient versions that reduce computational costs. In addition, we develop sparsification techniques that generate more customizable and representative scribbles and other sparse labels. After evaluating multiple datasets, we find that Omni and efficient versions outperform the original versions, with the best meta-learner being Efficient Omni ProtoSeg (EO-ProtoSeg). It achieves intersection over union (IoU) scores of 88.15% for OD and 71.17% for OC on the REFUGE dataset using just one sparsely labeled image, outperforming few-shot and semi-supervised methods which require more labeled images. Its best performance reaches 86.80% for OD and 71.78%for OC on DRISHTIGS, 88.21% for OD and 73.70% for OC on REFUGE, 80.39% for OD and 52.65% for OC on REFUGE. EO-ProtoSeg is comparable to unsupervised domain adaptation methods yet much lighter with less than two million parameters and does not require any retraining.

79. The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops

Authors: Fanzhe Fu
URL: https://arxiv.org/abs/2512.15053
Abstract:

The transition of Large Language Models (LLMs) from stochastic chat interfaces to reliable software components necessitates a fundamental re-engineering of interaction paradigms. Current methodologies, predominantly heuristic-based “prompt engineering,” fail to provide the deterministic guarantees required for mission-critical applications. We introduce the Meta-Prompting Protocol, a rigorous theoretical framework that formalizes the orchestration of LLMs as a programmable, self-optimizing system. Central to this protocol is the Adversarial Trinity, a tripartite topology comprising a Generator (P), an Auditor (A), and an Optimizer (O). By treating natural language instructions as differentiable variables within a semantic computation graph and utilizing textual critiques as gradients, this architecture mitigates hallucination and prevents model collapse. We demonstrate the theoretical viability of this approach using declarative programming paradigms (DSPy) and automatic textual differentiation (TextGrad), establishing a foundation for “Observable Software Engineering” in the era of probabilistic computing.

80. SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Authors: Hongbo Wang , MaungMaung AprilPyone , Isao Echizen
URL: https://arxiv.org/abs/2512.15052
Abstract:

Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2\% to 2.5\% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.

Authors: Yunheng Wang , Yixiao Feng , Yuetong Fang , Shuning Zhang , Tan Jing , Jian Li , Xiangrui Jiang , Renjing Xu
URL: https://arxiv.org/abs/2512.15047
Abstract:

3D Scene Graphs (3DSGs) constitute a powerful representation of the physical world, distinguished by their abilities to explicitly model the complex spatial, semantic, and functional relationships between entities, rendering a foundational understanding that enables agents to interact intelligently with their environment and execute versatile behaviors. Embodied navigation, as a crucial component of such capabilities, leverages the compact and expressive nature of 3DSGs to enable long-horizon reasoning and planning in complex, large-scale environments. However, prior works rely on a static-world assumption, defining traversable space solely based on static spatial layouts and thereby treating interactable obstacles as non-traversable. This fundamental limitation severely undermines their effectiveness in real-world scenarios, leading to limited reachability, low efficiency, and inferior extensibility. To address these issues, we propose HERO, a novel framework for constructing Hierarchical Traversable 3DSGs, that redefines traversability by modeling operable obstacles as pathways, capturing their physical interactivity, functional semantics, and the scene’s relational hierarchy. The results show that, relative to its baseline, HERO reduces PL by 35.1% in partially obstructed environments and increases SR by 79.4% in fully obstructed ones, demonstrating substantially higher efficiency and reachability.

82. Spectral Representation-based Reinforcement Learning

Authors: Chenxiao Gao , Haotian Sun , Na Li , Dale Schuurmans , Bo Dai
URL: https://arxiv.org/abs/2512.15036
Abstract:

In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.

83. Epistemic diversity across language models mitigates knowledge collapse

Authors: Damian Hodel , Jevin D. West
URL: https://arxiv.org/abs/2512.15011
Abstract:

The growing use of artificial intelligence (AI) raises concerns of knowledge collapse, i.e., a reduction to the most dominant and central set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own output. Inspired by ecology, we ask whether AI ecosystem diversity, that is, diversity among models, can mitigate such a collapse. We build on the single-model approach but focus on ecosystems of models trained on their collective output. To study the effect of diversity on model performance, we segment the training data across language models and evaluate the resulting ecosystems over ten, self-training iterations. We find that increased epistemic diversity mitigates collapse, but, interestingly, only up to an optimal level. Our results suggest that an ecosystem containing only a few diverse models fails to express the rich mixture of the full, true distribution, resulting in rapid performance decay. Yet distributing the data across too many models reduces each model’s approximation capacity on the true distribution, leading to poor performance already in the first iteration step. In the context of AI monoculture, our results suggest the need to monitor diversity across AI systems and to develop policies that incentivize more domain- and community-specific models.

84. Evaluating the Capability of Video Question Generation for Expert Knowledge Elicitation

Authors: Huaying Zhang , Atsushi Hashimoto , Tosho Hirasawa
URL: https://arxiv.org/abs/2512.15006
Abstract:

Skilled human interviewers can extract valuable information from experts. This raises a fundamental question: what makes some questions more effective than others? To address this, a quantitative evaluation of question-generation models is essential. Video question generation (VQG) is a topic for video question answering (VideoQA), where questions are generated for given answers. Their evaluation typically focuses on the ability to answer questions, rather than the quality of generated questions. In contrast, we focus on the question quality in eliciting unseen knowledge from human experts. For a continuous improvement of VQG models, we propose a protocol that evaluates the ability by simulating question-answering communication with experts using a question-to-answer retrieval. We obtain the retriever by constructing a novel dataset, EgoExoAsk, which comprises 27,666 QA pairs generated from Ego-Exo4D’s expert commentary annotation. The EgoExoAsk training set is used to obtain the retriever, and the benchmark is constructed on the validation set with Ego-Exo4D video segments. Experimental results demonstrate our metric reasonably aligns with question generation settings: models accessing richer context are evaluated better, supporting that our protocol works as intended. The EgoExoAsk dataset is available in this https URL .

85. DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding

Authors: Ruiyi Zhang , Peijia Qin , Qi Cao , Pengtao Xie
URL: https://arxiv.org/abs/2512.15000
Abstract:

Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps using a Chain-of-Function prompting strategy to induce modular code generation, enabling PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applying on test-time scaling, DreamPRM-Code achieved state-of-the-art performance on LiveCodeBench with 80.9 pass@1 rate, surpassing OpenAI o4-mini.

Authors: Sibi Parivendan , Kashfia Sailunaz , Suresh Neethirajan
URL: https://arxiv.org/abs/2512.14998
Abstract:

Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.50: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.

87. Where is the Watermark? Interpretable Watermark Detection at the Block Level

Authors: Maria Bulychev , Neil G. Marchant , Benjamin I. P. Rubinstein
URL: https://arxiv.org/abs/2512.14994
Abstract:

Recent advances in generative AI have enabled the creation of highly realistic digital content, raising concerns around authenticity, ownership, and misuse. While watermarking has become an increasingly important mechanism to trace and protect digital media, most existing image watermarking schemes operate as black boxes, producing global detection scores without offering any insight into how or where the watermark is present. This lack of transparency impacts user trust and makes it difficult to interpret the impact of tampering. In this paper, we present a post-hoc image watermarking method that combines localised embedding with region-level interpretability. Our approach embeds watermark signals in the discrete wavelet transform domain using a statistical block-wise strategy. This allows us to generate detection maps that reveal which regions of an image are likely watermarked or altered. We show that our method achieves strong robustness against common image transformations while remaining sensitive to semantic manipulations. At the same time, the watermark remains highly imperceptible. Compared to prior post-hoc methods, our approach offers more interpretable detection while retaining competitive robustness. For example, our watermarks are robust to cropping up to half the image.

88. Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent

Authors: Mehil B Shah , Mohammad Masudur Rahman , Foutse Khomh
URL: https://arxiv.org/abs/2512.14990
Abstract:

Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants’ cognitive load.

89. Evaluating Large Language Models on Multimodal Chemistry Olympiad Exams

Authors: Yiming Cui , Xin Yao , Yuxuan Qin , Xin Li , Shijin Wang , Guoping Hu
URL: https://arxiv.org/abs/2512.14989
Abstract:

Multimodal scientific reasoning remains a significant challenge for large language models (LLMs), particularly in chemistry, where problem-solving relies on symbolic diagrams, molecular structures, and structured visual data. Here, we systematically evaluate 40 proprietary and open-source multimodal LLMs, including GPT-5, o3, Gemini-2.5-Pro, and Qwen2.5-VL, on a curated benchmark of Olympiad-style chemistry questions drawn from over two decades of U.S. National Chemistry Olympiad (USNCO) exams. These questions require integrated visual and textual reasoning across diverse modalities. We find that many models struggle with modality fusion, where in some cases, removing the image even improves accuracy, indicating misalignment in vision-language integration. Chain-of-Thought prompting consistently enhances both accuracy and visual grounding, as demonstrated through ablation studies and occlusion-based interpretability. Our results reveal critical limitations in the scientific reasoning abilities of current MLLMs, providing actionable strategies for developing more robust and interpretable multimodal systems in chemistry. This work provides a timely benchmark for measuring progress in domain-specific multimodal AI and underscores the need for further advances at the intersection of artificial intelligence and scientific reasoning.

90. Prompt Repetition Improves Non-Reasoning LLMs

Authors: Yaniv Leviathan , Matan Kalman , Yossi Matias
URL: https://arxiv.org/abs/2512.14982
Abstract:

When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.

91. EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving

Authors: Shaoting Feng , Yuhan Liu , Hanchen Li , Xiaokun Chen , Samuel Shen , Kuntai Du , Zhuohan Gu , Rui Zhang , Yuyang Huang , Yihua Cheng , Jiayi Yao , Qizheng Zhang , Ganesh Ananthanarayanan , Junchen Jiang
URL: https://arxiv.org/abs/2512.14946
Abstract:

Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS’s profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.

92. TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation

Authors: Zhenzhi Wang , Jian Wang , Ke Ma , Dahua Lin , Bing Zhou
URL: https://arxiv.org/abs/2512.14938
Abstract:

We introduce TalkVerse, a large-scale, open corpus for single-person, audio-driven talking video generation designed to enable fair, reproducible comparison across methods. While current state-of-the-art systems rely on closed data or compute-heavy models, TalkVerse offers 2.3 million high-resolution (720p/1080p) audio-video synchronized clips totaling 6.3k hours. These are curated from over 60k hours of video via a transparent pipeline that includes scene-cut detection, aesthetic assessment, strict audio-visual synchronization checks, and comprehensive annotations including 2D skeletons and structured visual/audio-style captions. Leveraging TalkVerse, we present a reproducible 5B DiT baseline built on Wan2.2-5B. By utilizing a video VAE with a high downsampling ratio and a sliding window mechanism with motion-frame context, our model achieves minute-long generation with low drift. It delivers comparable lip-sync and visual quality to the 14B Wan-S2V model but with 10$\times$ lower inference cost. To enhance storytelling in long videos, we integrate an MLLM director to rewrite prompts based on audio and visual cues. Furthermore, our model supports zero-shot video dubbing via controlled latent noise injection. We open-source the dataset, training recipes, and 5B checkpoints to lower barriers for research in audio-driven human video generation. Project Page: this https URL

93. Improving Pre-trained Segmentation Models using Post-Processing

Authors: Abhijeet Parida , Daniel Capellán-Martín , Zhifan Jiang , Nishad Kulkarni , Krithika Iyer , Austin Tapp , Syed Muhammad Anwar , María J. Ledesma-Carbayo , Marius George Linguraru
URL: https://arxiv.org/abs/2512.14937
Abstract:

Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median survival rate is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities in slices. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pretrained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9 % for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

94. Restless Multi-Process Multi-Armed Bandits with Applications to Self-Driving Microscopies

Authors: Jaume Anguera Peris , Songtao Cheng , Hanzhao Zhang , Wei Ouyang , Joakim Jaldén
URL: https://arxiv.org/abs/2512.14930
Abstract:

High-content screening microscopy generates large amounts of live-cell imaging data, yet its potential remains constrained by the inability to determine when and where to image most effectively. Optimally balancing acquisition time, computational capacity, and photobleaching budgets across thousands of dynamically evolving regions of interest remains an open challenge, further complicated by limited field-of-view adjustments and sensor sensitivity. Existing approaches either rely on static sampling or heuristics that neglect the dynamic evolution of biological processes, leading to inefficiencies and missed events. Here, we introduce the restless multi-process multi-armed bandit (RMPMAB), a new decision-theoretic framework in which each experimental region is modeled not as a single process but as an ensemble of Markov chains, thereby capturing the inherent heterogeneity of biological systems such as asynchronous cell cycles and heterogeneous drug responses. Building upon this foundation, we derive closed-form expressions for transient and asymptotic behaviors of aggregated processes, and design scalable Whittle index policies with sub-linear complexity in the number of imaging regions. Through both simulations and a real biological live-cell imaging dataset, we show that our approach achieves substantial improvements in throughput under resource constraints. Notably, our algorithm outperforms Thomson Sampling, Bayesian UCB, epsilon-Greedy, and Round Robin by reducing cumulative regret by more than 37% in simulations and capturing 93% more biologically relevant events in live imaging experiments, underscoring its potential for transformative smart microscopy. Beyond improving experimental efficiency, the RMPMAB framework unifies stochastic decision theory with optimal autonomous microscopy control, offering a principled approach to accelerate discovery across multidisciplinary sciences.

95. Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models

Authors: George-Andrei Dima , Dumitru-Clementin Cercel
URL: https://arxiv.org/abs/2512.14926
Abstract:

Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.

96. DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

Authors: Houman Kazemzadeh , Kiarash Mokhtari Dizaji , Seyed Reza Tavakoli , Farbod Davoodi , MohammadReza KarimiNejad , Parham Abed Azad , Ali Sabzi , Armin Khosravi , Siavash Ahmadi , Mohammad Hossein Rohban , Glolamali Aminian , Tahereh Javaheri
URL: https://arxiv.org/abs/2512.14896
Abstract:

Objectives: To evaluate large language model (LLM) performance on pharmacy licensure-style question-answering (QA) tasks and develop an external knowledge integration method to improve their accuracy. Methods: We benchmarked eleven existing LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset. We measured baseline accuracy for each model without modification. We then developed a three-step retrieval-augmented generation (RAG) pipeline, DrugRAG, that retrieves structured drug knowledge from validated sources and augments model prompts with evidence-based context. This pipeline operates externally to the models, requiring no changes to model architecture or parameters. Results: Baseline accuracy ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores. Models with fewer than 8 billion parameters scored below 50%. DrugRAG improved accuracy across all tested models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61% to 71%, Llama 3.1 8B: 46% to 67%) on the 141-item benchmark. Conclusion: We demonstrate that external structured drug knowledge integration through DrugRAG measurably improves LLM accuracy on pharmacy tasks without modifying the underlying models. This approach provides a practical pipeline for enhancing pharmacy-focused AI applications with evidence-based information.

97. Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections

Authors: Niklas Lauffer , Xiang Deng , Srivatsa Kundurthy , Brad Kenstler , Jeff Da
URL: https://arxiv.org/abs/2512.14895
Abstract:

A popular paradigm for training LM agents relies on imitation learning, fine-tuning on expert trajectories. However, we show that the off-policy nature of imitation learning for multi-turn LM agents suffers from the fundamental limitation known as covariate shift: as the student policy’s behavior diverges from the expert’s, it encounters states not present in the training data, reducing the effectiveness of fine-tuning. Taking inspiration from the classic DAgger algorithm, we propose a novel data generation methodology for addressing covariate shift for multi-turn LLM training. We introduce on-policy expert corrections (OECs), partially on-policy data generated by starting rollouts with a student model and then switching to an expert model part way through the trajectory. We explore the effectiveness of our data generation technique in the domain of software engineering (SWE) tasks, a multi-turn setting where LLM agents must interact with a development environment to fix software bugs. Our experiments compare OEC data against various other on-policy and imitation learning approaches on SWE agent problems and train models using a common rejection sampling (i.e., using environment reward) combined with supervised fine-tuning technique. Experiments find that OEC trajectories show a relative 14% and 13% improvement over traditional imitation learning in the 7b and 32b setting, respectively, on SWE-bench verified. Our results demonstrate the need for combining expert demonstrations with on-policy data for effective multi-turn LM agent training.

98. OLR-WA: Online Weighted Average Linear Regression in Multivariate Data Streams

Authors: Mohammad Abu-Shaira , Alejandro Rodriguez , Greg Speegle , Victor Sheng , Ishfaq Ahmad
URL: https://arxiv.org/abs/2512.14892
Abstract:

Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce “OLR-WA; OnLine Regression with Weighted Average”, a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results of OLR-WA demonstrate its ability to achieve performance comparable to the batch regression, while also showcasing comparable or superior performance when compared with other state-of-the-art online models, thus establishing its effectiveness. Moreover, OLR-WA exhibits exceptional performance in terms of rapid convergence, surpassing other online models with consistently achieving high r2 values as a performance measure from the first iteration to the last iteration, even when initialized with minimal amount of data points, as little as 1% to 10% of the total data points. In addition to its ability to handle time-based (temporal drift) scenarios, remarkably, OLR-WA stands out as the only model capable of effectively managing confidence-based challenging scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. In summary, OLR-WA’s performance further solidifies its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.

99. Integrating Large Language Models and Knowledge Graphs to Capture Political Viewpoints in News Media

Authors: Massimiliano Fadda , Enrico Motta , Francesco Osborne , Diego Reforgiato Recupero , Angelo Salatino
URL: https://arxiv.org/abs/2512.14887
Abstract:

News sources play a central role in democratic societies by shaping political and social discourse through specific topics, viewpoints and voices. Understanding these dynamics is essential for assessing whether the media landscape offers a balanced and fair account of public debate. In earlier work, we introduced a pipeline that, given a news corpus, i) uses a hybrid human-machine approach to identify the range of viewpoints expressed about a given topic, and ii) classifies relevant claims with respect to the identified viewpoints, defined as sets of semantically and ideologically congruent claims (e.g., positions arguing that immigration positively impacts the UK economy). In this paper, we improve this pipeline by i) fine-tuning Large Language Models (LLMs) for viewpoint classification and ii) enriching claim representations with semantic descriptions of relevant actors drawn from Wikidata. We evaluate our approach against alternative solutions on a benchmark centred on the UK immigration debate. Results show that while both mechanisms independently improve classification performance, their integration yields the best results, particularly when using LLMs capable of processing long inputs.

100. Entropy-Reservoir Bregman Projection: An Information-Geometric Unification of Model Collapse

Authors: Jingwei Chen
URL: https://arxiv.org/abs/2512.14879
Abstract:

Self-referential learning – training a model on data it generated itself – promises boundless scalability but chronically suffers from model collapse: language models degenerate into repetitive text, GANs drop modes, and reinforcement-learning policies over-exploit. Although practitioners employ ad~hoc fixes such as real-data mixing, entropy bonuses, knowledge distillation, or retrieval-augmented generation, a single principle that explains both the failure mode and the success of these fixes has remained elusive. We present Entropy-Reservoir Bregman Projection (ERBP), an information-geometric framework that unifies these phenomena. We model the closed loop as a stochastic Bregman projection sequence in distribution space. Without external coupling, finite-sample noise forces the system to project onto an ever-shrinking empirical support, causing exponential entropy decay and eventual collapse. Introducing an Entropy Reservoir – a high-entropy distribution mixed into each projection – injects a controllable entropy flux that provably stabilises the dynamics. Our theory yields (i) a necessary condition for collapse, (ii) a sufficient condition that guarantees a non-trivial entropy floor, and (iii) closed-form rates that depend only on sample size and the strong-convexity/Lipschitz constants of the Bregman generator. Experiments on large-language-model self-training, Soft Actor-Critic in reinforcement learning, and GAN optimisation validate our predictions and show that disparate stabilisation heuristics correspond to specific reservoir choices and coupling coefficients. ERBP thus transforms a collection of folk remedies into a single, quantitative design rule: monitor and budget your entropy flux.

101. Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

Authors: Viet K. Nguyen , Mohammad I. Husain
URL: https://arxiv.org/abs/2512.14860
Abstract:

Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI’s 30.8%, while model performance ranges from Nova Pro’s 46.2% to Claude and Grok 2’s 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel “hallucinated compliance” strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.

102. A Roadmap for Applying Graph Neural Networks to Numerical Data: Insights from Cementitious Materials

Authors: Mahmuda Sharmin , Taihao Han , Jie Huang , Narayanan Neithalath , Gaurav Sant , Aditya Kumar
URL: https://arxiv.org/abs/2512.14855
Abstract:

Machine learning (ML) has been increasingly applied in concrete research to optimize performance and mixture design. However, one major challenge in applying ML to cementitious materials is the limited size and diversity of available databases. A promising solution is the development of multi-modal databases that integrate both numerical and graphical data. Conventional ML frameworks in cement research are typically restricted to a single data modality. Graph neural network (GNN) represents a new generation of neural architectures capable of learning from data structured as graphs, capturing relationships through irregular or topology-dependent connections rather than fixed spatial coordinates. While GNN is inherently designed for graphical data, they can be adapted to extract correlations from numerical datasets and potentially embed physical laws directly into their architecture, enabling explainable and physics-informed predictions. This work is among the first few studies to implement GNNs to design concrete, with a particular emphasis on establishing a clear and reproducible pathway for converting tabular data into graph representations using the k-nearest neighbor (K-NN) approach. Model hyperparameters and feature selection are systematically optimized to enhance prediction performance. The GNN shows performance comparable to the benchmark random forest, which has been demonstrated by many studies to yield reliable predictions for cementitious materials. Overall, this study provides a foundational roadmap for transitioning from traditional ML to advanced AI architectures. The proposed framework establishes a strong foundation for future multi-modal and physics-informed GNN models capable of capturing complex material behaviors and accelerating the design and optimization of cementitious materials.

103. MALCDF: A Distributed Multi-Agent LLM Framework for Real-Time Cyber

Authors: Arth Bhardwaj , Sia Godika , Yuvam Loonker
URL: https://arxiv.org/abs/2512.14846
Abstract:

Traditional, centralized security tools often miss adaptive, multi-vector attacks. We present the Multi-Agent LLM Cyber Defense Framework (MALCDF), a practical setup where four large language model (LLM) agents-Detection, Intelligence, Response, and Analysis-work together in real time. Agents communicate over a Secure Communication Layer (SCL) with encrypted, ontology-aligned messages, and produce audit-friendly outputs (e.g., MITRE ATT&CK mappings). For evaluation, we keep the test simple and consistent: all reported metrics come from the same 50-record live stream derived from the CICIDS2017 feature schema. CICIDS2017 is used for configuration (fields/schema) and to train a practical ML baseline. The ML-IDS baseline is a Lightweight Random Forest IDS (LRF-IDS) trained on a subset of CICIDS2017 and tested on the 50-record stream, with no overlap between training and test records. In experiments, MALCDF reaches 90.0% detection accuracy, 85.7% F1-score, and 9.1% false-positive rate, with 6.8s average per-event latency. It outperforms the lightweight ML-IDS baseline and a single-LLM setup on accuracy while keeping end-to-end outputs consistent. Overall, this hands-on build suggests that coordinating simple LLM agents with secure, ontology-aligned messaging can improve practical, real-time cyber defense.

104. Let the Barbarians In: How AI Can Accelerate Systems Performance Research

Authors: Audrey Cheng , Shu Liu , Melissa Pan , Zhifei Li , Shubham Agarwal , Mert Cemri , Bowen Wang , Alexander Krentsel , Tian Xia , Jongseok Park , Shuo Yang , Jeff Chen , Lakshya Agrawal , Ashwin Naren , Shulu Li , Ruiying Ma , Aditya Desai , Jiarong Xing , Koushik Sen , Matei Zaharia , Ion Stoica
URL: https://arxiv.org/abs/2512.14806
Abstract:

Artificial Intelligence (AI) is beginning to transform the research process by automating the discovery of new solutions. This shift depends on the availability of reliable verifiers, which AI-driven approaches require to validate candidate solutions. Research focused on improving systems performance is especially well-suited to this paradigm because system performance problems naturally admit such verifiers: candidates can be implemented in real systems or simulators and evaluated against predefined workloads. We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems (ADRS). Using several open-source ADRS instances (i.e., OpenEvolve, GEPA, and ShinkaEvolve), we demonstrate across ten case studies (e.g., multi-region cloud scheduling, mixture-of-experts load balancing, LLM-based SQL, transaction scheduling) that ADRS-generated solutions can match or even outperform human state-of-the-art designs. Based on these findings, we outline best practices (e.g., level of prompt specification, amount of feedback, robust evaluation) for effectively using ADRS, and we discuss future research directions and their implications. Although we do not yet have a universal recipe for applying ADRS across all of systems research, we hope our preliminary findings, together with the challenges we identify, offer meaningful guidance for future work as researcher effort shifts increasingly toward problem formulation and strategic oversight. Note: This paper is an extension of our prior work [14]. It adds extensive evaluation across multiple ADRS frameworks and provides deeper analysis and insights into best practices.

Authors: Ellie Y. Cheng , Logan Weber , Tian Jin , Michael Carbin
URL: https://arxiv.org/abs/2512.14805
Abstract:

The rise of large language models (LLMs) has introduced a new type of programming: natural language programming. By writing prompts that direct LLMs to perform natural language processing, code generation, reasoning, etc., users are writing code in natural language – natural language code – for the LLM to execute. An emerging area of research enables interoperability between natural language code and formal languages such as Python. We present a novel programming abstraction, shared program state, that removes the manual work required to enable interoperability between natural language code and program state. With shared program state, programmers can write natural code that directly writes program variables, computes with program objects, and implements control flow in the program. We present a schema for specifying natural function interfaces that extend programming systems to support natural code and leverage this schema to specify shared program state as a natural function interface. We implement shared program state in the Nightjar programming system. Nightjar enables programmers to write Python programs that contain natural code that shares the Python program state. We show that Nightjar programs achieve comparable or higher task accuracy than manually written implementations (+4-19%), while decreasing the lines of code by 39.6% on average. The tradeoff to using Nightjar is that it may incur runtime overhead (0.4-4.3x runtime of manual implementations).

106. Incentives or Ontology? A Structural Rebuttal to OpenAI’s Hallucination Thesis

Authors: Richard Ackermann , Simeon Emanuilov
URL: https://arxiv.org/abs/2512.14801
Abstract:

OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.

107. Artificial Intelligence for the Assessment of Peritoneal Carcinosis during Diagnostic Laparoscopy for Advanced Ovarian Cancer

Authors: Riccardo Oliva , Farahdiba Zarin , Alice Zampolini Faustini , Armine Vardazaryan , Andrea Rosati , Vinkle Srivastav , Nunzia Del Villano , Jacques Marescaux , Giovanni Scambia , Pietro Mascagni , Nicolas Padoy , Anna Fagotti
URL: https://arxiv.org/abs/2512.14797
Abstract:

Advanced Ovarian Cancer (AOC) is often diagnosed at an advanced stage with peritoneal carcinosis (PC). Fagotti score (FS) assessment at diagnostic laparoscopy (DL) guides treatment planning by estimating surgical resectability, but its subjective and operator-dependent nature limits reproducibility and widespread use. Videos of patients undergoing DL with concomitant FS assessments at a referral center were retrospectively collected and divided into a development dataset, for data annotation, AI training and evaluation, and an independent test dataset, for internal validation. In the development dataset, FS-relevant frames were manually annotated for anatomical structures and PC. Deep learning models were trained to automatically identify FS-relevant frames, segment structures and PC, and predict video-level FS and indication to surgery (ItS). AI performance was evaluated using Dice score for segmentation, F1-scores for anatomical stations (AS) and ItS prediction, and root mean square error (RMSE) for final FS estimation. In the development dataset, the segmentation model trained on 7,311 frames, achieved Dice scores of 70$\pm$3% for anatomical structures and 56$\pm$3% for PC. Video-level AS classification achieved F1-scores of 74$\pm$3% and 73$\pm$4%, FS prediction showed normalized RMSE values of 1.39$\pm$0.18 and 1.15$\pm$0.08, and ItS reached F1-scores of 80$\pm$8% and 80$\pm$2% in the development (n=101) and independent test datasets (n=50), respectively. This is the first AI model to predict the feasibility of cytoreductive surgery providing automated FS estimation from DL videos. Its reproducible and reliable performance across datasets suggests that AI can support surgeons through standardized intraoperative tumor burden assessment and clinical decision-making in AOC.

108. Magnification-Aware Distillation (MAD): A Self-Supervised Framework for Unified Representation Learning in Gigapixel Whole-Slide Images

Authors: Mahmut S. Gokmen , Mitchell A. Klusty , Peter T. Nelson , Allison M. Neltner , Sen-Ching Samson Cheung , Thomas M. Pearce , David A Gutman , Brittany N. Dugger , Devavrat S. Bisht , Margaret E. Flanagan , V. K. Cody Bumgardner
URL: https://arxiv.org/abs/2512.14796
Abstract:

Whole-slide images (WSIs) contain tissue information distributed across multiple magnification levels, yet most self-supervised methods treat these scales as independent views. This separation prevents models from learning representations that remain stable when resolution changes, a key requirement for practical neuropathology workflows. This study introduces Magnification-Aware Distillation (MAD), a self-supervised strategy that links low-magnification context with spatially aligned high-magnification detail, enabling the model to learn how coarse tissue structure relates to fine cellular patterns. The resulting foundation model, MAD-NP, is trained entirely through this cross-scale correspondence without annotations. A linear classifier trained only on 10x embeddings maintains 96.7% of its performance when applied to unseen 40x tiles, demonstrating strong resolution-invariant representation learning. Segmentation outputs remain consistent across magnifications, preserving anatomical boundaries and minimizing noise. These results highlight the feasibility of scalable, magnification-robust WSI analysis using a unified embedding space

109. Improving VQA Reliability: A Dual-Assessment Approach with Self-Reflection and Cross-Model Verification

Authors: Xixian Wu , Yang Ou , Pengchao Tian , Zian Yang , Jielei Zhang , Peiyi Li , Longwen Gao
URL: https://arxiv.org/abs/2512.14770
Abstract:

Vision-language models (VLMs) have demonstrated significant potential in Visual Question Answering (VQA). However, the susceptibility of VLMs to hallucinations can lead to overconfident yet incorrect answers, severely undermining answer reliability. To address this, we propose Dual-Assessment for VLM Reliability (DAVR), a novel framework that integrates Self-Reflection and Cross-Model Verification for comprehensive uncertainty estimation. The DAVR framework features a dual-pathway architecture: one pathway leverages dual selector modules to assess response reliability by fusing VLM latent features with QA embeddings, while the other deploys external reference models for factual cross-checking to mitigate hallucinations. Evaluated in the Reliable VQA Challenge at ICCV-CLVL 2025, DAVR achieves a leading $\Phi_{100}$ score of 39.64 and a 100-AUC of 97.22, securing first place and demonstrating its effectiveness in enhancing the trustworthiness of VLM responses.

110. Privacy-Preserving Feature Valuation in Vertical Federated Learning Using Shapley-CMI and PSI Permutation

Authors: Unai Laskurain , Aitor Aguirre-Ortuzar , Urko Zurutuza
URL: https://arxiv.org/abs/2512.14767
Abstract:

Federated Learning (FL) is an emerging machine learning paradigm that enables multiple parties to collaboratively train models without sharing raw data, ensuring data privacy. In Vertical FL (VFL), where each party holds different features for the same users, a key challenge is to evaluate the feature contribution of each party before any model is trained, particularly in the early stages when no model exists. To address this, the Shapley-CMI method was recently proposed as a model-free, information-theoretic approach to feature valuation using Conditional Mutual Information (CMI). However, its original formulation did not provide a practical implementation capable of computing the required permutations and intersections securely. This paper presents a novel privacy-preserving implementation of Shapley-CMI for VFL. Our system introduces a private set intersection (PSI) server that performs all necessary feature permutations and computes encrypted intersection sizes across discretized and encrypted ID groups, without the need for raw data exchange. Each party then uses these intersection results to compute Shapley-CMI values, computing the marginal utility of their features. Initial experiments confirm the correctness and privacy of the proposed system, demonstrating its viability for secure and efficient feature contribution estimation in VFL. This approach ensures data confidentiality, scales across multiple parties, and enables fair data valuation without requiring the sharing of raw data or training models.

111. Guided Discrete Diffusion for Constraint Satisfaction Problems

Authors: Justin Jung
URL: https://arxiv.org/abs/2512.14765
Abstract:

We propose discrete diffusion guidance for constraint satisfaction problems (CSPs) and demonstrate its ability to solve Sudoku puzzles without supervision.

112. Scaling Causal Mediation for Complex Systems: A Framework for Root Cause Analysis

Authors: Alessandro Casadei , Sreyoshi Bhaduri , Rohit Malshe , Pavan Mullapudi , Raj Ratan , Ankush Pole , Arkajit Rakshit
URL: https://arxiv.org/abs/2512.14764
Abstract:

Modern operational systems ranging from logistics and cloud infrastructure to industrial IoT, are governed by complex, interdependent processes. Understanding how interventions propagate through such systems requires causal inference methods that go beyond direct effects to quantify mediated pathways. Traditional mediation analysis, while effective in simple settings, fails to scale to the high-dimensional directed acyclic graphs (DAGs) encountered in practice, particularly when multiple treatments and mediators interact. In this paper, we propose a scalable mediation analysis framework tailored for large causal DAGs involving multiple treatments and mediators. Our approach systematically decomposes total effects into interpretable direct and indirect components. We demonstrate its practical utility through applied case studies in fulfillment center logistics, where complex dependencies and non-controllable factors often obscure root causes.

113. Workflows vs Agents for Code Translation

Authors: Henry Gray , Tom Yotam , Octavian Udrea
URL: https://arxiv.org/abs/2512.14762
Abstract:

Translating algorithms from high-level languages like MATLAB to hardware description languages (HDLs) is a resource-intensive but necessary step for deployment on FPGAs and ASICs. While large language models (LLMs) offer a path to automation, their limited training on HDL code makes end-to-end transpilation brittle and prone to syntax errors. We compare two LLM-driven methods for syntax repair in a MATLAB-to-HDL pipeline: a structured, expert-designed flow that follows a fixed sequence of operations, and a more autonomous agentic approach that uses the Model Context Protocol (MCP) \cite{anthropic2024mcp} to dynamically select its own tools. We study 42 MATLAB signal-processing functions and isolate the syntax-repair stage. Across three model scales, the agentic approach is more effective at resolving initial syntax errors, unblocking a greater number of candidates to proceed through the pipeline. This upstream improvement yields measurable downstream improvements, most notably on mid-sized models, where it increases the simulation reach rate by over 20 percentage points. We hypothesize the gains come from short prompts, aggressive context management, and conditional tool use. Conditional retrieval helps at 8B and 30B; at 235B final-success gains are small and a naive RAG variant attains the highest final success. Our findings suggest that these agentic frameworks, when properly designed, are most effective at compensating for the capacity limits of small and mid-sized models.

114. CAPE: Capability Achievement via Policy Execution

Authors: David Ball
URL: https://arxiv.org/abs/2512.14761
Abstract:

Modern AI systems lack a way to express and enforce requirements. Pre-training produces intelligence, and post-training optimizes preferences, but neither guarantees that models reliably satisfy explicit, context-dependent constraints. This missing abstraction explains why highly intelligent models routinely fail in deployment despite strong benchmark performance. We introduce Capability Engineering, the systematic practice of converting requirements into executable specifications and training models to satisfy them by default. We operationalize this practice through CAPE (Capability Achievement via Policy Execution), a protocol implementing a Specify -> Verify -> Correct -> Train loop. CAPE is grounded in two empirical findings: (1) contextual objectivity, where properties appearing subjective become objective once context is fixed (inter-annotator agreement rises from kappa = 0.42 to kappa = 0.98), and (2) verification-fidelity scaling, where verification accuracy improves with model scale (r = 0.94), unlike preference agreement which plateaus at 30 to 50 percent disagreement regardless of compute. Across 109,500 examples in six domains, CAPE reduces violation rates by 81 percent relative to DPO (standard deviation less than 0.3 percent). By replacing per-example annotation with reusable specifications, CAPE reduces costs by 5 to 20 times and shortens timelines from months to weeks. We release the CAPE protocol, PredicateGraph schema, CPL specification language, and policy packs under Apache 2.0. We also launch CapabilityBench, a public registry of model evaluations against community-contributed policies, shifting evaluation from intelligence benchmarks toward capability measurement.

115. Revisiting the Reliability of Language Models in Instruction-Following

Authors: Jianshuo Dong , Yutong Zhang , Yan Liu , Zhenyu Zhong , Tao Wei , Chao Zhang , Han Qiu
URL: https://arxiv.org/abs/2512.14754
Abstract:

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability – their performance can drop by up to 61.8% with nuanced prompt modifications. What’s more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: this https URL .

116. CODE ACROSTIC: Robust Watermarking for Code Generation

Authors: Li Lin , Siyuan Xin , Yang Cao , Xiaochun Cao
URL: https://arxiv.org/abs/2512.14753
Abstract:

Watermarking large language models (LLMs) is vital for preventing their misuse, including the fabrication of fake news, plagiarism, and spam. It is especially important to watermark LLM-generated code, as it often contains intellectual this http URL , we found that existing methods for watermarking LLM-generated code fail to address comment removal this http URL such cases, an attacker can simply remove the comments from the generated code without affecting its functionality, significantly reducing the effectiveness of current code-watermarking this http URL the other hand, injecting a watermark into code is challenging because, as previous works have noted, most code represents a low-entropy scenario compared to natural language. Our approach to addressing this issue involves leveraging prior knowledge to distinguish between low-entropy and high-entropy parts of the code, as indicated by a Cue List of this http URL then inject the watermark guided by this Cue List, achieving higher detectability and usability than existing this http URL evaluated our proposed method on HumanEvaland compared our method with three state-of-the-art code watermarking techniques. The results demonstrate the effectiveness of our approach.

117. Cyberswarm: a novel swarm intelligence algorithm inspired by cyber community dynamics

Authors: Abdelsadeq Elfergany , Ammar Adl , Mohammed Kayed
URL: https://arxiv.org/abs/2512.14752
Abstract:

Recommendation systems face challenges in dynamically adapting to evolving user preferences and interactions within complex social networks. Traditional approaches often fail to account for the intricate interactions within cyber-social systems and lack the flexibility to generalize across diverse domains, highlighting the need for more adaptive and versatile solutions. In this work, we introduce a general-purpose swarm intelligence algorithm for recommendation systems, designed to adapt seamlessly to varying applications. It was inspired by social psychology principles. The framework models user preferences and community influences within a dynamic hypergraph structure. It leverages centrality-based feature extraction and Node2Vec embeddings. Preference evolution is guided by message-passing mechanisms and hierarchical graph modeling, enabling real-time adaptation to changing behaviors. Experimental evaluations demonstrated the algorithm’s superior performance in various recommendation tasks, including social networks and content discovery. Key metrics such as Hit Rate (HR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG) consistently outperformed baseline methods across multiple datasets. The model’s adaptability to dynamic environments allowed for contextually relevant and precise recommendations. The proposed algorithm represents an advancement in recommendation systems by bridging individual preferences and community influences. Its general-purpose design enables applications in diverse domains, including social graphs, personalized learning, and medical graphs. This work highlights the potential of integrating swarm intelligence with network dynamics to address complex optimization challenges in recommendation systems.

118. One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

Authors: Yixin Tan , Zhe Yu , Jun Sakuma
URL: https://arxiv.org/abs/2512.14751
Abstract:

Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP’s strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.

Authors: Ying Cui , Dongzhe Zheng , Ke Yu , Xiyin Zheng , Xiaorui Wang , Xinxiang Li , Yan Gu , Lin Fu , Xinyi Chen , Wenjie Mei , Xin-Gui Peng
URL: https://arxiv.org/abs/2512.14750
Abstract:

Clear cell renal cell carcinoma (ccRCC) exhibits extensive intratumoral heterogeneity on multiple biological scales, contributing to variable clinical outcomes and limiting the effectiveness of conventional TNM staging, which highlights the urgent need for multiscale integrative analytic frameworks. The lipid-deficient de-clear cell differentiated (DCCD) ccRCC subtype, defined by multi-omics analyses, is associated with adverse outcomes even in early-stage disease. Here, we establish a hierarchical cross-scale framework for the preoperative identification of DCCD-ccRCC. At the highest layer, cross-modal mapping transferred molecular signatures to histological and CT phenotypes, establishing a molecular-to-pathology-to-radiology supervisory bridge. Within this framework, each modality-specific model is designed to mirror the inherent hierarchical structure of tumor biology. PathoDCCD captured multi-scale microscopic features, from cellular morphology and tissue architecture to meso-regional organization. RadioDCCD integrated complementary macroscopic information by combining whole-tumor and its habitat-subregions radiomics with a 2D maximal-section heterogeneity metric. These nested models enabled integrated molecular subtype prediction and clinical risk stratification. Across five cohorts totaling 1,659 patients, PathoDCCD reliably recapitulated molecular subtypes, while RadioDCCD provided reliable preoperative prediction. The consistent predictions identified patients with the poorest clinical outcomes. This cross-scale paradigm unifies molecular biology, computational pathology, and quantitative radiology into a biologically grounded strategy for preoperative noninvasive molecular phenotyping of ccRCC.

120. Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Authors: Edward Lue Chee Lip , Anthony Channg , Diana Kim , Aaron Sandoval , Kevin Zhu
URL: https://arxiv.org/abs/2512.14745
Abstract:

As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks $\unicode{x2013}$ but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors distinguishing malicious from honest decompositions perform poorly (AUROC 0.52) compared to monitors evaluating complete Python solutions (AUROC 0.96). Furthermore, Factor($D$,$U$), which uses a trusted decomposer and monitors concrete child solutions, achieves excellent discrimination (AUROC 0.96) and strong safety (1.2% ASR), demonstrating that implementation-context monitoring succeeds where decomposition-only monitoring fails.

121. VERAFI: Verified Agentic Financial Intelligence through Neurosymbolic Policy Generation

Authors: Adewale Akinfaderin , Shreyas Subramanian
URL: https://arxiv.org/abs/2512.14744
Abstract:

Financial AI systems suffer from a critical blind spot: while Retrieval-Augmented Generation (RAG) excels at finding relevant documents, language models still generate calculation errors and regulatory violations during reasoning, even with perfect retrieval. This paper introduces VERAFI (Verified Agentic Financial Intelligence), an agentic framework with neurosymbolic policy generation for verified financial intelligence. VERAFI combines state-of-the-art dense retrieval and cross-encoder reranking with financial tool-enabled agents and automated reasoning policies covering GAAP compliance, SEC requirements, and mathematical validation. Our comprehensive evaluation on FinanceBench demonstrates remarkable improvements: while traditional dense retrieval with reranking achieves only 52.4\% factual correctness, VERAFI’s integrated approach reaches 94.7\%, an 81\% relative improvement. The neurosymbolic policy layer alone contributes a 4.3 percentage point gain over pure agentic processing, specifically targeting persistent mathematical and logical errors. By integrating financial domain expertise directly into the reasoning process, VERAFI offers a practical pathway toward trustworthy financial AI that meets the stringent accuracy demands of regulatory compliance, investment decisions, and risk management.

122. Quantum-Augmented AI/ML for O-RAN: Hierarchical Threat Detection with Synergistic Intelligence and Interpretability (Technical Report)

Authors: Tan Le , Van Le , Sachin Shetty
URL: https://arxiv.org/abs/2512.14742
Abstract:

Open Radio Access Networks (O-RAN) enhance modularity and telemetry granularity but also widen the cybersecurity attack surface across disaggregated control, user and management planes. We propose a hierarchical defense framework with three coordinated layers-anomaly detection, intrusion confirmation, and multiattack classification-each aligned with O-RAN’s telemetry stack. Our approach integrates hybrid quantum computing and machine learning, leveraging amplitude- and entanglement-based feature encodings with deep and ensemble classifiers. We conduct extensive benchmarking across synthetic and real-world telemetry, evaluating encoding depth, architectural variants, and diagnostic fidelity. The framework consistently achieves near-perfect accuracy, high recall, and strong class separability. Multi-faceted evaluation across decision boundaries, probabilistic margins, and latent space geometry confirms its interpretability, robustness, and readiness for slice-aware diagnostics and scalable deployment in near-RT and non-RT RIC domains.

123. Persistent Backdoor Attacks under Continual Fine-Tuning of LLMs

Authors: Jing Cui , Yufei Han , Jianbin Jiao , Junge Zhang
URL: https://arxiv.org/abs/2512.14741
Abstract:

Backdoor attacks embed malicious behaviors into Large Language Models (LLMs), enabling adversaries to trigger harmful outputs or bypass safety controls. However, the persistence of the implanted backdoors under user-driven post-deployment continual fine-tuning has been rarely examined. Most prior works evaluate the effectiveness and generalization of implanted backdoors only at releasing and empirical evidence shows that naively injected backdoor persistence degrades after updates. In this work, we study whether and how implanted backdoors persist through a multi-stage post-deployment fine-tuning. We propose P-Trojan, a trigger-based attack algorithm that explicitly optimizes for backdoor persistence across repeated updates. By aligning poisoned gradients with those of clean tasks on token embeddings, the implanted backdoor mapping is less likely to be suppressed or forgotten during subsequent updates. Theoretical analysis shows the feasibility of such persistent backdoor attacks after continual fine-tuning. And experiments conducted on the Qwen2.5 and LLaMA3 families of LLMs, as well as diverse task sequences, demonstrate that P-Trojan achieves over 99% persistence while preserving clean-task accuracy. Our findings highlight the need for persistence-aware evaluation and stronger defenses in realistic model adaptation pipelines.

124. Zero-Knowledge Audit for Internet of Agents: Privacy-Preserving Communication Verification with Model Context Protocol

Authors: Guanlin Jing , Huayi Qi
URL: https://arxiv.org/abs/2512.14737
Abstract:

Existing agent communication frameworks face critical limitations in providing verifiable audit trails without compromising the privacy and confidentiality of agent interactions. The protection of agent communication privacy while ensuring auditability emerges as a fundamental challenge for applications requiring accurate billing, compliance verification, and accountability in regulated environments. We introduce a framework for auditing agent communications that keeps messages private while still checking they follow expected rules. It pairs zero-knowledge proofs with the existing Model Context Protocol (MCP) so messages can be verified without revealing their contents. The approach runs in lightweight networks, stays compatible with standard MCP exchanges, and adds asynchronous audit verification to confirm format and general message types without exposing specifics. The framework enables mutual audits between agents: one side can check communication content and quality while the other verifies usage metrics, all without revealing sensitive information. We formalize security goals and show that zk-MCP provides data authenticity and communication privacy, achieving efficient verification with negligible latency overhead. We fully implement the framework, including Circom-based zero-knowledge proof generation and an audit protocol integrated with MCP’s bidirectional channel, and, to our knowledge, this is the first privacy-preserving audit system for agent communications that offers verifiable mutual auditing without exposing message content or compromising agent privacy.

125. PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents

Authors: Yuqun Zhang , Yuxuan Zhao , Sijia Chen
URL: https://arxiv.org/abs/2512.14735
Abstract:

This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: this https URL .

126. INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT

Authors: Idan Tankel , Nir Mazor , Rafi Brada , Christina LeBedis , Guy ben-Yosef
URL: https://arxiv.org/abs/2512.14732
Abstract:

Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections, via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.

127. Semantic Geometry for policy-constrained interpretation

Authors: Nikit Phadke
URL: https://arxiv.org/abs/2512.14731
Abstract:

We present a geometric framework for policy-constrained semantic interpretation that provably prevents hallucinated commitments in high-stakes domains. Semantic meaning is represented as direction on a unit sphere, evidence is modeled as sets of witness vectors, and admissible interpretations correspond to spherical convex regions. Policy constraints are introduced as explicit priors defined over the same manifold, separated from evidence geometry. Interpretation reduces to constrained optimization over admissible regions, with refusal emerging as a topologically necessary outcome under contradiction or policy exclusion. We connect this framework to information theory, Bayesian inference, and sheaf-theoretic semantics, proving that our complexity bounds are information-theoretically optimal. Empirical validation on large scale regulated financial data demonstrates zero hallucinated approvals across multiple policy regimes-the first such result at scale.

128. A Critical Perspective on Finite Sample Conformal Prediction Theory in Medical Applications

Authors: Klaus-Rudolf Kladny , Bernhard Schölkopf , Lisa Koch , Christian F. Baumgartner , Michael Muehlebach
URL: https://arxiv.org/abs/2512.14727
Abstract:

Machine learning (ML) is transforming healthcare, but safe clinical decisions demand reliable uncertainty estimates that standard ML models fail to provide. Conformal prediction (CP) is a popular tool that allows users to turn heuristic uncertainty estimates into uncertainty estimates with statistical guarantees. CP works by converting predictions of a ML model, together with a calibration sample, into prediction sets that are guaranteed to contain the true label with any desired probability. An often cited advantage is that CP theory holds for calibration samples of arbitrary size, suggesting that uncertainty estimates with practically meaningful statistical guarantees can be achieved even if only small calibration sets are available. We question this promise by showing that, although the statistical guarantees hold for calibration sets of arbitrary size, the practical utility of these guarantees does highly depend on the size of the calibration set. This observation is relevant in medical domains because data is often scarce and obtaining large calibration sets is therefore infeasible. We corroborate our critique in an empirical demonstration on a medical image classification task.

129. Quantum Decision Transformers (QDT): Synergistic Entanglement and Interference for Offline Reinforcement Learning

Authors: Abraham Itzhak Weinberg
URL: https://arxiv.org/abs/2512.14726
Abstract:

Offline reinforcement learning enables policy learning from pre-collected datasets without environment interaction, but existing Decision Transformer (DT) architectures struggle with long-horizon credit assignment and complex state-action dependencies. We introduce the Quantum Decision Transformer (QDT), a novel architecture incorporating quantum-inspired computational mechanisms to address these challenges. Our approach integrates two core components: Quantum-Inspired Attention with entanglement operations that capture non-local feature correlations, and Quantum Feedforward Networks with multi-path processing and learnable interference for adaptive computation. Through comprehensive experiments on continuous control tasks, we demonstrate over 2,000\% performance improvement compared to standard DTs, with superior generalization across varying data qualities. Critically, our ablation studies reveal strong synergistic effects between quantum-inspired components: neither alone achieves competitive performance, yet their combination produces dramatic improvements far exceeding individual contributions. This synergy demonstrates that effective quantum-inspired architecture design requires holistic co-design of interdependent mechanisms rather than modular component adoption. Our analysis identifies three key computational advantages: enhanced credit assignment through non-local correlations, implicit ensemble behavior via parallel processing, and adaptive resource allocation through learnable interference. These findings establish quantum-inspired design principles as a promising direction for advancing transformer architectures in sequential decision-making, with implications extending beyond reinforcement learning to neural architecture design more broadly.

130. Generative Urban Flow Modeling: From Geometry to Airflow with Graph Diffusion

Authors: Francisco Giral , Álvaro Manzano , Ignacio Gómez , Petros Koumoutsakos , Soledad Le Clainche
URL: https://arxiv.org/abs/2512.14725
Abstract:

Urban wind flow modeling and simulation play an important role in air quality assessment and sustainable city planning. A key challenge for modeling and simulation is handling the complex geometries of the urban landscape. Low order models are limited in capturing the effects of geometry, while high-fidelity Computational Fluid Dynamics (CFD) simulations are prohibitively expensive, especially across multiple geometries or wind conditions. Here, we propose a generative diffusion framework for synthesizing steady-state urban wind fields over unstructured meshes that requires only geometry information. The framework combines a hierarchical graph neural network with score-based diffusion modeling to generate accurate and diverse velocity fields without requiring temporal rollouts or dense measurements. Trained across multiple mesh slices and wind angles, the model generalizes to unseen geometries, recovers key flow structures such as wakes and recirculation zones, and offers uncertainty-aware predictions. Ablation studies confirm robustness to mesh variation and performance under different inference regimes. This work develops is the first step towards foundation models for the built environment that can help urban planners rapidly evaluate design decisions under densification and climate uncertainty.

131. HATSolver: Learning Groebner Bases with Hierarchical Attention Transformers

Authors: Mohamed Malhou , Ludovic Perret , Kristin Lauter
URL: https://arxiv.org/abs/2512.14722
Abstract:

At NeurIPS 2024, Kera et al. introduced the use of transformers for computing Groebner bases, a central object in computer algebra with numerous practical applications. In this paper, we improve this approach by applying Hierarchical Attention Transformers (HATs) to solve systems of multivariate polynomial equations via Groebner bases computation. The HAT architecture incorporates a tree-structured inductive bias that enables the modeling of hierarchical relationships present in the data and thus achieves significant computational savings compared to conventional flat attention models. We generalize to arbitrary depths and include a detailed computational cost analysis. Combined with curriculum learning, our method solves instances that are much larger than those in Kera et al. (2024 Learning to compute Groebner bases)

Authors: Dizhan Xue , Jing Cui , Shengsheng Qian , Chuanrui Hu , Changsheng Xu
URL: https://arxiv.org/abs/2512.14720
Abstract:

Intelligent agents powered by large language models (LLMs) have recently demonstrated impressive capabilities and gained increasing popularity on social media platforms. While LLM agents are reshaping the ecology of social media, there exists a current gap in conducting a comprehensive evaluation of their ability to comprehend media content, understand user behaviors, and make intricate decisions. To address this challenge, we introduce SoMe, a pioneering benchmark designed to evaluate social media agents equipped with various agent tools for accessing and analyzing social media data. SoMe comprises a diverse collection of 8 social media agent tasks, 9,164,284 posts, 6,591 user profiles, and 25,686 reports from various social media platforms and external websites, with 17,869 meticulously annotated task queries. Compared with the existing datasets and benchmarks for social media tasks, SoMe is the first to provide a versatile and realistic platform for LLM-based social media agents to handle diverse social media tasks. By extensive quantitative and qualitative analysis, we provide the first overview insight into the performance of mainstream agentic LLMs in realistic social media environments and identify several limitations. Our evaluation reveals that both the current closed-source and open-source LLMs cannot handle social media agent tasks satisfactorily. SoMe provides a challenging yet meaningful testbed for future social media agents. Our code and data are available at this https URL

133. Hybrid Attribution Priors for Explainable and Robust Model Training

Authors: Zhuoran Zhang , Feng Zhang , Shangyuan Li , Yang Shi , Yuanxing Zhang , Wei Chen , Tengjiao Wang , Kam-Fai Wong
URL: https://arxiv.org/abs/2512.14719
Abstract:

Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we propose Class-Aware Attribution Prior (CAP), a novel attribution prior extraction framework that guides language models toward capturing fine-grained class distinctions and producing more salient, discriminative attribution priors. Building on this idea, we further introduce CAP Hybrid, which combines priors from CAP with those from existing attribution techniques to form a more comprehensive and balanced supervisory signal. By aligning a model’s self-attribution with these enriched priors, our approach encourages the learning of diverse, decision-relevant features. Extensive experiments in full-data, few-shot, and adversarial scenarios demonstrate that our method consistently enhances both interpretability and robustness.

134. SEED: Spectral Entropy-Guided Evaluation of SpatialTemporal Dependencies for Multivariate Time Series Forecasting

Authors: Feng Xiong , Zongxia Xie , Yanru Sun , Haoyu Wang , Jianhong Lin
URL: https://arxiv.org/abs/2512.14718
Abstract:

Effective multivariate time series forecasting often benefits from accurately modeling complex inter-variable dependencies. However, existing attention- or graph-based methods face three key issues: (a) strong temporal self-dependencies are often disrupted by irrelevant variables; (b) softmax normalization ignores and reverses negative correlations; (c) variables struggle to perceive their temporal positions. To address these, we propose \textbf{SEED}, a Spectral Entropy-guided Evaluation framework for spatial-temporal Dependency modeling. SEED introduces a Dependency Evaluator, a key innovation that leverages spectral entropy to dynamically provide a preliminary evaluation of the spatial and temporal dependencies of each variable, enabling the model to adaptively balance Channel Independence (CI) and Channel Dependence (CD) strategies. To account for temporal regularities originating from the influence of other variables rather than intrinsic dynamics, we propose Spectral Entropy-based Fuser to further refine the evaluated dependency weights, effectively separating this part. Moreover, to preserve negative correlations, we introduce a Signed Graph Constructor that enables signed edge weights, overcoming the limitations of softmax. Finally, to help variables perceive their temporal positions and thereby construct more comprehensive spatial features, we introduce the Context Spatial Extractor, which leverages local contextual windows to extract spatial features. Extensive experiments on 12 real-world datasets from various application domains demonstrate that SEED achieves state-of-the-art performance, validating its effectiveness and generality.

135. How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection

Authors: Zafaryab Haider , Md Hafizur Rahman , Shane Moeykens , Vijay Devabhaktuni , Prabuddha Chakraborty
URL: https://arxiv.org/abs/2512.14715
Abstract:

Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the model’s own gradients can predict which bits, if perturbed, will most strongly influence meaning while leaving syntax and fluency intact. We design a differentiable fault analysis framework, BLADE (Bit-level Fault Analysis via Differentiable Estimation), that uses gradient-based sensitivity estimation to locate semantically critical bits and then refines their selection through a caption-level semantic-fluency objective. Our goal is not merely to corrupt captions, but to understand how meaning itself is encoded, distributed, and alterable at the bit level, revealing that even imperceptible low-level changes can steer the high-level semantics of generative vision-language models. It also opens pathways for robustness testing, adversarial defense, and explainable AI, by exposing how structured bit-level faults can reshape a model’s semantic output.

136. Improving Underwater Acoustic Classification Through Learnable Gabor Filter Convolution and Attention Mechanisms

Authors: Lucas Cesar Ferreira Domingos , Russell Brinkworth , Paulo Eduardo Santos , Karl Sammut
URL: https://arxiv.org/abs/2512.14714
Abstract:

Remotely detecting and classifying underwater acoustic targets is critical for environmental monitoring and defence. However, the complex nature of ship-radiated and environmental underwater noise poses significant challenges to accurate signal processing. While recent advancements in machine learning have improved classification accuracy, issues such as limited dataset availability and a lack of standardised experimentation hinder generalisation and robustness. This paper introduces GSE ResNeXt, a deep learning architecture integrating learnable Gabor convolutional layers with a ResNeXt backbone enhanced by squeeze-and-excitation attention mechanisms. The Gabor filters serve as two-dimensional adaptive band-pass filters, extending the feature channel representation. Its combination with channel attention improves training stability and convergence while enhancing the model’s ability to extract discriminative features. The model is evaluated on three classification tasks of increasing complexity. In particular, the impact of temporal differences between the training and testing data is explored, revealing that the distance between the vessel and sensor significantly affects performance. Results show that, GSE ResNeXt consistently outperforms baseline models like Xception, ResNet, and MobileNetV2, in terms of classification performance. Regarding stability and convergence, the addition of Gabor convolutions in the initial layers of the model represents a 28% reduction in training time. These results emphasise the importance of signal processing strategies in improving the reliability and generalisation of models under different environmental conditions, especially in data-limited underwater acoustic classification scenarios. Future developments should focus on mitigating the impact of environmental factors on input signals.

137. SepsisSuite: Beyond Risk Stratification – A Comparative Analysis of Deep Fusion vs. Expert Stacking for Prescriptive Sepsis AI

Authors: Ryan Cartularo
URL: https://arxiv.org/abs/2512.14712
Abstract:

Sepsis accounts for nearly 20% of global ICU admissions, yet conventional prediction models often fail to effectively integrate heterogeneous data streams, remaining either siloed by modality or reliant on brittle early fusion. In this work, we present a rigorous architectural comparison between End-to-End Deep Fusion and Context-Aware Stacking for sepsis tasks. We initially hypothesized that a novel Quad-Modal Hierarchical Gated Attention Network – termed SepsisFusionFormer – would resolve complex cross-modal interactions between vitals, text, and imaging. However, experiments on MIMIC-IV revealed that SepsisFusionFormer suffered from “attention starvation” in the small antibiotic cohort ($N \approx 2,100$), resulting in overfitting (AUC 0.66). This counterintuitive result informed the design of SepsisLateFusion, a “leaner” Context-Aware Mixture-of-Experts (MoE) architecture. By treating modalities as orthogonal experts – the “Historian” (Static), the “Monitor” (Temporal), and the “Reader” (NLP) – and dynamically gating them via a CatBoost meta-learner, we achieved State-of-the-Art (SOTA) performance: 0.915 AUC for prediction 4 hours prior to clinical onset. By calibrating the decision threshold for clinical safety, we reduced missed cases by 48% relative to the default operating point, thus opening a true preventative window for timely intervention over reactive alerts. Furthermore, for the novel prescriptive task of multi-class antibiotic selection, we demonstrate that a Quad-Modal Ensemble achieved the highest performance (0.72 AUC). These models are integrated into SepsisSuite, a deployment-ready Python framework for clinical decision support. SepsisSuite is available for free at: this https URL

Authors: Changan Liu , Xiaotian Zhou , Ahad N. Zehmakan , Zhongzhi Zhang
URL: https://arxiv.org/abs/2512.14711
Abstract:

The advent of online social networks has facilitated fast and wide spread of information. However, some users, especially members of minority groups, may be less likely to receive information spreading on the network, due to their disadvantaged network position. We study the optimization problem of adding new connections to a network to enhance fairness in information access among different demographic groups. We provide a concrete formulation of this problem where information access is measured in terms of resistance distance, {offering a new perspective that emphasizes global network structure and multi-path connectivity.} The problem is shown to be NP-hard. We propose a simple greedy algorithm which turns out to output accurate solutions, but its run time is cubic, which makes it undesirable for large networks. As our main technical contribution, we reduce its time complexity to linear, leveraging several novel approximation techniques. In addition to our theoretical findings, we also conduct an extensive set of experiments using both real-world and synthetic datasets. We demonstrate that our linear-time algorithm can produce accurate solutions for networks with millions of nodes.

139. Autonomous Source Knowledge Selection in Multi-Domain Adaptation

Authors: Keqiuyin Li , Jie Lu , Hua Zuo , Guangquan Zhang
URL: https://arxiv.org/abs/2512.14710
Abstract:

Unsupervised multi-domain adaptation plays a key role in transfer learning by leveraging acquired rich source information from multiple source domains to solve target task from an unlabeled target domain. However, multiple source domains often contain much redundant or unrelated information which can harm transfer performance, especially when in massive-source domain settings. It is urgent to develop effective strategies for identifying and selecting the most transferable knowledge from massive source domains to address the target task. In this paper, we propose a multi-domain adaptation method named \underline{\textit{Auto}}nomous Source Knowledge \underline{\textit{S}}election (AutoS) to autonomosly select source training samples and models, enabling the prediction of target task using more relevant and transferable source information. The proposed method employs a density-driven selection strategy to choose source samples during training and to determine which source models should contribute to target prediction. Simulteneously, a pseudo-label enhancement module built on a pre-trained multimodal modal is employed to mitigate target label noise and improve self-supervision. Experiments on real-world datasets indicate the superiority of the proposed method.

140. LLM as a Neural Architect: Controlled Generation of Image Captioning Models Under Strict API Contracts

Authors: Krunal Jesani , Dmitry Ignatov , Radu Timofte
URL: https://arxiv.org/abs/2512.14706
Abstract:

Neural architecture search (NAS) traditionally requires significant human expertise or automated trial-and-error to design deep learning models. We present NN-Caption, an LLM-guided neural architecture search pipeline that generates runnable image-captioning models by composing CNN encoders from LEMUR’s classification backbones with sequence decoders (LSTM/GRU/Transformer) under a strict Net API. Using DeepSeek-R1-0528-Qwen3-8B as the primary generator, we present the prompt template and examples of generated architectures. We evaluate on MS COCO with BLEU-4. The LLM generated dozens of captioning models, with over half successfully trained and producing meaningful captions. We analyse the outcomes of using different numbers of input model snippets (5 vs. 10) in the prompt, finding a slight drop in success rate when providing more candidate components. We also report training dynamics (caption accuracy vs. epochs) and the highest BLEU-4 attained. Our results highlight the promise of LLM-guided NAS: the LLM not only proposes architectures but also suggests hyperparameters and training practices. We identify the challenges encountered (e.g., code hallucinations or API compliance issues) and detail how prompt rules and iterative code fixes addressed them. This work presents a pipeline that integrates prompt-based code generation with automatic evaluation, and adds dozens of novel captioning models to the open LEMUR dataset to facilitate reproducible benchmarking and downstream AutoML research.

141. Tourists Profiling by Interest Analysis

Authors: Sonia Djebali , Quentin Gabot , Guillaume Guerard
URL: https://arxiv.org/abs/2512.14704
Abstract:

With the recent digital revolution, analyzing of tourists’ behaviors and research fields associated with it have changed profoundly. It is now easier to examine behaviors of tourists using digital traces they leave during their travels. The studies conducted on diverse aspects of tourism focus on quantitative aspects of digital traces to reach its conclusions. In this paper, we suggest a study focused on both qualitative and quantitative aspect of digital traces to understand the dynamics governing tourist behavior, especially those concerning attractions networks.

142. Algorithmic Criminal Liability in Greenwashing: Comparing India, United States, and European Union

Authors: Sahibpreet Singh , Manjit Singh
URL: https://arxiv.org/abs/2512.12837
Abstract:

AI-powered greenwashing has emerged as an insidious challenge within corporate sustainability governance, exacerbating the opacity of environmental disclosures and subverting regulatory oversight. This study conducts a comparative legal analysis of criminal liability for AI-mediated greenwashing across India, the US, and the EU, exposing doctrinal lacunae in attributing culpability when deceptive claims originate from algorithmic systems. Existing statutes exhibit anthropocentric biases by predicating liability on demonstrable human intent, rendering them ill-equipped to address algorithmic deception. The research identifies a critical gap in jurisprudential adaptation, as prevailing fraud statutes remain antiquated vis-à-vis AI-generated misrepresentation. Utilising a doctrinal legal methodology, this study systematically dissects judicial precedents and statutory instruments, yielding results regarding the potential expansion of corporate criminal liability. Findings underscore the viability of strict liability models, recalibrated governance frameworks for AI accountability, and algorithmic due diligence mandates under ESG regimes. Comparative insights reveal jurisdictional disparities, with the EU Corporate Sustainability Due Diligence Directive (CSDDD) offering a potential transnational model. This study contributes to AI ethics and environmental jurisprudence by advocating for a hybrid liability framework integrating algorithmic risk assessment with legal personhood constructs, ensuring algorithmic opacity does not preclude liability enforcement.

전체 AI 논문 - 2025-12-19

1. Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants

2. Artism: AI-Driven Dual-Engine System for Art Generation and Critique

3. Explaining the Reasoning of Large Language Models Using Attribution Graphs

4. Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning

5. A Decision-Theoretic Approach for Managing Misalignment

6. Evaluating Large Language Models in Scientific Discovery

7. Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision

8. Intent-Driven UAM Rescheduling

9. Outer-Learning Framework for Playing Multi-Player Trick-Taking Card Games: A Case Study in Skat

10. Bilateral Spatial Reasoning about Street Networks: Graph-based RAG with Qualitative Spatial Representations

11. SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

12. ChatGPT and Gemini participated in the Korean College Scholastic Ability Test – Earth Science I

13. Graph Contextual Reinforcement Learning for Efficient Directed Controller Synthesis

14. CangLing-KnowFlow: A Unified Knowledge-and-Flow-fused Agent for Comprehensive Remote Sensing Applications

15. A Clustering-Based Variable Ordering Framework for Relaxed Decision Diagrams for Maximum Weighted Independent Set Problem

16. Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models

17. Agentic AI for Integrated Sensing and Communication: Analysis, Framework, and Case Study

18. LADY: Linear Attention for Autonomous Driving Efficiency without Transformers

19. Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation

20. AgroAskAI: A Multi-Agentic AI Framework for Supporting Smallholder Farmers’ Enquiries Globally

21. IaC Generation with LLMs: An Error Taxonomy and A Study on Configuration Knowledge Injection

22. GR-Agent: Adaptive Graph Reasoning Agent under Incomplete Knowledge

23. Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning

24. Spatia: Video Generation with Updatable Spatial Memory

25. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

26. BashArena: A Control Setting for Highly Privileged AI Agents

27. Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning

28. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

29. PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning

30. VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

31. IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning

32. How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness

33. Evaluating Metrics for Safety with LLM-as-Judges

34. How Smoothing is N-simplicial Attention?

35. A Conditioned UNet for Music Source Separation

36. BERT and CNN integrated Neural Collaborative Filtering for Recommender Systems

37. Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection

38. Soft Geometric Inductive Bias for Object Centric Dynamics

39. How Do Semantically Equivalent Code Transformations Impact Membership Inference on LLMs for Code?

40. On Assessing the Relevance of Code Reviews Authored by Generative Models

41. Double Horizon Model-Based Policy Optimization

42. FM-EAC: Feature Model-based Enhanced Actor-Critic for Multi-Task Control in Dynamic Environments

43. SMART: Semantic Matching Contrastive Learning for Partially View-Aligned Clustering

44. Emotion Recognition in Signers

45. Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

46. Adversarial versification in portuguese as a jailbreak operator in LLMs

47. Empirical Investigation of the Impact of Phase Information on Fault Diagnosis of Rotating Machinery

48. Exploring User Acceptance and Concerns toward LLM-powered Conversational Agents in Immersive Extended Reality

49. Vision-based module for accurately reading linear scales in a laboratory

50. Managing Ambiguity: A Proof of Concept of Human-AI Symbiotic Sense-making based on Quantum-Inspired Cognitive Mechanism of Rogue Variable Detection

51. Automated Motion Artifact Check for MRI (AutoMAC-MRI): An Interpretable Framework for Motion Artifact Detection and Severity Assessment

52. Evaluating LLMs for Zeolite Synthesis Event Extraction (ZSEE): A Systematic Analysis of Prompting Strategies

53. Graph Pattern-based Association Rules Evaluated Under No-repeated-anything Semantics in the Graph Transactional Setting

54. Quantum Machine Learning for Cybersecurity: A Taxonomy and Future Directions

55. Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning

56. VLA-AN: An Efficient and Onboard Vision-Language-Action Framework for Aerial Navigation in Complex Environments

57. Leveraging Foundational Models and Simple Fusion for Multi-modal Physiological Signal Analysis

58. Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

59. Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024

60. RFKG-CoT: Relation-Driven Adaptive Hop-count Selection and Few-Shot Path Guidance for Knowledge-Aware QA

61. Governing rapid technological change: Policy Delphi on the future of European AI governance

62. DEER: Draft with Diffusion, Verify with Autoregressive Models

63. MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

64. Offline Multi-Task Multi-Objective Data-Driven Evolutionary Algorithm with Language Surrogate Model and Implicit Q-Learning

65. From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

66. HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

67. Automatic Reward Shaping from Multi-Objective Human Heuristics

68. QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management

69. I am here for you”: How relational conversational AI appeals to adolescents, especially those who are socially and emotionally vulnerable

70. FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

71. How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

72. Feature-Centric Unsupervised Node Representation Learning Without Homophily Assumption

73. Quantifying Return on Security Controls in LLM Systems

74. PMMD: A pose-guided multi-view multi-modal diffusion for person generation

75. The Semantic Illusion: Certified Limits of Embedding-Based Hallucination Detection in RAG Systems

76. EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks

77. Tracking spatial temporal details in ultrasound long video via wavelet analysis and memory bank

78. Meta-learners for few-shot weakly-supervised optic disc and cup segmentation on fundus images

79. The Meta-Prompting Protocol: Orchestrating LLMs via Adversarial Feedback Loops