전체 AI 논문 - 2026-02-02

1. Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

Authors: Ali Asadi , Krishnendu Chatterjee , Ehsan Goharshady , Mehrdad Karrabi , Alipasha Montaseri , Carlo Pagano
URL: https://arxiv.org/abs/2601.23229
Abstract:

Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form a fundamental and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial and strongly-polynomial time algorithms is a fundamental problem for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for any arbitrary discount factor, and the seminal work of Ye established strongly–polynomial time for a fixed discount factor. The generalization of such results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly-polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a constant (fixed) discount factor, resolving an important algorithmic question.

2. Scaling Multiagent Systems with Process Rewards

Authors: Ed Li , Junyu Ren , Cat Yan
URL: https://arxiv.org/abs/2601.23228
Abstract:

While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. Through assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves +5.0–17.5pp on AIME and +7.8–17.2pp on AMC. For data analysis tasks, our method improves success rate by +12.5pp while quality metrics improve by up to 30%, validating that per-action supervision can lead to improvements across different multiagent system on various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.

3. High-quality generation of dynamic game content via small language models: A proof of concept

Authors: Morten I. K. Munk , Arturo Valdivia , Paolo Burelli
URL: https://arxiv.org/abs/2601.23206
Abstract:

Large language models (LLMs) offer promise for dynamic game content generation, but they face critical barriers, including narrative incoherence and high operational costs. Due to their large size, they are often accessed in the cloud, limiting their application in offline games. Many of these practical issues are solved by pivoting to small language models (SLMs), but existing studies using SLMs have resulted in poor output quality. We propose a strategy of achieving high-quality SLM generation through aggressive fine-tuning on deliberately scoped tasks with narrow context, constrained structure, or both. In short, more difficult tasks require narrower scope and higher specialization to the training corpus. Training data is synthetically generated via a DAG-based approach, grounding models in the specific game world. Such models can form the basis for agentic networks designed around the narratological framework at hand, representing a more practical and robust solution than cloud-dependent LLMs. To validate this approach, we present a proof-of-concept focusing on a single specialized SLM as the fundamental building block. We introduce a minimal RPG loop revolving around rhetorical battles of reputations, powered by this model. We demonstrate that a simple retry-until-success strategy reaches adequate quality (as defined by an LLM-as-a-judge scheme) with predictable latency suitable for real-time generation. While local quality assessment remains an open question, our results demonstrate feasibility for real-time generation under typical game engine constraints.

4. TSAQA: Time Series Analysis Question And Answering Benchmark

Authors: Baoyu Jing , Sanhorn Chen , Lecheng Zheng , Boyu Liu , Zihao Li , Jiaru Zou , Tianxin Wei , Zhining Liu , Zhichen Zeng , Ruizhong Qiu , Xiao Lin , Yuchen Yan , Dongqi Fu , Jingchao Ni , Jingrui He , Hanghang Tong
URL: https://arxiv.org/abs/2601.23204
Abstract:

Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ), to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance: the best-performing open-source model, LLaMA-3.1-8B, shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.

5. Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization

Authors: Hui Lu , Yi Yu , Yiming Yang , Chenyu Yi , Xueyi Ke , Qixing Zhang , Bingquan Shen , Alex Kot , Xudong Jiang
URL: https://arxiv.org/abs/2601.23179
Abstract:

Targeted adversarial attacks on closed-source multimodal large language models (MLLMs) have been increasingly explored under black-box transfer, yet prior methods are predominantly sample-specific and offer limited reusability across inputs. We instead study a more stringent setting, Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation must consistently steer arbitrary inputs toward a specified target across unknown commercial MLLMs. Naively adapting existing sample-wise attacks to this universal setting faces three core difficulties: (i) target supervision becomes high-variance due to target-crop randomness, (ii) token-wise matching is unreliable because universality suppresses image-specific cues that would otherwise anchor alignment, and (iii) few-source per-target adaptation is highly initialization-sensitive, which can degrade the attainable performance. In this work, we propose MCRMO-Attack, which stabilizes supervision via Multi-Crop Aggregation with an Attention-Guided Crop, improves token-level reliability through alignability-gated Token Routing, and meta-learns a cross-target perturbation prior that yields stronger per-target solutions. Across commercial MLLMs, we boost unseen-image attack success rate by +23.7\% on GPT-4o and +19.9\% on Gemini-2.0 over the strongest universal baseline.

6. THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Authors: Seanie Lee , Sangwoo Park , Yumin Choi , Gyeongman Kim , Minki Kang , Jihun Yun , Dongmin Park , Jongho Park , Sung Ju Hwang
URL: https://arxiv.org/abs/2601.23143
Abstract:

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at this https URL .

Authors: Edward Y. Chang , Longling Geng
URL: https://arxiv.org/abs/2601.23133
Abstract:

Inference-time scaling can amplify reasoning pathologies: sycophancy, rung collapse, and premature certainty. We present RAudit, a diagnostic protocol for auditing LLM reasoning without ground truth access. The key constraint is blindness: the auditor evaluates only whether derivation steps support conclusions, enabling detection of trace-output inconsistency and, when latent competence exists, its recovery. RAudit measures process quality via CRIT-based reasonableness scores and varies critique formulation to study how social framing affects model response. We prove bounded correction and $O(\log(1/\epsilon))$ termination. Experiments on mathematical reasoning (CAP-GSM8K) and causal judgment (CausalL2) reveal four mechanisms explaining model unreliability: (1) Latent Competence Suppression, where models derive correct answers then overwrite them under social pressure; (2) The False Competence Trap, where weaker judges mask sycophancy that stronger judges expose; (3) The Complexity-Vulnerability Tradeoff, where causal tasks induce more than 10 times higher sycophancy than mathematical tasks; and (4) Iatrogenic Critique, where authoritative correction harms weaker models. These findings challenge assumptions that capability implies robustness and that stronger feedback yields better outputs.

8. Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Authors: Nathaniel Mitrani Hadida , Sassan Bhanji , Cameron Tice , Puria Radmard
URL: https://arxiv.org/abs/2601.23086
Abstract:

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model’s decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model’s final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

9. MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration

Authors: Yakun Zhu , Yutong Huang , Shengqian Qin , Zhongzhen Huang , Shaoting Zhang , Xiaofan Zhang
URL: https://arxiv.org/abs/2601.23049
Abstract:

Medical calculators are fundamental to quantitative, evidence-based clinical practice. However, their real-world use is an adaptive, multi-stage process, requiring proactive EHR data acquisition, scenario-dependent calculator selection, and multi-step computation, whereas current benchmarks focus only on static single-step calculations with explicit instructions. To address these limitations, we introduce MedMCP-Calc, the first benchmark for evaluating LLMs in realistic medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP-Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions mimicking natural queries, structured EHR database interaction, external reference retrieval, and process-level evaluation. Our evaluation of 23 leading models reveals critical limitations: even top performers like Claude Opus 4.5 exhibit substantial gaps, including difficulty selecting appropriate calculators for end-to-end workflows given fuzzy queries, poor performance in iterative SQL-based database interactions, and marked reluctance to leverage external tools for numerical computation. Performance also varies considerably across clinical domains. Building on these findings, we develop CalcMate, a fine-tuned model incorporating scenario planning and tool augmentation, achieving state-of-the-art performance among open-source models. Benchmark and Codes are available in this https URL .

10. From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

Authors: Bowen Cao , Dongdong Zhang , Yixia Li , Junpeng Liu , Shijue Huang , Chufan Shi , Hongyuan Lu , Yaokang Wu , Guanhua Chen , Wai Lam , Furu Wei
URL: https://arxiv.org/abs/2601.23048
Abstract:

Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.

11. The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

Authors: Alexander Hägele , Aryo Pradipta Gema , Henry Sleight , Ethan Perez , Jascha Sohl-Dickstein
URL: https://arxiv.org/abs/2601.23045
Abstract:

As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand how extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI’s \emph{incoherence} on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, \emph{the more incoherent} their failures become. Incoherence changes with model scale in a way that is experiment dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior. This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal. This increases the relative importance of alignment research targeting reward hacking or goal misspecification.

12. Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

Authors: Siyu Gong , Linan Yue , Weibo Gao , Fangzhou Yao , Shimin Di , Lei Feng , Min-Ling Zhang
URL: https://arxiv.org/abs/2601.23032
Abstract:

Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches depend on high-quality synthesized trajectories selected by scoring functions and sparse outcome-based rewards, providing limited and biased supervision for learning TIR. To address these challenges, in this paper, we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are directly retained, while low-quality ones are repaired using a LLM (i.e., LLM-as-Repairer). The resulting repaired and high-quality trajectories form a synthetic SFT dataset, while each repaired trajectory paired with its original low-quality counterpart constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, based on the preference dataset, we train a trajectory-level reward model to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding the optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj in TIR.

13. TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI

Authors: Roham Koohestani , Ateş Görpelioğlu , Egor Klimov , Burcu Kulahcioglu Ozkan , Maliheh Izadi
URL: https://arxiv.org/abs/2601.22997
Abstract:

Agentic AI systems act through tools and evolve their behavior over long, stochastic interaction traces. This setting complicates assurance, because behavior depends on nondeterministic environments and probabilistic model outputs. Prior work introduced runtime verification for agentic AI via Dynamic Probabilistic Assurance (DPA), learning an MDP online and model checking quantitative properties. A key limitation is that developers must manually define the state abstraction, which couples verification to application-specific heuristics and increases adoption friction. This paper proposes TriCEGAR, a trace-driven abstraction mechanism that automates state construction from execution logs and supports online construction of an agent behavioral MDP. TriCEGAR represents abstractions as predicate trees learned from traces and refined using counterexamples. We describe a framework-native implementation that (i) captures typed agent lifecycle events, (ii) builds abstractions from traces, (iii) constructs an MDP, and (iv) performs probabilistic model checking to compute bounds such as Pmax(success) and Pmin(failure). We also show how run likelihoods enable anomaly detection as a guardrailing signal.

14. Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Authors: Yuhao Zhan , Tianyu Fan , Linxuan Huang , Zirui Guo , Chao Huang
URL: https://arxiv.org/abs/2601.22984
Abstract:

Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy to categorize hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy into a fine-grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six state-of-theart DRAs reveal that no system achieves robust reliability. Furthermore, our diagnostic analysis traces the etiology of these failures to systemic deficits, specifically hallucination propagation and cognitive biases, providing foundational insights to guide future architectural optimization. Data and code are available at this https URL .

15. Quantifying Model Uniqueness in Heterogeneous AI Ecosystems

Authors: Lei You
URL: https://arxiv.org/abs/2601.22977
Abstract:

As AI systems evolve from isolated predictors into complex, heterogeneous ecosystems of foundation models and specialized adapters, distinguishing genuine behavioral novelty from functional redundancy becomes a critical governance challenge. Here, we introduce a statistical framework for auditing model uniqueness based on In-Silico Quasi-Experimental Design (ISQED). By enforcing matched interventions across models, we isolate intrinsic model identity and quantify uniqueness as the Peer-Inexpressible Residual (PIER), i.e. the component of a target’s behavior strictly irreducible to any stochastic convex combination of its peers, with vanishing PIER characterizing when such a routing-based substitution becomes possible. We establish the theoretical foundations of ecosystem auditing through three key contributions. First, we prove a fundamental limitation of observational logs: uniqueness is mathematically non-identifiable without intervention control. Second, we derive a scaling law for active auditing, showing that our adaptive query protocol achieves minimax-optimal sample efficiency ($d\sigma^2\gamma^{-2}\log(Nd/\delta)$). Third, we demonstrate that cooperative game-theoretic methods, such as Shapley values, fundamentally fail to detect redundancy. We implement this framework via the DISCO (Design-Integrated Synthetic Control) estimator and deploy it across diverse ecosystems, including computer vision models (ResNet/ConvNeXt/ViT), large language models (BERT/RoBERTa), and city-scale traffic forecasters. These results move trustworthy AI beyond explaining single models: they establish a principled, intervention-based science of auditing and governing heterogeneous model ecosystems.

16. Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Authors: Ximing Lu , David Acuna , Jaehun Jung , Jian Hu , Di Zhang , Shizhe Diao , Yunheng Zou , Shaokun Zhang , Brandon Cui , Mingjie Liu , Hyunwoo Kim , Prithviraj Ammanabrolu , Jan Kautz , Yi Dong , Yejin Choi
URL: https://arxiv.org/abs/2601.22975
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.

17. EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning

Authors: Yufei He , Juncheng Liu , Zhiyuan Hu , Yulin Chen , Yue Liu , Yuan Sui , Yibo Li , Nuo Chen , Jun Hu , Bryan Hooi , Xinxing Xu , Jiang Bian
URL: https://arxiv.org/abs/2601.22964
Abstract:

Prevailing medical AI operates on an unrealistic ‘‘one-shot’’ model, diagnosing from a complete patient file. However, real-world diagnosis is an iterative inquiry where Clinicians sequentially ask questions and order tests to strategically gather information while managing cost and time. To address this, we first propose Med-Inquire, a new benchmark designed to evaluate an agent’s ability to perform multi-turn diagnosis. Built upon a dataset of real-world clinical cases, Med-Inquire simulates the diagnostic process by hiding a complete patient file behind specialized Patient and Examination agents. They force the agent to proactively ask questions and order tests to gather information piece by piece. To tackle the challenges posed by Med-Inquire, we then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Its core is a ‘‘Diagnose-Grade-Evolve’’ loop: an Actor agent attempts a diagnosis; a Process Grader agent performs credit assignment by evaluating each action for both clinical yield and resource efficiency; finally, an Evolver agent uses this feedback to update the Actor’s strategy by evolving its prompt and memory. Our experiments show EvoClinician outperforms continual learning baselines and other self-evolving agents like memory agents. The code is available at this https URL

18. Alignment among Language, Vision and Action Representations

Authors: Nicola Milano , Stefano Nolfi
URL: https://arxiv.org/abs/2601.22948
Abstract:

A fundamental question in cognitive science and AI concerns whether different learning modalities: language, vision, and action, give rise to distinct or shared internal representations. Traditional views assume that models trained on different data types develop specialized, non-transferable representations. However, recent evidence suggests unexpected convergence: models optimized for distinct tasks may develop similar representational geometries. We investigate whether this convergence extends to embodied action learning by training a transformer-based agent to execute goal-directed behaviors in response to natural language instructions. Using behavioral cloning on the BabyAI platform, we generated action-grounded language embeddings shaped exclusively by sensorimotor control requirements. We then compared these representations with those extracted from state-of-the-art large language models (LLaMA, Qwen, DeepSeek, BERT) and vision-language models (CLIP, BLIP). Despite substantial differences in training data, modality, and objectives, we observed robust cross-modal alignment. Action representations aligned strongly with decoder-only language models and BLIP (precision@15: 0.70-0.73), approaching the alignment observed among language models themselves. Alignment with CLIP and BERT was significantly weaker. These findings indicate that linguistic, visual, and action representations converge toward partially shared semantic structures, supporting modality-independent semantic organization and highlighting potential for cross-domain transfer in embodied AI systems.

19. MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

Authors: Xuancheng Li , Haitao Li , Yujia Zhou , YiqunLiu , Qingyao Ai
URL: https://arxiv.org/abs/2601.22900
Abstract:

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in multiple domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely indicate failure and provide no insight into why the reasoning fails. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework. It builds on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples, (2) two complementary learning signals for within-turn and cross-turn optimization, and (3) structured feedback injection into the model’s reasoning process. Trained on sampled OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.

20. Game-Theoretic Co-Evolution for LLM-Based Heuristic Discovery

Authors: Xinyi Ke , Kai Li , Junliang Xing , Yifan Zhang , Jian Cheng
URL: https://arxiv.org/abs/2601.22896
Abstract:

Large language models (LLMs) have enabled rapid progress in automatic heuristic discovery (AHD), yet most existing methods are predominantly limited by static evaluation against fixed instance distributions, leading to potential overfitting and poor generalization under distributional shifts. We propose Algorithm Space Response Oracles (ASRO), a game-theoretic framework that reframes heuristic discovery as a program level co-evolution between solver and instance generator. ASRO models their interaction as a two-player zero-sum game, maintains growing strategy pools on both sides, and iteratively expands them via LLM-based best-response oracles against mixed opponent meta-strategies, thereby replacing static evaluation with an adaptive, self-generated curriculum. Across multiple combinatorial optimization domains, ASRO consistently outperforms static-training AHD baselines built on the same program search mechanisms, achieving substantially improved generalization and robustness on diverse and out-of-distribution instances.

21. Aligning the Unseen in Attributed Graphs: Interplay between Graph Geometry and Node Attributes Manifold

Authors: Aldric Labarthe (CB, UNIGE), Roland Bouffanais (UNIGE), Julien Randon-Furling (CB)
URL: https://arxiv.org/abs/2601.22806
Abstract:

The standard approach to representation learning on attributed graphs – i.e., simultaneously reconstructing node attributes and graph structure – is geometrically flawed, as it merges two potentially incompatible metric spaces. This forces a destructive alignment that erodes information about the graph’s underlying generative process. To recover this lost signal, we introduce a custom variational autoencoder that separates manifold learning from structural alignment. By quantifying the metric distortion needed to map the attribute manifold onto the graph’s Heat Kernel, we transform geometric conflict into an interpretable structural descriptor. Experiments show our method uncovers connectivity patterns and anomalies undetectable by conventional approaches, proving both their theoretical inadequacy and practical limitations.

22. CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

Authors: Ji Shi , Peiming Guo , Meishan Zhang , Miao Zhang , Xuebo Liu , Min Zhang , Weili Guan
URL: https://arxiv.org/abs/2601.22803
Abstract:

Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first theoretically analyze showing that branch coverage, sample difficulty, syntactic and functional correctness can be jointly modeled as RL rewards, where optimizing these signals can improve the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty–aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state-of-the-art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, while delivering over $20\times$ faster inference than competitive baselines. Code is available at this https URL

23. Conditional Performance Guarantee for Large Reasoning Models

Authors: Jianguo Huang , Hao Zeng , Bingyi Jing , Hongxin Wei , Bo An
URL: https://arxiv.org/abs/2601.22790
Abstract:

Large reasoning models have shown strong performance through extended chain-of-thought reasoning, yet their computational cost remains significant. Probably approximately correct (PAC) reasoning provides statistical guarantees for efficient reasoning by adaptively switching between thinking and non-thinking models, but the guarantee holds only in the marginal case and does not provide exact conditional coverage. We propose G-PAC reasoning, a practical framework that provides PAC-style guarantees at the group level by partitioning the input space. We develop two instantiations: Group PAC (G-PAC) reasoning for known group structures and Clustered PAC (C-PAC) reasoning for unknown groupings. We prove that both G-PAC and C-PAC achieve group-conditional risk control, and that grouping can strictly improve efficiency over marginal PAC reasoning in heterogeneous settings. Our experiments on diverse reasoning benchmarks demonstrate that G-PAC and C-PAC successfully achieve group-conditional risk control while maintaining substantial computational savings.

24. Toward IIT-Inspired Consciousness in LLMs: A Reward-Based Learning Framework

Authors: Hamid Reza Akbari , Mohammad Hossein Sameti , Amir M. Mansourian , Mohammad Hossein Rohban , Hossein Sameti
URL: https://arxiv.org/abs/2601.22786
Abstract:

The pursuit of Artificial General Intelligence (AGI) is a central goal in language model development, in which consciousness-like processing could serve as a key facilitator. While current language models are not conscious, they exhibit behaviors analogous to certain aspects of consciousness. This paper investigates the implementation of a leading theory of consciousness, Integrated Information Theory (IIT), within language models via a reward-based learning paradigm. IIT provides a formal, axiom-based mathematical framework for quantifying consciousness. Drawing inspiration from its core principles, we formulate a novel reward function that quantifies a text’s causality, coherence and integration, characteristics associated with conscious processing. Empirically, it is found that optimizing for this IIT-inspired reward leads to more concise text generation. On out of domain tasks, careful tuning achieves up to a 31% reduction in output length while preserving accuracy levels comparable to the base model. In addition to primary task performance, the broader effects of this training methodology on the model’s confidence calibration and test-time computational scaling is analyzed. The proposed framework offers significant practical advantages: it is conceptually simple, computationally efficient, requires no external data or auxiliary models, and leverages a general, capability-driven signal rather than task-specific heuristics. Code available at this https URL

25. Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training

Authors: Linjia Kang , Zhimin Wang , Yongkang Zhang , Duo Wu , Jinghe Wang , Ming Ma , Haopeng Yan , Zhi Wang
URL: https://arxiv.org/abs/2601.22781
Abstract:

Large-scale, high-quality interaction trajectories are essential for advancing mobile Graphical User Interface (GUI) agents. While existing methods typically rely on labor-intensive human demonstrations or automated model exploration to generate GUI trajectories, they lack fine-grained control over task difficulty. This fundamentally restricts learning effectiveness due to the mismatch between the training difficulty and the agent’s capabilities. Inspired by how humans acquire skills through progressively challenging tasks, we propose MobileGen, a novel data generation framework that adaptively aligns training difficulty with the GUI agent’s capability frontier. Specifically, MobileGen explicitly decouples task difficulty into structural (e.g., trajectory length) and semantic (e.g., task goal) dimensions. It then iteratively evaluates the agent on a curated prior dataset to construct a systematic profile of its capability frontier across these two dimensions. With this profile, the probability distribution of task difficulty is adaptively computed, from which the target difficulty for the next round of training can be sampled. Guided by the sampled difficulty, a multi-agent controllable generator is finally used to synthesize high-quality interaction trajectories along with corresponding task instructions. Extensive experiments show that MobileGen consistently outperforms existing data generation methods by improving the average performance of GUI agents by 1.57 times across multiple challenging benchmarks. This highlights the importance of capability-aligned data generation for effective mobile GUI agent training.

26. TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

Authors: Shichao Ma , Zhiyuan Ma , Ming Yang , Xiaofan Li , Xing Wu , Jintao Du , Yu Cheng , Weiqiang Wang , Qiliang Liu , Zhengyang Zhou , Yang Wang
URL: https://arxiv.org/abs/2601.22776
Abstract:

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a “Double Homogenization Dilemma.” This manifests as (1) Process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored. (2) Intra-group homogenization, coarse-grained outcome rewards often lead to inefficiencies in intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.

Authors: Libin Qiu , Zhirong Gao , Junfu Chen , Yuhang Ye , Weizhi Huang , Xiaobo Xue , Wenkai Qiu , Shuo Tang
URL: https://arxiv.org/abs/2601.22758
Abstract:

Large language model agents often fail to accumulate knowledge from experience, treating each task as an independent challenge. Recent methods extract experience as flattened textual knowledge, which cannot capture procedural logic of complex subtasks. They also lack maintenance mechanisms, causing repository degradation as experience accumulates. We introduce AutoRefine, a framework that extracts and maintains dual-form Experience Patterns from agent execution histories. For procedural subtasks, we extract specialized subagents with independent reasoning and memory. For static knowledge, we extract skill patterns as guidelines or code snippets. A continuous maintenance mechanism scores, prunes, and merges patterns to prevent repository degradation. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, AutoRefine achieves 98.4%, 70.4%, and 27.1% respectively, with 20-73% step reductions. On TravelPlanner, automatic extraction exceeds manually designed systems (27.1% vs 12.1%), demonstrating its ability to capture procedural coordination.

28. A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

Authors: Shiye Lei , Zhihao Cheng , Dacheng Tao
URL: https://arxiv.org/abs/2601.22718
Abstract:

Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.

29. Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference

Authors: Emilien Biré , María Santos , Kai Yuan
URL: https://arxiv.org/abs/2601.22701
Abstract:

Vision-Language Models (VLMs) have become powerful backbones for agents to autonomously operate in digital environments like the web and operating systems. However, these models suffer from inadaptability to fast-changing environments like the web, which can be alleviated by fine-tuning requiring expansive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference without policy retraining. Fundamentally, our approach decouples the VLM’s role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. Then, a lightweight, offline-trained Q-function reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is to apply the Q-function directly during inference for immediate policy improvement, and not offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.

30. Real-Time Aligned Reward Model beyond Semantics

Authors: Zixuan Huang , Xin Xia , Yuxi Ren , Jianbin Zheng , Xuefeng Xiao , Hongyan Xie , Li Huaqiu , Songshi Liang , Zhongxiang Dai , Fuzhen Zhuang , Jianxin Li , Yikun Ban , Deqing Wang
URL: https://arxiv.org/abs/2601.22664
Abstract:

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model, exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily relies on surface semantic information and fails to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.

31. Task-Aware LLM Council with Adaptive Decision Pathways for Decision Support

Authors: Wei Zhu , Lixing Yu , Hao-Ren Yao , Zhiwen Tang , Kun Yue
URL: https://arxiv.org/abs/2601.22662
Abstract:

Large language models (LLMs) have shown strong capabilities across diverse decision-making tasks. However, existing approaches often overlook the specialization differences among available models, treating all LLMs as uniformly applicable regardless of task characteristics. This limits their ability to adapt to varying reasoning demands and task complexities. In this work, we propose Task-Aware LLM Council (TALC), a task-adaptive decision framework that integrates a council of LLMs with Monte Carlo Tree Search (MCTS) to enable dynamic expert selection and efficient multi-step planning. Each LLM is equipped with a structured success memory profile derived from prior task trajectories, enabling semantic matching between current reasoning context and past successes. At each decision point, TALC routes control to the most contextually appropriate model and estimates node value using a dual-signal mechanism that fuses model-based evaluations with historical utility scores. These signals are adaptively weighted based on intra-node variance and used to guide MCTS selection, allowing the system to balance exploration depth with planning confidence. Experiments on WebShop, HumanEval, and the Game of 24 demonstrate that TALC achieves superior task success rates and improved search efficiency compared to strong baselines, validating the benefits of specialization-aware routing and adaptive planning.

32. UCPO: Uncertainty-Aware Policy Optimization

Authors: Xianzhou Zeng , Jing Huang , Chunmei Xie , Gongrui Nan , Siye Chen , Mengyu Lu , Weiqi Xiong , Qixuan Zhou , Junhao Zhang , Qiang Zhu , Yadong Li , Xingzhong Xu
URL: https://arxiv.org/abs/2601.22648
Abstract:

The key to building trustworthy Large Language Models (LLMs) lies in endowing them with inherent uncertainty expression capabilities to mitigate the hallucinations that restrict their high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism is introduced to calibrate uncertainty weights in real-time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly improving the reliability and calibration of the model beyond their knowledge boundaries.

33. Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments

Authors: Jinwoo Jang , Minjong Yoo , Sihyung Yoon , Honguk Woo
URL: https://arxiv.org/abs/2601.22647
Abstract:

Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.

34. Beyond Medical Chatbots: Meddollina and the Rise of Continuous Clinical Intelligence

Authors: Vaibhav Ram S. V. N. S , Swetanshu Agrawal , Samudra Banerjee , Abdul Muhsin
URL: https://arxiv.org/abs/2601.22645
Abstract:

Generative medical AI now appears fluent and knowledgeable enough to resemble clinical intelligence, encouraging the belief that scaling will make it safe. But clinical reasoning is not text generation. It is a responsibility-bound process under ambiguity, incomplete evidence, and longitudinal context. Even as benchmark scores rise, generation-centric systems still show behaviours incompatible with clinical deployment: premature closure, unjustified certainty, intent drift, and instability across multi-step decisions. We argue these are structural consequences of treating medicine as next-token prediction. We formalise Clinical Contextual Intelligence (CCI) as a distinct capability class required for real-world clinical use, defined by persistent context awareness, intent preservation, bounded inference, and principled deferral when evidence is insufficient. We introduce Meddollina, a governance-first clinical intelligence system designed to constrain inference before language realisation, prioritising clinical appropriateness over generative completeness. Meddollina acts as a continuous intelligence layer supporting clinical workflows while preserving clinician authority. We evaluate Meddollina using a behaviour-first regime across 16,412+ heterogeneous medical queries, benchmarking against general-purpose models, medical-tuned models, and retrieval-augmented systems. Meddollina exhibits a distinct behavioural profile: calibrated uncertainty, conservative reasoning under underspecification, stable longitudinal constraint adherence, and reduced speculative completion relative to generation-centric baselines. These results suggest deployable medical AI will not emerge from scaling alone, motivating a shift toward Continuous Clinical Intelligence, where progress is measured by clinician-aligned behaviour under uncertainty rather than fluency-driven completion.

35. Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Authors: Mingqian Feng , Xiaodong Liu , Weiwei Yang , Chenliang Xu , Christopher White , Jianfeng Gao
URL: https://arxiv.org/abs/2601.22636
Abstract:

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, which is an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to future research.

36. SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly

Authors: Wei Zhu , Zhiwen Tang , Kun Yue
URL: https://arxiv.org/abs/2601.22623
Abstract:

Recent advancements have increasingly focused on leveraging large language models (LLMs) to construct autonomous agents for complex problem-solving tasks. However, existing approaches predominantly employ a single-agent framework to generate search branches and estimate rewards during Monte Carlo Tree Search (MCTS) planning. This single-agent paradigm inherently limits exploration capabilities, often resulting in insufficient diversity among generated branches and suboptimal planning performance. To overcome these limitations, we propose Synergistic Multi-agent Planning with Heterogeneous langauge model assembly (SYMPHONY), a novel multi-agent planning framework that integrates a pool of heterogeneous language model-based agents. By leveraging diverse reasoning patterns across agents, SYMPHONY enhances rollout diversity and facilitates more effective exploration. Empirical results across multiple benchmark tasks show that SYMPHONY achieves strong performance even when instantiated with open-source LLMs deployable on consumer-grade hardware. When enhanced with cloud-based LLMs accessible via API, SYMPHONY demonstrates further improvements, outperforming existing state-of-the-art baselines and underscoring the effectiveness of heterogeneous multi-agent coordination in planning tasks.

37. EntroCut: Entropy-Guided Adaptive Truncation for Efficient Chain-of-Thought Reasoning in Small-scale Large Reasoning Models

Authors: Hongxi Yan , Qingjie Liu , Yunhong Wang
URL: https://arxiv.org/abs/2601.22617
Abstract:

Large Reasoning Models (LRMs) excel at complex reasoning tasks through extended chain-of-thought generation, but their reliance on lengthy intermediate steps incurs substantial computational cost. We find that the entropy of the model’s output distribution in early reasoning steps reliably distinguishes correct from incorrect reasoning. Motivated by this observation, we propose EntroCut, a training-free method that dynamically truncates reasoning by identifying high-confidence states where reasoning can be safely terminated. To comprehensively evaluate the trade-off between efficiency and accuracy, we introduce the Efficiency-Performance Ratio (EPR), a unified metric that quantifies relative token savings per unit accuracy loss. Experiments on four benchmarks show that EntroCut reduces token usage by up to 40\% with minimal accuracy sacrifice, achieving superior efficiency-performance trade-offs compared with existing training-free methods. These results demonstrate that entropy-guided dynamic truncation provides a practical approach to mitigate the inefficiency of LRMs.

38. From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Authors: Jiaxuan Gao , Jiaao Chen , Chuyi He , Wei-Chen Wang , Shusheng Xu , Hanrui Wang , Di Jin , Yi Wu
URL: https://arxiv.org/abs/2601.22607
Abstract:

Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking, multi-step tool execution, while following complex instructions. Post-training such agents is challenging because synthesis for high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.

39. Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

Authors: Hao Yi , Yulan Hu , Xin Li , Sheng Ouyang , Lizhong Ding , Yong Liu
URL: https://arxiv.org/abs/2601.22595
Abstract:

Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, due to ignoring objective uncertainty when only selecting by subjective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC estimation is difficult. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks.

40. WED-Net: A Weather-Effect Disentanglement Network with Causal Augmentation for Urban Flow Prediction

Authors: Qian Hong , Siyuan Chang , Xiao Zhou
URL: https://arxiv.org/abs/2601.22586
Abstract:

Urban spatio-temporal prediction under extreme conditions (e.g., heavy rain) is challenging due to event rarity and dynamics. Existing data-driven approaches that incorporate weather as auxiliary input often rely on coarse-grained descriptors and lack dedicated mechanisms to capture fine-grained spatio-temporal effects. Although recent methods adopt causal techniques to improve out-of-distribution generalization, they typically overlook temporal dynamics or depend on fixed confounder stratification. To address these limitations, we propose WED-Net (Weather-Effect Disentanglement Network), a dual-branch Transformer architecture that separates intrinsic and weather-induced traffic patterns via self- and cross-attention, enhanced with memory banks and fused through adaptive gating. To further promote disentanglement, we introduce a discriminator that explicitly distinguishes weather conditions. Additionally, we design a causal data augmentation strategy that perturbs non-causal parts while preserving causal structures, enabling improved generalization under rare scenarios. Experiments on taxi-flow datasets from three cities demonstrate that WED-Net delivers robust performance under extreme weather conditions, highlighting its potential to support safer mobility, highlighting its potential to support safer mobility, disaster preparedness, and urban resilience in real-world settings. The code is publicly available at this https URL .

41. PerfGuard: A Performance-Aware Agent for Visual Content Generation

Authors: Zhipeng Chen , Zhongrui Zhang , Chao Zhang , Yifan Xu , Lan Yang , Jun Liu , Ke Li , Yi-Zhe Song
URL: https://arxiv.org/abs/2601.22571
Abstract:

The advancement of Large Language Model (LLM)-powered agents has enabled automated task processing through reasoning and tool invocation capabilities. However, existing frameworks often operate under the idealized assumption that tool executions are invariably successful, relying solely on textual descriptions that fail to distinguish precise performance boundaries and cannot adapt to iterative tool updates. This gap introduces uncertainty in planning and execution, particularly in domains like visual content generation (AIGC), where nuanced tool performance significantly impacts outcomes. To address this, we propose PerfGuard, a performance-aware agent framework for visual content generation that systematically models tool performance boundaries and integrates them into task planning and scheduling. Our framework introduces three core mechanisms: (1) Performance-Aware Selection Modeling (PASM), which replaces generic tool descriptions with a multi-dimensional scoring system based on fine-grained performance evaluations; (2) Adaptive Preference Update (APU), which dynamically optimizes tool selection by comparing theoretical rankings with actual execution rankings; and (3) Capability-Aligned Planning Optimization (CAPO), which guides the planner to generate subtasks aligned with performance-aware strategies. Experimental comparisons against state-of-the-art methods demonstrate PerfGuard’s advantages in tool selection accuracy, execution reliability, and alignment with user intent, validating its robustness and practical utility for complex AIGC tasks. The project code is available at this https URL .

42. Decoding in Geometry: Alleviating Embedding-Space Crowding for Complex Reasoning

Authors: Yixin Yang , Qingxiu Dong , Zhifang Sui
URL: https://arxiv.org/abs/2601.22536
Abstract:

Sampling-based decoding underlies complex reasoning in large language models (LLMs), where decoding strategies critically shape model behavior. Temperature- and truncation-based methods reshape the next-token distribution through global probability reweighting or thresholding to balance the quality-diversity tradeoff. However, they operate solely on token probabilities, ignoring fine-grained relationships among tokens in the embedding space. We uncover a novel phenomenon, embedding-space crowding, where the next-token distribution concentrates its probability mass on geometrically close tokens in the embedding space. We quantify crowding at multiple granularities and find a statistical association with reasoning success in mathematical problem solving. Motivated by this finding, we propose CraEG, a plug-and-play sampling method that mitigates crowding through geometry-guided reweighting. CraEG is training-free, single-pass, and compatible with standard sampling strategies. Experiments on multiple models and benchmarks demonstrate improved generation performance, with gains in robustness and diversity metrics.

43. Enhancing TableQA through Verifiable Reasoning Trace Reward

Authors: Tung Sum Thomas Kwok , Xinyu Wang , Hengzhi He , Xiaofeng Lin , Peng Lu , Liheng Ma , Chunhe Wang , Ying Nian Wu , Lei Ding , Guang Cheng
URL: https://arxiv.org/abs/2601.22530
Abstract:

A major challenge in training TableQA agents, compared to standard text- and image-based agents, is that answers cannot be inferred from a static input but must be reasoned through stepwise transformations of the table state, introducing multi-step reasoning complexity and environmental interaction. This leads to a research question: Can explicit feedback on table transformation action improve model reasoning capability? In this work, we introduce RE-Tab, a plug-and-play framework that architecturally enhances trajectory search via lightweight, training-free reward modeling by formulating the problem as a Partially Observable Markov Decision Process. We demonstrate that providing explicit verifiable rewards during State Transition (What is the best action?'') and Simulative Reasoning (Am I sure about the output?’’) is crucial to steer the agent’s navigation in table states. By enforcing stepwise reasoning with reward feedback in table transformations, RE-Tab achieves state-of-the-art performance in TableQA with almost 25\% drop in inference cost. Furthermore, a direct plug-and-play implementation of RE-Tab brings up to 41.77% improvement in QA accuracy and 33.33% drop in test-time inference samples for consistent answer. Consistent improvement pattern across various LLMs and state-of-the-art benchmarks further confirms RE-Tab’s generalisability. The repository is available at this https URL .

44. Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution

Authors: Hongze Mi , Yibo Feng , WenJie Lu , Song Cao , Jinyuan Li , Yanming Li , Xuelin Zhang , Haotian Luo , Songyang Peng , He Cui , Tengfei Tian , Jun Fang , Hua Chai , Naiqiang Tan
URL: https://arxiv.org/abs/2601.22528
Abstract:

Multimodal Large Language Model (MLLM) agents facilitate Graphical User Interface (GUI) automation but struggle with long-horizon, cross-application tasks due to limited context windows. While memory systems provide a viable solution, existing paradigms struggle to adapt to dynamic GUI environments, suffering from a granularity mismatch between high-level intent and low-level execution, and context pollution where the static accumulation of outdated experiences drives agents into hallucination. To address these bottlenecks, we propose the Darwinian Memory System (DMS), a self-evolving architecture that constructs memory as a dynamic ecosystem governed by the law of survival of the fittest. DMS decomposes complex trajectories into independent, reusable units for compositional flexibility, and implements Utility-driven Natural Selection to track survival value, actively pruning suboptimal paths and inhibiting high-risk plans. This evolutionary pressure compels the agent to derive superior strategies. Extensive experiments on real-world multi-app benchmarks validate that DMS boosts general-purpose MLLMs without training costs or architectural overhead, achieving average gains of 18.0% in success rate and 33.9% in execution stability, while reducing task latency, establishing it as an effective self-evolving memory system for GUI tasks.

45. Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models

Authors: Shi Fu , Yingjie Wang , Shengchao Hu , Peng Wang , Dacheng Tao
URL: https://arxiv.org/abs/2601.22513
Abstract:

Self-Rewarding Language Models (SRLMs) achieve notable success in iteratively improving alignment without external feedback. Yet, despite their striking empirical progress, the core mechanisms driving their capabilities remain unelucidated, leaving a critical gap in theoretical understanding. This paper provides the first rigorous theoretical guarantees for SRLMs. We first establish a lower bound that characterizes the fundamental limits of a single update step, revealing a critical dependence on the quality of the initial model. We then derive finite-sample error bounds for the full iterative paradigm, showing that performance improves at a rate of $\widetilde{\mathcal{O}}\left(1/\sqrt{n}\right)$ with sample size $n$. Crucially, our analysis reveals that the dependence on the initial model decays exponentially with the number of iterations $T$. This provides a formal explanation for why self-rewarding succeeds: it robustly overcomes poor initialization by steering the dynamics toward internal stability and consistency. Finally, we instantiate our theoretical framework for the linear softmax model class, yielding tailored guarantees that connect our high-level insights to practical model architectures.

46. Controllable Information Production

Authors: Tristan Shah , Stas Tiomkin
URL: https://arxiv.org/abs/2601.22449
Abstract:

Intrinsic Motivation (IM) is a paradigm for generating intelligent behavior without external utilities. The existing information-theoretic methods for IM are predominantly based on information transmission, which explicitly depends on the designer’s choice of which random variables engage in transmission. In this work, we introduce a novel IM principle, Controllable Information Production (CIP), that avoids both external utilities and designer-specified variables. We derive the CIP objective from Optimal Control, showing a connection between extrinsic and intrinsic behaviors. CIP appears as the gap between open-loop and closed-loop Kolmogorov-Sinai entropies, which simultaneously rewards the pursuit and regulation of chaos. We establish key theoretical properties of CIP and demonstrate its effectiveness on standard IM benchmarks.

47. Anytime Safe PAC Efficient Reasoning

Authors: Chengyao Yu , Hao Zeng , Youxin Zhu , Jianguo Huang , Huajun Zeng , Bingyi Jing
URL: https://arxiv.org/abs/2601.22446
Abstract:

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks but suffer from high computational costs and latency. While selective thinking strategies improve efficiency by routing easy queries to non-thinking models, existing approaches often incur uncontrollable errors, especially in online settings where the performance loss of a non-thinking model is only partially observed and data are non-stationary. To address this, we propose Betting Probably Approximately Correct (B-PAC) reasoning, a principled method that enables anytime safe and efficient online reasoning under partial feedback. Specifically, we utilize inverse propensity scoring estimators to construct test supermartingales for candidate thresholds, and then dynamically adjust the routing threshold based on the accumulated statistical evidence of safety. Theoretically, we establish the anytime-valid performance loss control and the efficiency of B-PAC reasoning. Extensive experiments demonstrate that B-PAC reasoning significantly reduces computational overhead, decreasing thinking model usage by up to 81.01\%, while controlling the performance loss below the user-specified level.

48. When LLM meets Fuzzy-TOPSIS for Personnel Selection through Automated Profile Analysis

Authors: Shahria Hoque , Ahmed Akib Jawad Karim , Md. Golam Rabiul Alam , Nirjhar Gope
URL: https://arxiv.org/abs/2601.22433
Abstract:

In this highly competitive employment environment, the selection of suitable personnel is essential for organizational success. This study presents an automated personnel selection system that utilizes sophisticated natural language processing (NLP) methods to assess and rank software engineering applicants. A distinctive dataset was created by aggregating LinkedIn profiles that include essential features such as education, work experience, abilities, and self-introduction, further enhanced with expert assessments to function as standards. The research combines large language models (LLMs) with multicriteria decision-making (MCDM) theory to develop the LLM-TOPSIS framework. In this context, we utilized the TOPSIS method enhanced by fuzzy logic (Fuzzy TOPSIS) to address the intrinsic ambiguity and subjectivity in human assessments. We utilized triangular fuzzy numbers (TFNs) to describe criteria weights and scores, thereby addressing the ambiguity frequently encountered in candidate evaluations. For candidate ranking, the DistilRoBERTa model was fine-tuned and integrated with the fuzzy TOPSIS method, achieving rankings closely aligned with human expert evaluations and attaining an accuracy of up to 91% for the Experience attribute and the Overall attribute. The study underlines the potential of NLP-driven frameworks to improve recruitment procedures by boosting scalability, consistency, and minimizing prejudice. Future endeavors will concentrate on augmenting the dataset, enhancing model interpretability, and verifying the system in actual recruitment scenarios to better evaluate its practical applicability. This research highlights the intriguing potential of merging NLP with fuzzy decision-making methods in personnel selection, enabling scalable and unbiased solutions to recruitment difficulties.

49. AI-Enabled Waste Classification as a Data-Driven Decision Support Tool for Circular Economy and Urban Sustainability

Authors: Julius Sechang Mboli , Omolara Aderonke Ogungbemi
URL: https://arxiv.org/abs/2601.22418
Abstract:

Efficient waste sorting is crucial for enabling circular-economy practices and resource recovery in smart cities. This paper evaluates both traditional machine-learning (Random Forest, SVM, AdaBoost) and deep-learning techniques including custom CNNs, VGG16, ResNet50, and three transfer-learning models (DenseNet121, EfficientNetB0, InceptionV3) for binary classification of 25 077 waste images (80/20 train/test split, augmented and resized to 150x150 px). The paper assesses the impact of Principal Component Analysis for dimensionality reduction on traditional models. DenseNet121 achieved the highest accuracy (91 %) and ROC-AUC (0.98), outperforming the best traditional classifier by 20 pp. Principal Component Analysis (PCA) showed negligible benefit for classical methods, whereas transfer learning substantially improved performance under limited-data conditions. Finally, we outline how these models integrate into a real-time Data-Driven Decision Support System for automated waste sorting, highlighting potential reductions in landfill use and lifecycle environmental impacts.)

50. Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems

Authors: Tony Feng , Trieu Trinh , Garrett Bingham , Jiwon Kang , Shengtong Zhang , Sang-hyun Kim , Kevin Barreto , Carl Schildkraut , Junehyuk Jung , Jaehyeon Seo , Carlo Pagano , Yuri Chervonyi , Dawsen Hwang , Kaiying Hou , Sergei Gukov , Cheng-Chiang Tsai , Hyunwoo Choi , Youngbeom Jin , Wei-Yuan Li , Hao-An Wu , Ruey-An Shiu , Yu-Sheng Shih , Quoc V. Le , Thang Luong
URL: https://arxiv.org/abs/2601.22401
Abstract:

We present a case study in semi-autonomous mathematics discovery, using Gemini to systematically evaluate 700 conjectures labeled ‘Open’ in Bloom’s Erdős Problems database. We employ a hybrid methodology: AI-driven natural language verification to narrow the search space, followed by human expert evaluation to gauge correctness and novelty. We address 13 problems that were marked ‘Open’ in the database: 5 through seemingly novel autonomous solutions, and 8 through identification of previous solutions in the existing literature. Our findings suggest that the ‘Open’ status of the problems was through obscurity rather than difficulty. We also identify and discuss issues arising in applying AI to math conjectures at scale, highlighting the difficulty of literature identification and the risk of ‘‘subconscious plagiarism’’ by AI. We reflect on the takeaways from AI-assisted efforts on the Erdős Problems.

51. Learning Provably Correct Distributed Protocols Without Human Knowledge

Authors: Yujie Hui , Xiaoyi Lu , Andrew Perrault , Yang Wang
URL: https://arxiv.org/abs/2601.22369
Abstract:

Provably correct distributed protocols, which are a critical component of modern distributed systems, are highly challenging to design and have often required decades of human effort. These protocols allow multiple agents to coordinate to come to a common agreement in an environment with uncertainty and failures. We formulate protocol design as a search problem over strategies in a game with imperfect information, and the desired correctness conditions are specified in Satisfiability Modulo Theories (SMT). However, standard methods for solving multi-agent games fail to learn correct protocols in this setting, even when the number of agents is small. We propose a learning framework, GGMS, which integrates a specialized variant of Monte Carlo Tree Search with a transformer-based action encoder, a global depth-first search to break out of local minima, and repeated feedback from a model checker. Protocols output by GGMS are verified correct via exhaustive model checking for all executions within the bounded setting. We further prove that, under mild assumptions, the search process is complete: if a correct protocol exists, GGMS will eventually find it. In experiments, we show that GGMS can learn correct protocols for larger settings than existing methods.

52. Sparks of Rationality: Do Reasoning LLMs Align with Human Judgment and Choice?

Authors: Ala N. Tak , Amin Banayeeanzade , Anahita Bolourani , Fatemeh Bahrani , Ashutosh Chaubey , Sai Praneeth Karimireddy , Norbert Schwarz , Jonathan Gratch
URL: https://arxiv.org/abs/2601.22329
Abstract:

Large Language Models (LLMs) are increasingly positioned as decision engines for hiring, healthcare, and economic judgment, yet real-world human judgment reflects a balance between rational deliberation and emotion-driven bias. If LLMs are to participate in high-stakes decisions or serve as models of human behavior, it is critical to assess whether they exhibit analogous patterns of (ir)rationalities and biases. To this end, we evaluate multiple LLM families on (i) benchmarks testing core axioms of rational choice and (ii) classic decision domains from behavioral economics and social norms where emotions are known to shape judgment and choice. Across settings, we show that deliberate “thinking” reliably improves rationality and pushes models toward expected-value maximization. To probe human-like affective distortions and their interaction with reasoning, we use two emotion-steering methods: in-context priming (ICP) and representation-level steering (RLS). ICP induces strong directional shifts that are often extreme and difficult to calibrate, whereas RLS produces more psychologically plausible patterns but with lower reliability. Our results suggest that the same mechanisms that improve rationality also amplify sensitivity to affective interventions, and that different steering methods trade off controllability against human-aligned behavior. Overall, this points to a tension between reasoning and affective steering, with implications for both human simulation and the safe deployment of LLM-based decision systems.

53. Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents

Authors: Zehong Wang , Fang Wu , Hongru Wang , Xiangru Tang , Bolian Li , Zhenfei Yin , Yijun Ma , Yiyang Li , Weixiang Sun , Xiusi Chen , Yanfang Ye
URL: https://arxiv.org/abs/2601.22311
Abstract:

Large language model (LLM)-based agents exhibit strong step-by-step reasoning capabilities over short horizons, yet often fail to sustain coherent behavior over long planning horizons. We argue that this failure reflects a fundamental mismatch: step-wise reasoning induces a form of step-wise greedy policy that is adequate for short horizons but fails in long-horizon planning, where early actions must account for delayed consequences. From this planning-centric perspective, we study LLM-based agents in deterministic, fully structured environments with explicit state transitions and evaluation signals. Our analysis reveals a core failure mode of reasoning-based policies: locally optimal choices induced by step-wise scoring lead to early myopic commitments that are systematically amplified over time and difficult to recover from. We introduce FLARE (Future-aware Lookahead with Reward Estimation) as a minimal instantiation of future-aware planning to enforce explicit lookahead, value propagation, and limited commitment in a single model, allowing downstream outcomes to influence early decisions. Across multiple benchmarks, agent frameworks, and LLM backbones, FLARE consistently improves task performance and planning-level behavior, frequently allowing LLaMA-8B with FLARE to outperform GPT-4o with standard step-by-step reasoning. These results establish a clear distinction between reasoning and planning.

54. The Six Sigma Agent: Achieving Enterprise-Grade Reliability in LLM Systems Through Consensus-Driven Decomposed Execution

Authors: Khush Patel , Siva Surendira , Jithin George , Shreyas Kapale
URL: https://arxiv.org/abs/2601.22290
Abstract:

Large Language Models demonstrate remarkable capabilities yet remain fundamentally probabilistic, presenting critical reliability challenges for enterprise deployment. We introduce the Six Sigma Agent, a novel architecture that achieves enterprise-grade reliability through three synergistic components: (1) task decomposition into a dependency tree of atomic actions; (2) micro-agent sampling where each task is executed n times in parallel across diverse LLMs to generate independent outputs; and (3) consensus voting with dynamic scaling, clustering outputs and selecting the answer from the winning cluster with maximum votes. We prove that sampling n independent outputs with error rate p achieves system error O(p^{ceil(n/2)}), enabling exponential reliability gains. Even using cheaper models with 5% per-action error, consensus voting with 5 agents reduces error to 0.11%; dynamic scaling to 13 agents achieves 3.4 DPMO (Defects Per Million Opportunities), the Six Sigma standard. Evaluation across three enterprise use cases demonstrates a 14,700x reliability improvement over single-agent execution while reducing costs by 80%. Our work establishes that reliability in AI systems emerges from principled redundancy and consensus rather than model scaling alone.

55. JAF: Judge Agent Forest

Authors: Sahil Garg , Brad Cheezum , Sridhar Dutta , Vishal Agarwal
URL: https://arxiv.org/abs/2601.22269
Abstract:

Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query–response pairs generated by a primary agent, rather than evaluating each in isolation. This paradigm elevates the judge from a local evaluator to a holistic learner: by simultaneously assessing related responses, the judge discerns cross-instance patterns and inconsistencies, whose aggregate feedback enables the primary agent to improve by viewing its own outputs through the judge’s collective perspective. Conceptually, JAF bridges belief propagation and ensemble-learning principles: overlapping in-context neighborhoods induce a knowledge-graph structure that facilitates propagation of critique, and repeated, randomized evaluations yield a robust ensemble of context-sensitive judgments. JAF can be instantiated entirely via ICL, with the judge prompted for each query using its associated primary-agent response plus a small, possibly noisy set of peer exemplars. While kNN in embedding space is a natural starting point for exemplars, this approach overlooks categorical structure, domain metadata, or nuanced distinctions accessible to modern LLMs. To overcome these limitations, we develop a flexible locality-sensitive hashing (LSH) algorithm that learns informative binary codes by integrating semantic embeddings, LLM-driven hash predicates, supervision from categorical labels, and relevant side information. These hash codes support efficient, interpretable, and relation-aware selection of diverse exemplars, and further optimize exploration of CoT reasoning paths. We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments.

56. VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Authors: Hongyang Du , Junjie Ye , Xiaoyan Cong , Runhao Li , Jingcheng Ni , Aman Agarwal , Zeqi Zhou , Zekun Li , Randall Balestriero , Yue Wang
URL: https://arxiv.org/abs/2601.23286
Abstract:

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

57. End-to-end Optimization of Belief and Policy Learning in Shared Autonomy Paradigms

Authors: MH Farhadi , Ali Rabiee , Sima Ghafoori , Anna Cetera , Andrew Fisher , Reza Abiri
URL: https://arxiv.org/abs/2601.23285
Abstract:

Shared autonomy systems require principled methods for inferring user intent and determining appropriate assistance levels. This is a central challenge in human-robot interaction, where systems must be successful while being mindful of user agency. Previous approaches relied on static blending ratios or separated goal inference from assistance arbitration, leading to suboptimal performance in unstructured environments. We introduce BRACE (Bayesian Reinforcement Assistance with Context Encoding), a novel framework that fine-tunes Bayesian intent inference and context-adaptive assistance through an architecture enabling end-to-end gradient flow between intent inference and assistance arbitration. Our pipeline conditions collaborative control policies on environmental context and complete goal probability distributions. We provide analysis showing (1) optimal assistance levels should decrease with goal uncertainty and increase with environmental constraint severity, and (2) integrating belief information into policy learning yields a quadratic expected regret advantage over sequential approaches. We validated our algorithm against SOTA methods (IDA, DQN) using a three-part evaluation progressively isolating distinct challenges of end-effector control: (1) core human-interaction dynamics in a 2D human-in-the-loop cursor task, (2) non-linear dynamics of a robotic arm, and (3) integrated manipulation under goal ambiguity and environmental constraints. We demonstrate improvements over SOTA, achieving 6.3% higher success rates and 41% increased path efficiency, and 36.3% success rate and 87% path efficiency improvement over unassisted control. Our results confirmed that integrated optimization is most beneficial in complex, goal-ambiguous scenarios, and is generalizable across robotic domains requiring goal-directed assistance, advancing the SOTA for adaptive shared autonomy.

58. IRL-DAL: Safe and Adaptive Trajectory Planning for Autonomous Driving via Energy-Guided Diffusion Models

Authors: Seyed Ahmad Hosseini Miangoleh , Amin Jalal Aghdasian , Farzaneh Abdollahi
URL: https://arxiv.org/abs/2601.23266
Abstract:

This paper proposes a novel inverse reinforcement learning framework using a diffusion-based adaptive lookahead planner (IRL-DAL) for autonomous vehicles. Training begins with imitation from an expert finite state machine (FSM) controller to provide a stable initialization. Environment terms are combined with an IRL discriminator signal to align with expert goals. Reinforcement learning (RL) is then performed with a hybrid reward that combines diffuse environmental feedback and targeted IRL rewards. A conditional diffusion model, which acts as a safety supervisor, plans safe paths. It stays in its lane, avoids obstacles, and moves smoothly. Then, a learnable adaptive mask (LAM) improves perception. It shifts visual attention based on vehicle speed and nearby hazards. After FSM-based imitation, the policy is fine-tuned with Proximal Policy Optimization (PPO). Training is run in the Webots simulator with a two-stage curriculum. A 96\% success rate is reached, and collisions are reduced to 0.05 per 1k steps, marking a new benchmark for safe navigation. By applying the proposed approach, the agent not only drives in lane but also handles unsafe conditions at an expert level, increasing this http URL make our code publicly available.

59. TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training

Authors: Ruijie Zhang , Yequan Zhao , Ziyue Liu , Zhengyang Wang , Dongyang Li , Yupeng Su , Sijia Liu , Zheng Zhang
URL: https://arxiv.org/abs/2601.23261
Abstract:

The Muon optimizer has demonstrated strong empirical performance in pre-training large language models by performing matrix-level gradient (or momentum) orthogonalization in each layer independently. In this work, we propose TEON, a principled generalization of Muon that extends orthogonalization beyond individual layers by modeling the gradients of a neural network as a structured higher-order tensor. We present TEON’s improved convergence guarantee over layer-wise Muon, and further develop a practical instantiation of TEON based on the theoretical analysis with corresponding ablation. We evaluate our approach on two widely adopted architectures: GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, ranging from 60M to 1B parameters. Experimental results show that TEON consistently improves training and validation perplexity across model scales and exhibits strong robustness under various approximate SVD schemes.

60. Agnostic Language Identification and Generation

Authors: Mikael Møller Høgsgaard , Chirag Pabbaraju
URL: https://arxiv.org/abs/2601.23258
Abstract:

Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general “agnostic” setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.

61. Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Authors: Ye Yu , Haibo Jin , Yaoning Yu , Jun Zhuang , Haohan Wang
URL: https://arxiv.org/abs/2601.23255
Abstract:

Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.

62. YuriiFormer: A Suite of Nesterov-Accelerated Transformers

Authors: Aleksandr Zimin , Yury Polyanskiy , Philippe Rigollet
URL: https://arxiv.org/abs/2601.23236
Abstract:

We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie–Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.

63. ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Authors: Tao Yu , Haopeng Jin , Hao Wang , Shenghua Chai , Yujia Yang , Junhao Gong , Jiaming Guo , Minghui Zhang , Xinlong Chen , Zhenghao Zhang , Yuxuan Zhou , Yanpei Gong , YuanCheng Liu , Yiming Ding , Kangwei Zeng , Pengfei Yang , Zhongtian Luo , Yufei Xiong , Shanbin Zhang , Shaoxiong Cheng , Huang Ruilin , Li Shuo , Yuxi Niu , Xinyuan Zhang , Yueya Xu , Jie Mao , Ruixuan Ji , Yaru Zhao , Mingchen Zhang , Jiabing Yang , Jiaqi Liu , YiFan Zhang , Hongzhu Yi , Xinming Wang , Cheng Zhong , Xiao Ma , Zhang Zhang , Yan Huang , Liang Wang
URL: https://arxiv.org/abs/2601.23232
Abstract:

In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.

64. Agile Reinforcement Learning through Separable Neural Architecture

Authors: Rajib Mostakim , Reza T. Batley , Sourav Saha
URL: https://arxiv.org/abs/2601.23225
Abstract:

Deep reinforcement learning (RL) is increasingly deployed in resource-constrained environments, yet the go-to function approximators - multilayer perceptrons (MLPs) - are often parameter-inefficient due to an imperfect inductive bias for the smooth structure of many value functions. This mismatch can also hinder sample efficiency and slow policy learning in this capacity-limited regime. Although model compression techniques exist, they operate post-hoc and do not improve learning efficiency. Recent spline-based separable architectures - such as Kolmogorov-Arnold Networks (KANs) - have been shown to offer parameter efficiency but are widely reported to exhibit significant computational overhead, especially at scale. In seeking to address these limitations, this work introduces SPAN (SPline-based Adaptive Networks), a novel function approximation approach to RL. SPAN adapts the low rank KHRONOS framework by integrating a learnable preprocessing layer with a separable tensor product B-spline basis. SPAN is evaluated across discrete (PPO) and high-dimensional continuous (SAC) control tasks, as well as offline settings (Minari/D4RL). Empirical results demonstrate that SPAN achieves a 30-50% improvement in sample efficiency and 1.3-9 times higher success rates across benchmarks compared to MLP baselines. Furthermore, SPAN demonstrates superior anytime performance and robustness to hyperparameter variations, suggesting it as a viable, high performance alternative for learning intrinsically efficient policies in resource-limited settings.

65. Med-Scout: Curing MLLMs’ Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Authors: Anglin Liu , Ruichao Chen , Yi Lu , Hongxia Xu , Jintai Chen
URL: https://arxiv.org/abs/2601.23220
Abstract:

Despite recent Multimodal Large Language Models (MLLMs)’ linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that “cures” this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.

66. MonoScale: Scaling Multi-Agent System with Monotonic Improvement

Authors: Shuai Shao , Yixiang Liu , Bingwei Lu , Weinan Zhang
URL: https://arxiv.org/abs/2601.23219
Abstract:

In recent years, LLM-based multi-agent systems (MAS) have advanced rapidly, using a router to decompose tasks and delegate subtasks to specialized agents. A natural way to expand capability is to scale up the agent pool by continually integrating new functional agents or tool interfaces, but naive expansion can trigger performance collapse when the router cold-starts on newly added, heterogeneous, and unreliable agents. We propose MonoScale, an expansion-aware update framework that proactively generates a small set of agent-conditioned familiarization tasks, harvests evidence from both successful and failed interactions, and distills it into auditable natural-language memory to guide future routing. We formalize sequential augmentation as a contextual bandit and perform trust-region memory updates, yielding a monotonic non-decreasing performance guarantee across onboarding rounds. Experiments on GAIA and Humanity’s Last Exam show stable gains as the agent pool grows, outperforming naive scale-up and strong-router fixed-pool baselines.

67. Disentangling multispecific antibody function with graph neural networks

Authors: Joshua Southern , Changpeng Lu , Santrupti Nerli , Samuel D. Stanton , Andrew M. Watkins , Franziska Seeger , Frédéric A. Dreyer
URL: https://arxiv.org/abs/2601.23212
Abstract:

Multispecific antibodies offer transformative therapeutic potential by engaging multiple epitopes simultaneously, yet their efficacy is an emergent property governed by complex molecular architectures. Rational design is often bottlenecked by the inability to predict how subtle changes in domain topology influence functional outcomes, a challenge exacerbated by the scarcity of comprehensive experimental data. Here, we introduce a computational framework to address part of this gap. First, we present a generative method for creating large-scale, realistic synthetic functional landscapes that capture non-linear interactions where biological activity depends on domain connectivity. Second, we propose a graph neural network architecture that explicitly encodes these topological constraints, distinguishing between format configurations that appear identical to sequence-only models. We demonstrate that this model, trained on synthetic landscapes, recapitulates complex functional properties and, via transfer learning, has the potential to achieve high predictive accuracy on limited biological datasets. We showcase the model’s utility by optimizing trade-offs between efficacy and toxicity in trispecific T-cell engagers and retrieving optimal common light chains. This work provides a robust benchmarking environment for disentangling the combinatorial complexity of multispecifics, accelerating the design of next-generation therapeutics.

68. Learning to Execute Graph Algorithms Exactly with Graph Neural Networks

Authors: Muhammad Fetrat Qharabagh , Artur Back de Luca , George Giapitzakis , Kimon Fountoulakis
URL: https://arxiv.org/abs/2601.23207
Abstract:

Understanding what graph neural networks can learn, especially their ability to learn to execute algorithms, remains a central theoretical challenge. In this work, we prove exact learnability results for graph algorithms under bounded-degree and finite-precision constraints. Our approach follows a two-step process. First, we train an ensemble of multi-layer perceptrons (MLPs) to execute the local instructions of a single node. Second, during inference, we use the trained MLP ensemble as the update function within a graph neural network (GNN). Leveraging Neural Tangent Kernel (NTK) theory, we show that local instructions can be learned from a small training set, enabling the complete graph algorithm to be executed during inference without error and with high probability. To illustrate the learning power of our setting, we establish a rigorous learnability result for the LOCAL model of distributed computation. We further demonstrate positive learnability results for widely studied algorithms such as message flooding, breadth-first and depth-first search, and Bellman-Ford.

69. Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

Authors: Luca Della Libera , Cem Subakan , Mirco Ravanelli
URL: https://arxiv.org/abs/2601.23174
Abstract:

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs.

70. Probing the Trajectories of Reasoning Traces in Large Language Models

Authors: Marthe Ballon , Brecht Verbeken , Vincent Ginis , Andres Algaba
URL: https://arxiv.org/abs/2601.23163
Abstract:

Large language models (LLMs) increasingly solve difficult problems by producing “reasoning traces” before emitting a final response. However, it remains unclear how accuracy and decision commitment evolve along a reasoning trajectory, and whether intermediate trace segments provide answer-relevant information beyond generic length or stylistic effects. Here, we propose a protocol to systematically probe the trajectories of reasoning traces in LLMs by 1) generating a model’s reasoning trace, 2) truncating it at fixed token-percentiles, and 3) injecting each partial trace back into the model (or a different model) to measure the induced distribution over answer choices via next-token probabilities. We apply this protocol to the open-source Qwen3-4B/-8B/-14B and gpt-oss-20b/-120b models across the multiple-choice GPQA Diamond and MMLU-Pro benchmarks. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. These gains are primarily driven by relevant content in the model generation rather than context length or generic “reasoning style” effects. Stronger models often backtrack successfully from incorrect partial traces, but immediate answers often remain anchored in the weaker model’s incorrect response. More broadly, we show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models as the measurements can inform practical trace-handling and monitoring policies that improve reliability without assuming intermediate tokens are inherently faithful explanations.

71. SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training

Authors: Powei Chang , Jinpeng Zhang , Bowen Chen , Chenyu Wang , Chenlu Guo , Yixing Zhang , Yukang Gao , JianXiang Xiang , Yue Gao , Chaoqun Sun , Yiyi Chen , Dongying Kong
URL: https://arxiv.org/abs/2601.23155
Abstract:

Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, enabling greedy algorithms to achieve a $(1-1/e)$ approximation under a cardinality budget. In practice, however, we identify alleviating gradient conflicts, misalignment between per-sample gradients, is a key factor that slows down the decay of marginal log-determinant information gains, thereby preventing significant loss of information. We formalize this via an $\varepsilon$-decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data, yet matches or exceeds 6 methods including full-data tuning. This achieves performance improvements with substantially lower training cost.

72. On Safer Reinforcement Learning Policies for Sedation and Analgesia in Intensive Care

Authors: Joel Romero-Hernandez , Oscar Camara
URL: https://arxiv.org/abs/2601.23154
Abstract:

Pain management in intensive care usually involves complex trade-offs between therapeutic goals and patient safety, since both inadequate and excessive treatment may induce serious sequelae. Reinforcement learning can help address this challenge by learning medication dosing policies from retrospective data. However, prior work on sedation and analgesia has optimized for objectives that do not value patient survival while relying on algorithms unsuitable for imperfect information settings. We investigated the risks of these design choices by implementing a deep reinforcement learning framework to suggest hourly medication doses under partial observability. Using data from 47,144 ICU stays in the MIMIC-IV database, we trained policies to prescribe opioids, propofol, benzodiazepines, and dexmedetomidine according to two goals: reduce pain or jointly reduce pain and mortality. We found that, although the two policies were associated with lower pain, actions from the first policy were positively correlated with mortality, while those proposed by the second policy were negatively correlated. This suggests that valuing long-term outcomes could be critical for safer treatment policies, even if a short-term goal remains the primary objective.

73. Securing Time in Energy IoT: A Clock-Dynamics-Aware Spatio-Temporal Graph Attention Network for Clock Drift Attacks and Y2K38 Failures

Authors: Saeid Jamshidi , Omar Abdul Wahab , Rolando Herrero , Foutse Khomh
URL: https://arxiv.org/abs/2601.23147
Abstract:

The integrity of time in distributed Internet of Things (IoT) devices is crucial for reliable operation in energy cyber-physical systems, such as smart grids and microgrids. However, IoT systems are vulnerable to clock drift, time-synchronization manipulation, and timestamp discontinuities, such as the Year 2038 (Y2K38) Unix overflow, all of which disrupt temporal ordering. Conventional anomaly-detection models, which assume reliable timestamps, fail to capture temporal inconsistencies. This paper introduces STGAT (Spatio-Temporal Graph Attention Network), a framework that models both temporal distortion and inter-device consistency in energy IoT systems. STGAT combines drift-aware temporal embeddings and temporal self-attention to capture corrupted time evolution at individual devices, and uses graph attention to model spatial propagation of timing errors. A curvature-regularized latent representation geometrically separates normal clock evolution from anomalies caused by drift, synchronization offsets, and overflow events. Experimental results on energy IoT telemetry with controlled timing perturbations show that STGAT achieves 95.7% accuracy, outperforming recurrent, transformer, and graph-based baselines with significant improvements (d > 1.8, p < 0.001). Additionally, STGAT reduces detection delay by 26%, achieving a 2.3-time-step delay while maintaining stable performance under overflow, drift, and physical inconsistencies.

74. Machine Learning for Energy-Performance-aware Scheduling

Authors: Zheyuan Hu , Yifei Shi
URL: https://arxiv.org/abs/2601.23134
Abstract:

In the post-Dennard era, optimizing embedded systems requires navigating complex trade-offs between energy efficiency and latency. Traditional heuristic tuning is often inefficient in such high-dimensional, non-smooth landscapes. In this work, we propose a Bayesian Optimization framework using Gaussian Processes to automate the search for optimal scheduling configurations on heterogeneous multi-core architectures. We explicitly address the multi-objective nature of the problem by approximating the Pareto Frontier between energy and time. Furthermore, by incorporating Sensitivity Analysis (fANOVA) and comparing different covariance kernels (e.g., Matérn vs. RBF), we provide physical interpretability to the black-box model, revealing the dominant hardware parameters driving system performance.

75. Secure Tool Manifest and Digital Signing Solution for Verifiable MCP and LLM Pipelines

Authors: Saeid Jamshidi , Kawser Wazed Nafi , Arghavan Moradi Dakhel , Foutse Khomh , Amin Nikanjam , Mohammad Adnan Hamdaqa
URL: https://arxiv.org/abs/2601.23132
Abstract:

Large Language Models (LLMs) are increasingly adopted in sensitive domains such as healthcare and financial institutions’ data analytics; however, their execution pipelines remain vulnerable to manipulation and unverifiable behavior. Existing control mechanisms, such as the Model Context Protocol (MCP), define compliance policies for tool invocation but lack verifiable enforcement and transparent validation of model actions. To address this gap, we propose a novel Secure Tool Manifest and Digital Signing Framework, a structured and security-aware extension of Model Context Protocols. The framework enforces cryptographically signed manifests, integrates transparent verification logs, and isolates model-internal execution metadata from user-visible components to ensure verifiable execution integrity. Furthermore, the evaluation demonstrates that the framework scales nearly linearly (R-squared = 0.998), achieves near-perfect acceptance of valid executions while consistently rejecting invalid ones, and maintains balanced model utilization across execution pipelines.

76. Regularisation in neural networks: a survey and empirical analysis of approaches

Authors: Christiaan P. Opperman , Anna S. Bosman , Katherine M. Malan
URL: https://arxiv.org/abs/2601.23131
Abstract:

Despite huge successes on a wide range of tasks, neural networks are known to sometimes struggle to generalise to unseen data. Many approaches have been proposed over the years to promote the generalisation ability of neural networks, collectively known as regularisation techniques. These are used as common practice under the assumption that any regularisation added to the pipeline would result in a performance improvement. In this study, we investigate whether this assumption holds in practice. First, we provide a broad review of regularisation techniques, including modern theories such as double descent. We propose a taxonomy of methods under four broad categories, namely: (1) data-based strategies, (2) architecture strategies, (3) training strategies, and (4) loss function strategies. Notably, we highlight the contradictions and correspondences between the approaches in these broad classes. Further, we perform an empirical comparison of the various regularisation techniques on classification tasks for ten numerical and image datasets applied to the multi-layer perceptron and convolutional neural network architectures. Results show that the efficacy of regularisation is dataset-dependent. For example, the use of a regularisation term only improved performance on numeric datasets, whereas batch normalisation improved performance on image datasets only. Generalisation is crucial to machine learning; thus, understanding the effects of applying regularisation techniques, and considering the connections between them is essential to the appropriate use of these methods in practice.

77. To See Far, Look Close: Evolutionary Forecasting for Long-term Time Series

Authors: Jiaming Ma , Siyuan Mu , Ruilin Tang , Haofeng Ma , Qihe Huang , Zhengyang Zhou , Pengkun Wang , Binwu Wang , Yang Wang
URL: https://arxiv.org/abs/2601.23114
Abstract:

The prevailing Direct Forecasting (DF) paradigm dominates Long-term Time Series Forecasting (LTSF) by forcing models to predict the entire future horizon in a single forward pass. While efficient, this rigid coupling of output and evaluation horizons necessitates computationally prohibitive re-training for every target horizon. In this work, we uncover a counter-intuitive optimization anomaly: models trained on short horizons-when coupled with our proposed Evolutionary Forecasting (EF) paradigm-significantly outperform those trained directly on long horizons. We attribute this success to the mitigation of a fundamental optimization pathology inherent in DF, where conflicting gradients from distant futures cripple the learning of local dynamics. We establish EF as a unified generative framework, proving that DF is merely a degenerate special case of EF. Extensive experiments demonstrate that a singular EF model surpasses task-specific DF ensembles across standard benchmarks and exhibits robust asymptotic stability in extreme extrapolation. This work propels a paradigm shift in LTSF: moving from passive Static Mapping to autonomous Evolutionary Reasoning.

78. WiFiPenTester: Advancing Wireless Ethical Hacking with Governed GenAI

Authors: Haitham S. Al-Sinani , Chris J. Mitchell
URL: https://arxiv.org/abs/2601.23092
Abstract:

Wireless ethical hacking relies heavily on skilled practitioners manually interpreting reconnaissance results and executing complex, time-sensitive sequences of commands to identify vulnerable targets, capture authentication handshakes, and assess password resilience; a process that is inherently labour-intensive, difficult to scale, and prone to subjective judgement and human error. To help address these limitations, we propose WiFiPenTester, an experimental, governed, and reproducible system for GenAI-enabled wireless ethical hacking. The system integrates large language models into the reconnaissance and decision-support phases of wireless security assessment, enabling intelligent target ranking, attack feasibility estimation, and strategy recommendation, while preserving strict human-in-the-loop control and budget-aware execution. We describe the system architecture, threat model, governance mechanisms, and prompt-engineering methodology, and empirical experiments conducted across multiple wireless environments. The results demonstrate that GenAI assistance improves target selection accuracy and overall assessment efficiency, while maintaining auditability and ethical safeguards. This indicates that WiFiPenTester is a meaningful step toward practical, safe, and scalable GenAI-assisted wireless penetration testing, while reinforcing the necessity of bounded autonomy, human oversight, and rigorous governance mechanisms when deploying GenAI in ethical hacking.

79. From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Authors: Zhixiang Zhang , Zesen Liu , Yuchong Xie , Quanfeng Huang , Dongdong She
URL: https://arxiv.org/abs/2601.23088
Abstract:

Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes latency and redundant computation for semantically similar queries. In this work, we conceptualize semantic cache keys as a form of fuzzy hashes. We demonstrate that the locality required to maximize cache hit rates fundamentally conflicts with the cryptographic avalanche effect necessary for collision resistance. Our conceptual analysis formalizes this inherent trade-off between performance (locality) and security (collision resilience), revealing that semantic caching is naturally vulnerable to key collision attacks. While prior research has focused on side-channel and privacy risks, we present the first systematic study of integrity risks arising from cache collisions. We introduce CacheAttack, an automated framework for launching black-box collision attacks. We evaluate CacheAttack in security-critical tasks and agentic workflows. It achieves a hit rate of 86\% in LLM response hijacking and can induce malicious behaviors in LLM agent, while preserving strong transferability across different embedding models. A case study on a financial agent further illustrates the real-world impact of these vulnerabilities. Finally, we discuss mitigation strategies.

80. OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning

Authors: Mohanna Hoveyda , Jelle Piepenbrock , Arjen P de Vries , Maarten de Rijke , Faegheh Hasibi
URL: https://arxiv.org/abs/2601.23085
Abstract:

Resolving complex information needs that come with multiple constraints should consider enforcing the logical operators encoded in the query (i.e., conjunction, disjunction, negation) on the candidate answer set. Current retrieval systems either ignore these constraints in neural embeddings or approximate them in a generative reasoning process that can be inconsistent and unreliable. Although well-suited to structured reasoning, existing neuro-symbolic approaches remain confined to formal logic or mathematics problems as they often assume unambiguous queries and access to complete evidence, conditions rarely met in information retrieval. To bridge this gap, we introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning: a large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction. We evaluate OrLog across multiple backbone LLMs, varying levels of access to external knowledge, and a range of logical constraints, and compare it against base retrievers and LLM-as-reasoner methods. Provided with entity descriptions, OrLog can significantly boost top-rank precision compared to LLM reasoning with larger gains on disjunctive queries. OrLog is also more efficient, cutting mean tokens by $\sim$90\% per query-entity pair. These results demonstrate that generation-free predicate plausibility estimation combined with probabilistic reasoning enables constraint-aware retrieval that outperforms monolithic reasoning while using far fewer tokens.

81. Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Authors: Yanghao Su , Wenbo Zhou , Tianwei Zhang , Qiu Han , Weiming Zhang , Nenghai Yu , Jie Zhang
URL: https://arxiv.org/abs/2601.23081
Abstract:

Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.

82. ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations

Authors: Joao Fonseca , Julia Stoyanovich
URL: https://arxiv.org/abs/2601.23068
Abstract:

Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require direct access to the underlying model, an assumption frequently violated in real-world deployments. Further, even when model access is possible, their exact computation may be prohibitively expensive. We investigate whether meaningful Shapley value estimations can be obtained in a zero-shot setting, using only the input data distribution and no evaluations of the target model. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN that is pretrained on synthetic datasets generated from random structural causal models and supervised using exact or near-exact Shapley values. Once trained, ExplainerPFN predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are fourfold: (1) we show that few-shot learning-based explanations can achieve high fidelity to SHAP values with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method for estimating Shapley values without access to the underlying model or reference explanations; (3) we provide an open-source implementation of ExplainerPFN, including the full training pipeline and synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN achieves performance competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.

83. Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

Authors: Xiaoxuan Guo , Yuankun Xie , Haonan Cheng , Jiayi Zhou , Jian Liu , Hengyan Huang , Long Ye , Qin Zhang
URL: https://arxiv.org/abs/2601.23066
Abstract:

Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decisionmaking process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.

84. HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation

Authors: Hari Krishna Gadi , Daniel Matos , Hongyi Luo , Lu Liu , Yongliang Wang , Yanfeng Zhang , Liqiu Meng
URL: https://arxiv.org/abs/2601.23064
Abstract:

Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on either large-scale retrieval, which requires storing a large number of image embeddings, grid-based classifiers that ignore geographic continuity, or generative models that diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in Hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning by directly incorporating haversine distance into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, on which our method establishes a new state-of-the-art performance. Compared to the current methods in the literature, it reduces mean geodesic error by 19.5\%, while improving the fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.

85. On the Impact of Code Comments for Automated Bug-Fixing: An Empirical Study

Authors: Antonio Vitale , Emanuela Guglielmi , Simone Scalabrino , Rocco Oliveto
URL: https://arxiv.org/abs/2601.23059
Abstract:

Large Language Models (LLMs) are increasingly relevant in Software Engineering research and practice, with Automated Bug Fixing (ABF) being one of their key applications. ABF involves transforming a buggy method into its fixed equivalent. A common preprocessing step in ABF involves removing comments from code prior to training. However, we hypothesize that comments may play a critical role in fixing certain types of bugs by providing valuable design and implementation insights. In this study, we investigate how the presence or absence of comments, both during training and at inference time, impacts the bug-fixing capabilities of LLMs. We conduct an empirical evaluation comparing two model families, each evaluated under all combinations of training and inference conditions (with and without comments), and thereby revisiting the common practice of removing comments during training. To address the limited availability of comments in state-of-the-art datasets, we use an LLM to automatically generate comments for methods lacking them. Our findings show that comments improve ABF accuracy by up to threefold when present in both phases, while training with comments does not degrade performance when instances lack them. Additionally, an interpretability analysis identifies that comments detailing method implementation are particularly effective in aiding LLMs to fix bugs accurately.

86. Adaptive Edge Learning for Density-Aware Graph Generation

Authors: Seyedeh Ava Razi Razavi , James Sargant , Sheridan Houghten , Renata Dividino
URL: https://arxiv.org/abs/2601.23052
Abstract:

Generating realistic graph-structured data is challenging due to discrete structures, variable sizes, and class-specific connectivity patterns that resist conventional generative modelling. While recent graph generation methods employ generative adversarial network (GAN) frameworks to handle permutation invariance and irregular topologies, they typically rely on random edge sampling with fixed probabilities, limiting their capacity to capture complex structural dependencies between nodes. We propose a density-aware conditional graph generation framework using Wasserstein GANs (WGAN) that replaces random sampling with a learnable distance-based edge predictor. Our approach embeds nodes into a latent space where proximity correlates with edge likelihood, enabling the generator to learn meaningful connectivity patterns. A differentiable edge predictor determines pairwise relationships directly from node embeddings, while a density-aware selection mechanism adaptively controls edge density to match class-specific sparsity distributions observed in real graphs. We train the model using a WGAN with gradient penalty, employing a GCN-based critic to ensure generated graphs exhibit realistic topology and align with target class distributions. Experiments on benchmark datasets demonstrate that our method produces graphs with superior structural coherence and class-consistent connectivity compared to existing baselines. The learned edge predictor captures complex relational patterns beyond simple heuristics, generating graphs whose density and topology closely match real structural distributions. Our results show improved training stability and controllable synthesis, making the framework effective for realistic graph generation and data augmentation. Source code is publicly available at this https URL .

87. Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference

Authors: Yizhi Liu
URL: https://arxiv.org/abs/2601.23039
Abstract:

Differentiable matching layers, often implemented via entropy-regularized Optimal Transport, serve as a critical approximate inference mechanism in structural prediction. However, recovering discrete permutations via annealing $\epsilon \to 0$ is notoriously unstable. We identify a fundamental mechanism for this failure: \textbf{Premature Mode Collapse}. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map, we reveal a theoretical \textbf{thermodynamic speed limit}. Under standard exponential cooling, the shift in the target posterior ($O(1)$) outpaces the contraction rate of the inference operator, which degrades as $O(1/\epsilon)$. This mismatch inevitably forces the inference trajectory into spurious local basins. To address this, we propose \textbf{Efficient PH-ASC}, an adaptive scheduling algorithm that monitors the stability of the inference process. By enforcing a linear stability law, we decouple expensive spectral diagnostics from the training loop, reducing overhead from $O(N^3)$ to amortized $O(1)$. Our implementation and interactive demo are available at this https URL and this https URL . bounded away from zero in generic training dynamics unless the feature extractor converges unrealistically fast.

88. Leveraging Convolutional Sparse Autoencoders for Robust Movement Classification from Low-Density sEMG

Authors: Blagoj Hristov , Zoran Hadzi-Velkov , Katerina Hadzi-Velkova Saneva , Gorjan Nadzinski , Vesna Ojleska Latkoska
URL: https://arxiv.org/abs/2601.23011
Abstract:

Reliable control of myoelectric prostheses is often hindered by high inter-subject variability and the clinical impracticality of high-density sensor arrays. This study proposes a deep learning framework for accurate gesture recognition using only two surface electromyography (sEMG) channels. The method employs a Convolutional Sparse Autoencoder (CSAE) to extract temporal feature representations directly from raw signals, eliminating the need for heuristic feature engineering. On a 6-class gesture set, our model achieved a multi-subject F1-score of 94.3% $\pm$ 0.3%. To address subject-specific differences, we present a few-shot transfer learning protocol that improved performance on unseen subjects from a baseline of 35.1% $\pm$ 3.1% to 92.3% $\pm$ 0.9% with minimal calibration data. Furthermore, the system supports functional extensibility through an incremental learning strategy, allowing for expansion to a 10-class set with a 90.0% $\pm$ 0.2% F1-score without full model retraining. By combining high precision with minimal computational and sensor overhead, this framework provides a scalable and efficient approach for the next generation of affordable and adaptive prosthetic systems.

89. Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning

Authors: Xinchen Han , Qiuyang Fang , Hossam Afifi , Michel Marot
URL: https://arxiv.org/abs/2601.23010
Abstract:

Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength critically shape performance. However, most existing methods commit to a single constraint family: weighted behavior cloning, density regularization, or support constraints, without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal–dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.

90. Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs

Authors: Afrozah Nadeem , Agrima , Mehwish Nasim , Usman Naseem
URL: https://arxiv.org/abs/2601.23001
Abstract:

Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross lingual consistency, with the adaptive mechanism prevents over correction and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.

91. Mano: Restriking Manifold Optimization for LLM Training

Authors: Yufei Gu , Zeke Xie
URL: https://arxiv.org/abs/2601.23000
Abstract:

While large language models (LLMs) have emerged as a significant advancement in artificial intelligence, the hardware and computational costs for training LLMs are also significantly burdensome. Among the state-of-the-art optimizers, AdamW relies on diagonal curvature estimates and ignores structural properties, while Muon applies global spectral normalization at the expense of losing curvature information. In this study, we restriked manifold optimization methods for training LLMs, which may address both optimizers’ limitations, while conventional manifold optimization methods have been largely overlooked due to the poor performance in large-scale model optimization. By innovatively projecting the momentum onto the tangent space of model parameters and constraining it on a rotational Oblique manifold, we propose a novel, powerful, and efficient optimizer Mano that is the first to bridge the performance gap between manifold optimization and modern optimizers. Extensive experiments on the LLaMA and Qwen3 models demonstrate that Mano consistently and significantly outperforms AdamW and Muon even with less memory consumption and computational complexity, respectively, suggesting an expanded Pareto frontier in terms of space and time efficiency.

92. Self-Supervised Slice-to-Volume Reconstruction with Gaussian Representations for Fetal MRI

Authors: Yinsong Wang , Thomas Fletcher , Xinzhe Luo , Aine Travers Dineen , Rhodri Cusack , Chen Qin
URL: https://arxiv.org/abs/2601.22990
Abstract:

Reconstructing 3D fetal MR volumes from motion-corrupted stacks of 2D slices is a crucial and challenging task. Conventional slice-to-volume reconstruction (SVR) methods are time-consuming and require multiple orthogonal stacks for reconstruction. While learning-based SVR approaches have significantly reduced the time required at the inference stage, they heavily rely on ground truth information for training, which is inaccessible in practice. To address these challenges, we propose GaussianSVR, a self-supervised framework for slice-to-volume reconstruction. GaussianSVR represents the target volume using 3D Gaussian representations to achieve high-fidelity reconstruction. It leverages a simulated forward slice acquisition model to enable self-supervised training, alleviating the need for ground-truth volumes. Furthermore, to enhance both accuracy and efficiency, we introduce a multi-resolution training strategy that jointly optimizes Gaussian parameters and spatial transformations across different resolution levels. Experiments show that GaussianSVR outperforms the baseline methods on fetal MR volumetric reconstruction. Code will be available upon acceptance.

93. About an Automating Annotation Method for Robot Markers

Authors: Wataru Uemura , Takeru Nagashima
URL: https://arxiv.org/abs/2601.22982
Abstract:

Factory automation has become increasingly important due to labor shortages, leading to the introduction of autonomous mobile robots for tasks such as material transportation. Markers are commonly used for robot self-localization and object identification. In the RoboCup Logistics League (RCLL), ArUco markers are employed both for robot localization and for identifying processing modules. Conventional recognition relies on OpenCV-based image processing, which detects black-and-white marker patterns. However, these methods often fail under noise, motion blur, defocus, or varying illumination conditions. Deep-learning-based recognition offers improved robustness under such conditions, but requires large amounts of annotated data. Annotation must typically be done manually, as the type and position of objects cannot be detected automatically, making dataset preparation a major bottleneck. In contrast, ArUco markers include built-in recognition modules that provide both ID and positional information, enabling automatic annotation. This paper proposes an automated annotation method for training deep-learning models on ArUco marker images. By leveraging marker detection results obtained from the ArUco module, the proposed approach eliminates the need for manual labeling. A YOLO-based model is trained using the automatically annotated dataset, and its performance is evaluated under various conditions. Experimental results demonstrate that the proposed method improves recognition performance compared with conventional image-processing techniques, particularly for images affected by blur or defocus. Automatic annotation also reduces human effort and ensures consistent labeling quality. Future work will investigate the relationship between confidence thresholds and recognition performance.

94. Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic

Authors: Jeong Woon Lee , Kyoleen Kwak , Daeho Kim , Hyoseok Hwang
URL: https://arxiv.org/abs/2601.22970
Abstract:

Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy’s output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function’s mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.

95. Residual Context Diffusion Language Models

Authors: Yuezhou Hu , Harman Singh , Monishwaran Maheswaran , Haocheng Xi , Coleman Hooper , Jintao Zhang , Aditya Tomar , Michael W. Mahoney , Sewon Min , Mehrdad Farajtabar , Kurt Keutzer , Amir Gholami , Chenfeng Xu
URL: https://arxiv.org/abs/2601.22954
Abstract:

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a “remasking” mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at equivalent accuracy levels.

96. Perplexity Cannot Always Tell Right from Wrong

Authors: Petar Veličković , Federico Barbero , Christos Perivolaropoulos , Simon Osindero , Razvan Pascanu
URL: https://arxiv.org/abs/2601.22950
Abstract:

Perplexity – a function measuring a model’s overall level of “surprise” when encountering a particular output – has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, often from an empirical manner. Here we leverage recent results on Transformer continuity to show in a rigorous manner how perplexity may be an unsuitable metric for model selection. Specifically, we prove that, if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently – a necessary pre-requisite for strong generalisation – it must imply existence of another sequence with very low perplexity, but not predicted correctly by that same model. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select for the more accurate model – rather, any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.

97. From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models

Authors: Farnaz Soltaniani , Mohammad Ghafari
URL: https://arxiv.org/abs/2601.22946
Abstract:

Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.

98. A Real-Time Privacy-Preserving Behavior Recognition System via Edge-Cloud Collaboration

Authors: Huan Song , Shuyu Tian , Junyi Hao , Cheng Yuan , Zhenyu Jia , Jiawei Shao , Xuelong Li
URL: https://arxiv.org/abs/2601.22938
Abstract:

As intelligent sensing expands into high-privacy environments such as restrooms and changing rooms, the field faces a critical privacy-security paradox. Traditional RGB surveillance raises significant concerns regarding visual recording and storage, while existing privacy-preserving methods-ranging from physical desensitization to traditional cryptographic or obfuscation techniques-often compromise semantic understanding capabilities or fail to guarantee mathematical irreversibility against reconstruction attacks. To address these challenges, this study presents a novel privacy-preserving perception technology based on the AI Flow theoretical framework and an edge-cloud collaborative architecture. The proposed methodology integrates source desensitization with irreversible feature mapping. Leveraging Information Bottleneck theory, the edge device performs millisecond-level processing to transform raw imagery into abstract feature vectors via non-linear mapping and stochastic noise injection. This process constructs a unidirectional information flow that strips identity-sensitive attributes, rendering the reconstruction of original images impossible. Subsequently, the cloud platform utilizes multimodal family models to perform joint inference solely on these abstract vectors to detect abnormal behaviors. This approach fundamentally severs the path to privacy leakage at the architectural level, achieving a breakthrough from video surveillance to de-identified behavior perception and offering a robust solution for risk management in high-sensitivity public spaces.

99. Protecting Private Code in IDE Autocomplete using Differential Privacy

Authors: Evgeny Grigorenko , David Stanojević , David Ilić , Egor Bogomolov , Kostadin Cvejoski
URL: https://arxiv.org/abs/2601.22935
Abstract:

Modern Integrated Development Environments (IDEs) increasingly leverage Large Language Models (LLMs) to provide advanced features like code autocomplete. While powerful, training these models on user-written code introduces significant privacy risks, making the models themselves a new type of data vulnerability. Malicious actors can exploit this by launching attacks to reconstruct sensitive training data or infer whether a specific code snippet was used for training. This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training an LLM for Kotlin code completion. We fine-tune a \texttt{Mellum} model using DP and conduct a comprehensive evaluation of its privacy and utility. Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack’s success rate close to a random guess (AUC from 0.901 to 0.606). Furthermore, we show that this privacy guarantee comes at a minimal cost to model performance, with the DP-trained model achieving utility scores comparable to its non-private counterpart, even when trained on 100x less data. Our findings suggest that DP is a practical and effective solution for building private and trustworthy AI-powered IDE features.

100. MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving

Authors: Xidong Li , Mingyu Guo , Chenchao Xu , Bailin Li , Wenjing Zhu , Yangang Zou , Rui Chen , Zehuan Wang
URL: https://arxiv.org/abs/2601.22930
Abstract:

Trajectory planning is a core task in autonomous driving, requiring the prediction of safe and comfortable paths across diverse scenarios. Integrating Multi-modal Large Language Models (MLLMs) with Reinforcement Learning (RL) has shown promise in addressing “long-tail” scenarios. However, existing methods are constrained to single-turn reasoning, limiting their ability to handle complex tasks requiring iterative refinement. To overcome this limitation, we present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback. MTDrive introduces Multi-Turn Group Relative Policy Optimization (mtGRPO), which mitigates reward sparsity by computing relative advantages across turns. We further construct an interactive trajectory understanding dataset from closed-loop simulation to support multi-turn training. Experiments on the NAVSIM benchmark demonstrate superior performance compared to existing methods, validating the effectiveness of our multi-turn reasoning paradigm. Additionally, we implement system-level optimizations to reduce data transfer overhead caused by high-resolution images and multi-turn sequences, achieving 2.5x training throughput. Our data, models, and code will be made available soon.

101. BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

Authors: Weiqin Yang , Bohao Wang , Zhenxiang Xu , Jiawei Chen , Shengjia Zhang , Jingbang Chen , Canghong Jin , Can Wang
URL: https://arxiv.org/abs/2601.22925
Abstract:

Recent years have witnessed a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ supervised fine-tuning (SFT) to adapt LLMs to recommendation scenarios, and utilize beam search during inference to efficiently retrieve $B$ top-ranked recommended items. However, we identify a critical training-inference inconsistency: while SFT optimizes the overall probability of positive items, it does not guarantee that such items will be retrieved by beam search even if they possess high overall probabilities. Due to the greedy pruning mechanism, beam search can prematurely discard a positive item once its prefix probability is insufficient. To address this inconsistency, we propose BEAR (Beam-SEarch-Aware Regularization), a novel fine-tuning objective that explicitly accounts for beam search behavior during training. Rather than directly simulating beam search for each instance during training, which is computationally prohibitive, BEAR enforces a relaxed necessary condition: each token in a positive item must rank within the top-$B$ candidate tokens at each decoding step. This objective effectively mitigates the risk of incorrect pruning while incurring negligible computational overhead compared to standard SFT. Extensive experiments across four real-world datasets demonstrate that BEAR significantly outperforms strong baselines. Code will be released upon acceptance.

102. Evaluating Large Language Models for Security Bug Report Prediction

Authors: Farnaz Soltaniani , Shoaib Razzaq , Mohammad Ghafari
URL: https://arxiv.org/abs/2601.22921
Abstract:

Early detection of security bug reports (SBRs) is critical for timely vulnerability mitigation. We present an evaluation of prompt-based engineering and fine-tuning approaches for predicting SBRs using Large Language Models (LLMs). Our findings reveal a distinct trade-off between the two approaches. Prompted proprietary models demonstrate the highest sensitivity to SBRs, achieving a G-measure of 77% and a recall of 74% on average across all the datasets, albeit at the cost of a higher false-positive rate, resulting in an average precision of only 22%. Fine-tuned models, by contrast, exhibit the opposite behavior, attaining a lower overall G-measure of 51% but substantially higher precision of 75% at the cost of reduced recall of 36%. Though a one-time investment in building fine-tuned models is necessary, the inference on the largest dataset is up to 50 times faster than that of proprietary models. These findings suggest that further investigations to harness the power of LLMs for SBR prediction are necessary.

103. DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Authors: Hun Chang , Byunghee Cha , Jong Chul Ye
URL: https://arxiv.org/abs/2601.22904
Abstract:

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.

104. DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

Authors: Yuxuan Lou , Ziming Wu , Yaochen Wang , Yong Liu , Yingxuan Ren , Fuming Lai , Shaobing Lian , Jie Tang , Yang You
URL: https://arxiv.org/abs/2601.22889
Abstract:

Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce \textbf{``Silent Thought, Spoken Answer’’} – a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present \method{}, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, \method{} jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct \dataset{}, the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show \method{} achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2\% WER) and preserving language understanding (66.2\% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.

105. Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial

Authors: Jio Oh , Paul Vicinanza , Thomas Butler , Steven Euijong Whang , Dezhi Hong , Amani Namboori
URL: https://arxiv.org/abs/2601.22888
Abstract:

More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce $\textbf{MDial}$, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect – lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features – for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users’ morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel $\textbf{MDialBench}$mark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.

106. MoVE: Mixture of Value Embeddings – A New Axis for Scaling Parametric Memory in Autoregressive Models

Authors: Yangyan Li
URL: https://arxiv.org/abs/2601.22887
Abstract:

Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model’s parametric memory – its repository of factual knowledge or visual patterns – traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce $\textbf{MoVE (Mixture of Value Embeddings)}$, a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to dynamically mix retrieved concepts from this bank into the standard value projection. This architecture allows parametric memory to be scaled independently of network depth by simply increasing the number of embedding slots. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: Text Generation and Image Generation. In both domains, MoVE yields consistent performance improvements over standard and layer-wise memory baselines, enabling the construction of “memory-dense” models that achieve lower perplexity and higher fidelity than their dense counterparts at comparable compute budgets.

107. Reinforcement Learning-Based Co-Design and Operation of Chiller and Thermal Energy Storage for Cost-Optimal HVAC Systems

Authors: Tanay Raghunandan Srinivasa , Vivek Deulkar , Aviruch Bhatia , Vishal Garg
URL: https://arxiv.org/abs/2601.22880
Abstract:

We study the joint operation and sizing of cooling infrastructure for commercial HVAC systems using reinforcement learning, with the objective of minimizing life-cycle cost over a 30-year horizon. The cooling system consists of a fixed-capacity electric chiller and a thermal energy storage (TES) unit, jointly operated to meet stochastic hourly cooling demands under time-varying electricity prices. The life-cycle cost accounts for both capital expenditure and discounted operating cost, including electricity consumption and maintenance. A key challenge arises from the strong asymmetry in capital costs: increasing chiller capacity by one unit is far more expensive than an equivalent increase in TES capacity. As a result, identifying the right combination of chiller and TES sizes, while ensuring zero loss-of-cooling-load under optimal operation, is a non-trivial co-design problem. To address this, we formulate the chiller operation problem for a fixed infrastructure configuration as a finite-horizon Markov Decision Process (MDP), in which the control action is the chiller part-load ratio (PLR). The MDP is solved using a Deep Q Network (DQN) with a constrained action space. The learned DQN RL policy minimizes electricity cost over historical traces of cooling demand and electricity prices. For each candidate chiller-TES sizing configuration, the trained policy is evaluated. We then restrict attention to configurations that fully satisfy the cooling demand and perform a life-cycle cost minimization over this feasible set to identify the cost-optimal infrastructure design. Using this approach, we determine the optimal chiller and thermal energy storage capacities to be 700 and 1500, respectively.

108. EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

Authors: Li Zhou , Hao Jiang , Junjie Li , Tianrui Wang , Haizhou Li
URL: https://arxiv.org/abs/2601.22873
Abstract:

Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating a EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters,less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer’s effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.

109. Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild

Authors: Alexander Loth , Martin Kappes , Marc-Oliver Pahl
URL: https://arxiv.org/abs/2601.22871
Abstract:

As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence. This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples “authenticity” from “attribution” to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy. Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance ($r=-0.10$). Instead, “fake news familiarity” emerges as a candidate mediator ($r=0.35$), suggesting that exposure may function as adversarial training for human discriminators. We identify a “fluency trap” where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text. These findings suggest that “pre-bunking” interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.

110. Degradation-Aware Frequency Regulation of a Heterogeneous Battery Fleet via Reinforcement Learning

Authors: Tanay Raghunandan Srinivasa , Vivek Deulkar , Jia Bhargava , Mohammad Hajiesmaili , Prashant Shenoy
URL: https://arxiv.org/abs/2601.22865
Abstract:

Battery energy storage systems are increasingly deployed as fast-responding resources for grid balancing services such as frequency regulation and for mitigating renewable generation uncertainty. However, repeated charging and discharging induces cycling degradation and reduces battery lifetime. This paper studies the real-time scheduling of a heterogeneous battery fleet that collectively tracks a stochastic balancing signal subject to per-battery ramp-rate and capacity constraints, while minimizing long-term cycling degradation. Cycling degradation is fundamentally path-dependent: it is determined by charge-discharge cycles formed by the state-of-charge (SoC) trajectory and is commonly quantified via rainflow cycle counting. This non-Markovian structure makes it difficult to express degradation as an additive per-time-step cost, complicating classical dynamic programming approaches. We address this challenge by formulating the fleet scheduling problem as a Markov decision process (MDP) with constrained action space and designing a dense proxy reward that provides informative feedback at each time step while remaining aligned with long-term cycle-depth reduction. To scale learning to large state-action spaces induced by fine-grained SoC discretization and asymmetric per-battery constraints, we develop a function-approximation reinforcement learning method using an Extreme Learning Machine (ELM) as a random nonlinear feature map combined with linear temporal-difference learning. We evaluate the proposed approach on a toy Markovian signal model and on a Markovian model trained from real-world regulation signal traces obtained from the University of Delaware, and demonstrate consistent reductions in cycle-depth occurrence and degradation metrics compared to baseline scheduling policies.

111. Bayesian Interpolating Neural Network (B-INN): a scalable and reliable Bayesian model for large-scale physical systems

Authors: Chanwook Park , Brian Kim , Jiachen Guo , Wing Kam Liu
URL: https://arxiv.org/abs/2601.22860
Abstract:

Neural networks and machine learning models for uncertainty quantification suffer from limited scalability and poor reliability compared to their deterministic counterparts. In industry-scale active learning settings, where generating a single high-fidelity simulation may require days or weeks of computation and produce data volumes on the order of gigabytes, they quickly become impractical. This paper proposes a scalable and reliable Bayesian surrogate model, termed the Bayesian Interpolating Neural Network (B-INN). The B-INN combines high-order interpolation theory with tensor decomposition and alternating direction algorithm to enable effective dimensionality reduction without compromising predictive accuracy. We theoretically show that the function space of a B-INN is a subset of that of Gaussian processes, while its Bayesian inference exhibits linear complexity, $\mathcal{O}(N)$, with respect to the number of training samples. Numerical experiments demonstrate that B-INNs can be from 20 times to 10,000 times faster with a robust uncertainty estimation compared to Bayesian neural networks and Gaussian processes. These capabilities make B-INN a practical foundation for uncertainty-driven active learning in large-scale industrial simulations, where computational efficiency and robust uncertainty calibration are paramount.

112. MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Authors: Chuanzhe Guo , Jingjing Wu , Sijun He , Yang Chen , Zhaoqi Kuang , Shilong Fan , Bingjin Chen , Siqi Bao , Jing Liu , Hua Wu , Qingfu Zhu , Wanxiang Che , Haifeng Wang
URL: https://arxiv.org/abs/2601.22859
Abstract:

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at this https URL .

113. Learning to Build Shapes by Extrusion

Authors: Thor Vestergaard Christiansen , Karran Pandey , Alba Reinders , Karan Singh , Morten Rieger Hannemose , J. Andreas Bærentzen
URL: https://arxiv.org/abs/2601.22858
Abstract:

We introduce Text Encoded Extrusion (TEE), a text-based representation that expresses mesh construction as sequences of face extrusions rather than polygon lists, and a method for generating 3D meshes from TEE using a large language model (LLM). By learning extrusion sequences that assemble a mesh, similar to the way artists create meshes, our approach naturally supports arbitrary output face counts and produces manifold meshes by design, in contrast to recent transformer-based models. The learnt extrusion sequences can also be applied to existing meshes - enabling editing in addition to generation. To train our model, we decompose a library of quadrilateral meshes with non-self-intersecting face loops into constituent loops, which can be viewed as their building blocks, and finetune an LLM on the steps for reassembling the meshes by performing a sequence of extrusions. We demonstrate that our representation enables reconstruction, novel shape synthesis, and the addition of new features to existing meshes.

114. Just-in-Time Catching Test Generation at Meta

Authors: Matthew Becker , Yifei Chen , Nicholas Cochran , Pouyan Ghasemi , Abhishek Gulati , Mark Harman , Zachary Haluza , Mehrdad Honarkhah , Herve Robert , Jiacheng Liu , Weini Liu , Sreeja Thummala , Xiaoning Yang , Rui Xin , Sophie Zeng
URL: https://arxiv.org/abs/2601.22832
Abstract:

We report on Just-in-Time catching test generation at Meta, designed to prevent bugs in large scale backend systems of hundreds of millions of line of code. Unlike traditional hardening tests, which pass at generation time, catching tests are meant to fail, surfacing bugs before code lands. The primary challenge is to reduce development drag from false positive test failures. Analyzing 22,126 generated tests, we show code-change-aware methods improve candidate catch generation 4x over hardening tests and 20x over coincidentally failing tests. To address false positives, we use rule-based and LLM-based assessors. These assessors reduce human review load by 70%. Inferential statistical analysis showed that human-accepted code changes are assessed to have significantly more false positives, while human-rejected changes have significantly more true positives. We reported 41 candidate catches to engineers; 8 were confirmed to be true positives, 4 of which would have led to serious failures had they remained uncaught. Overall, our results show that Just-in-Time catching is scalable, industrially applicable, and that it prevents serious failures from reaching production.

115. Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

Authors: Mathieu Petitbois , Rémy Portelas , Sylvain Lamprier
URL: https://arxiv.org/abs/2601.22823
Abstract:

We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages offline goal-conditioned RL techniques, such as hindsight relabeling and value learning, and combine it with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods. Code, datasets and visuals are available in: this https URL .

116. User-Adaptive Meta-Learning for Cold-Start Medication Recommendation with Uncertainty Filtering

Authors: Arya Hadizadeh Moghaddam , Mohsen Nayebi Kerdabadi , Dongjie Wang , Mei Liu , Zijun Yao
URL: https://arxiv.org/abs/2601.22820
Abstract:

Large-scale Electronic Health Record (EHR) databases have become indispensable in supporting clinical decision-making through data-driven treatment recommendations. However, existing medication recommender methods often struggle with a user (i.e., patient) cold-start problem, where recommendations for new patients are usually unreliable due to the lack of sufficient prescription history for patient profiling. While prior studies have utilized medical knowledge graphs to connect medication concepts through pharmacological or chemical relationships, these methods primarily focus on mitigating the item cold-start issue and fall short in providing personalized recommendations that adapt to individual patient characteristics. Meta-learning has shown promise in handling new users with sparse interactions in recommender systems. However, its application to EHRs remains underexplored due to the unique sequential structure of EHR data. To tackle these challenges, we propose MetaDrug, a multi-level, uncertainty-aware meta-learning framework designed to address the patient cold-start problem in medication recommendation. MetaDrug proposes a novel two-level meta-adaptation mechanism, including self-adaptation, which adapts the model to new patients using their own medical events as support sets to capture temporal dependencies; and peer-adaptation, which adapts the model using similar visits from peer patients to enrich new patient representations. Meanwhile, to further improve meta-adaptation outcomes, we introduce an uncertainty quantification module that ranks the support visits and filters out the unrelated information for adaptation consistency. We evaluate our approach on the MIMIC-III and Acute Kidney Injury (AKI) datasets. Experimental results on both datasets demonstrate that MetaDrug consistently outperforms state-of-the-art medication recommendation methods on cold-start patients.

117. Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Authors: Charles Westphal , Keivan Navaie , Fernando E. Rosas
URL: https://arxiv.org/abs/2601.22818
Abstract:

Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100\% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17$\rightarrow$30\% (+78\%) and 24$\rightarrow$43\% (+80\%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9$\rightarrow$19\% (+123\%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard approaches measure distributional shift, which is an expected side-effect of fine-tuning. Instead, we propose a mechanistic interpretability approach: linear probes trained on later-layer activations detect the secret with up to 33\% higher accuracy in fine-tuned models compared to base models, even for low-recoverability schemes. This suggests that malicious fine-tuning leaves actionable internal signatures amenable to interpretability-based defenses.

118. SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models

Authors: Pit Neitemeier , Alessio Serra , Jiaze Li , Sascha Wirges , Lukas Balles , Jan Hendrik Metzen
URL: https://arxiv.org/abs/2601.22805
Abstract:

Hierarchical sequence models replace fixed tokenization with learned segmentations that compress long byte sequences for efficient autoregressive modeling. While recent end-to-end methods can learn meaningful boundaries from the language-modeling objective alone, it remains difficult to quantitatively assess and systematically steer where compute is spent. We introduce a router-agnostic metric of boundary quality, boundary enrichment B, which measures how strongly chunk starts concentrate on positions with high next-byte surprisal. Guided by this metric, we propose Sombrero, which steers boundary placement toward predictive difficulty via a confidence-alignment boundary loss and stabilizes boundary learning by applying confidence-weighted smoothing at the input level rather than on realized chunks. On 1B scale, across UTF-8 corpora covering English and German text as well as code and mathematical content, Sombrero improves the accuracy-efficiency trade-off and yields boundaries that more consistently align compute with hard-to-predict positions.

119. Beyond Abstract Compliance: Operationalising trust in AI as a moral relationship

Authors: Lameck Mbangula Amugongo , Tutaleni Asino , Nicola J Bidwell
URL: https://arxiv.org/abs/2601.22769
Abstract:

Dominant approaches, e.g. the EU’s “Trustworthy AI framework”, treat trust as a property that can be designed for, evaluated, and governed according to normative and technical criteria. They do not address how trust is subjectively cultivated and experienced, culturally embedded, and inherently relational. This paper proposes some expanded principles for trust in AI that can be incorporated into common development methods and frame trust as a dynamic, temporal relationship, which involves transparency and mutual respect. We draw on relational ethics and, in particular, African communitarian philosophies, to foreground the nuances of inclusive, participatory processes and long-term relationships with communities. Involving communities throughout the AI lifecycle can foster meaningful relationships with AI design and development teams that incrementally build trust and promote more equitable and context-sensitive AI systems. We illustrate how trust-enabling principles based on African relational ethics can be operationalised, using two use-cases for AI: healthcare and education.

120. How Far Can Pretrained LLMs Go in Symbolic Music? Controlled Comparisons of Supervised and Preference-based Adaptation

Authors: Deepak Kumar , Emmanouil Karystinaios , Gerhard Widmer , Markus Schedl
URL: https://arxiv.org/abs/2601.22764
Abstract:

Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. Despite growing interest, the practical effectiveness of adapting instruction-tuned LLMs to symbolic music remains insufficiently characterized. We present a controlled comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants and a music-specialized LLM baseline. Across multiple symbolic music corpora and evaluation signals, we provide some insights into adaptation choices for symbolic music applications. We highlight the domain adaptation vs.~preserving prior information tradeoff as well as the distinct behaviour of metrics used to measure the domain adaptation for symbolic music.

121. Qualitative Evaluation of LLM-Designed GUI

Authors: Bartosz Sawicki , Tomasz Les , Dariusz Parzych , Aleksandra Wycisk-Ficek , Pawel Trebacz , Pawel Zawadzki
URL: https://arxiv.org/abs/2601.22759
Abstract:

As generative artificial intelligence advances, Large Language Models (LLMs) are being explored for automated graphical user interface (GUI) design. This study investigates the usability and adaptability of LLM-generated interfaces by analysing their ability to meet diverse user needs. The experiments included utilization of three state-of-the-art models from January 2025 (OpenAI GPT o3-mini-high, DeepSeek R1, and Anthropic Claude 3.5 Sonnet) generating mockups for three interface types: a chat system, a technical team panel, and a manager dashboard. Expert evaluations revealed that while LLMs are effective at creating structured layouts, they face challenges in meeting accessibility standards and providing interactive functionality. Further testing showed that LLMs could partially tailor interfaces for different user personas but lacked deeper contextual understanding. The results suggest that while LLMs are promising tools for early-stage UI prototyping, human intervention remains critical to ensure usability, accessibility, and user satisfaction.

122. Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models

Authors: Guillermo Gil de Avalle , Laura Maruster , Christos Emmanouilidis
URL: https://arxiv.org/abs/2601.22754
Abstract:

Industrial troubleshooting guides encode diagnostic procedures in flowchart-like diagrams where spatial layout and technical language jointly convey meaning. To integrate this knowledge into operator support systems, which assist shop-floor personnel in diagnosing and resolving equipment issues, the information must first be extracted and structured for machine interpretation. However, when performed manually, this extraction is labor-intensive and error-prone. Vision Language Models offer potential to automate this process by jointly interpreting visual and textual meaning, yet their performance on such guides remains underexplored. This paper evaluates two VLMs on extracting structured knowledge, comparing two prompting strategies: standard instruction-guided versus an augmented approach that cues troubleshooting layout patterns. Results reveal model-specific trade-offs between layout sensitivity and semantic robustness, informing practical deployment decisions.

Authors: Pingping Liu , Jiamiao Liu , Zijian Zhang , Hao Miao , Qi Jiang , Qingliang Li , Qiuzhan Zhou , Irwin King
URL: https://arxiv.org/abs/2601.22746
Abstract:

Urban region profiling, the task of characterizing geographical areas, is crucial for urban planning and resource allocation. However, existing research in this domain faces two significant limitations. First, most methods are confined to single-task prediction, failing to capture the interconnected, multi-faceted nature of urban environments where numerous indicators are deeply correlated. Second, the field lacks a standardized experimental benchmark, which severely impedes fair comparison and reproducible progress. To address these challenges, we first establish a comprehensive benchmark for multi-task urban region profiling, featuring multi-modal features and a diverse set of strong baselines to ensure a fair and rigorous evaluation environment. Concurrently, we propose UrbanMoE, the first sparse multi-modal, multi-expert framework specifically architected to solve the multi-task challenge. Leveraging a sparse Mixture-of-Experts architecture, it dynamically routes multi-modal features to specialized sub-networks, enabling the simultaneous prediction of diverse urban indicators. We conduct extensive experiments on three real-world datasets within our benchmark, where UrbanMoE consistently demonstrates superior performance over all baselines. Further in-depth analysis validates the efficacy and efficiency of our approach, setting a new state-of-the-art and providing the community with a valuable tool for future research in urban analytics

124. Decomposing Epistemic Uncertainty for Causal Decision Making

Authors: Md Musfiqur Rahman , Ziwei Jiang , Hilaf Hasson , Murat Kocaoglu
URL: https://arxiv.org/abs/2601.22736
Abstract:

Causal inference from observational data provides strong evidence for the best action in decision-making without performing expensive randomized trials. The effect of an action is usually not identifiable under unobserved confounding, even with an infinite amount of data. Recent work uses neural networks to obtain practical bounds to such causal effects, which is often an intractable problem. However, these approaches may overfit to the dataset and be overconfident in their causal effect estimates. Moreover, there is currently no systematic approach to disentangle how much of the width of causal effect bounds is due to fundamental non-identifiability versus how much is due to finite-sample limitations. We propose a novel framework to address this problem by considering a confidence set around the empirical observational distribution and obtaining the intersection of causal effect bounds for all distributions in this confidence set. This allows us to distinguish the part of the interval that can be reduced by collecting more samples, which we call sample uncertainty, from the part that can only be reduced by observing more variables, such as latent confounders or instrumental variables, but not with more data, which we call non-ID uncertainty. The upper and lower bounds to this intersection are obtained by solving min-max and max-min problems with neural causal models by searching over all distributions that the dataset might have been sampled from, and all SCMs that entail the corresponding distribution. We demonstrate via extensive experiments on synthetic and real-world datasets that our algorithm can determine when collecting more samples will not help determine the best action. This can guide practitioners to collect more variables or lean towards a randomized study for best action identification.

125. ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model

Authors: Xiaoshu Chen , Sihang Zhou , Ke Liang , Taichun Zhou , Xinwang Liu
URL: https://arxiv.org/abs/2601.22730
Abstract:

Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose a loose ImgCoT, a hybrid reasoning that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.

126. OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

Authors: Jin Li , Tao Chen , Shuai Jiang , Weijie Wang , Jingwen Luo , Chenhui Wu
URL: https://arxiv.org/abs/2601.22725
Abstract:

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall’s $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.

127. A Cross-Domain Graph Learning Protocol for Single-Step Molecular Geometry Refinement

Authors: Chengchun Liu , Wendi Cai , Boxuan Zhao , Fanyang Mo
URL: https://arxiv.org/abs/2601.22723
Abstract:

Accurate molecular geometries are a prerequisite for reliable quantum-chemical predictions, yet density functional theory (DFT) optimization remains a major bottleneck for high-throughput molecular screening. Here we present GeoOpt-Net, a multi-branch SE(3)-equivariant geometry refinement network that predicts DFT-quality structures at the B3LYP/TZVP level of theory in a single forward pass starting from inexpensive initial conformers generated at a low-cost force-field level. GeoOpt-Net is trained using a two-stage strategy in which a broadly pretrained geometric representation is subsequently fine-tuned to approach B3LYP/TZVP-level accuracy, with theory- and basis-set-aware calibration enabled by a fidelity-aware feature modulation (FAFM) mechanism. Benchmarking against representative approaches spanning classical conformer generation (RDKit), semiempirical quantum methods (xTB), data-driven geometry refinement pipelines (Auto3D), and machine-learning interatomic potentials (UMA) on external drug-like molecules demonstrates that GeoOpt-Net achieves sub-milli-Å all-atom RMSD with near-zero B3LYP/TZVP single-point energy deviations, indicating DFT-ready geometries that closely reproduce both structural and energetic references. Beyond geometric metrics, GeoOpt-Net generates initial guesses intrinsically compatible with DFT convergence criteria, yielding nonzero ``All-YES’’ convergence rates (65.0\% under loose and 33.4\% under default thresholds), and substantially reducing re-optimization steps and wall-clock time. GeoOpt-Net further exhibits smooth and predictable energy scaling with molecular complexity while preserving key electronic observables such as dipole moments. Collectively, these results establish GeoOpt-Net as a scalable, physically consistent geometry refinement framework that enables efficient acceleration of DFT-based quantum-chemical workflows.

128. AEGIS: White-Box Attack Path Generation using LLMs and Training Effectiveness Evaluation for Large-Scale Cyber Defence Exercises

Authors: Ivan K. Tung , Yu Xiang Shi , Alex Chien , Wenkai Liu , Lawrence Zheng
URL: https://arxiv.org/abs/2601.22720
Abstract:

Creating attack paths for cyber defence exercises requires substantial expert effort. Existing automation requires vulnerability graphs or exploit sets curated in advance, limiting where it can be applied. We present AEGIS, a system that generates attack paths using LLMs, white-box access, and Monte Carlo Tree Search over real exploit execution. LLM-based search discovers exploits dynamically without pre-existing vulnerability graphs, while white-box access enables validating exploits in isolation before committing to attack paths. Evaluation at CIDeX 2025, a large-scale exercise spanning 46 IT hosts, showed that AEGIS-generated paths are comparable to human-authored scenarios across four dimensions of training experience (perceived learning, engagement, believability, challenge). Results were measured with a validated questionnaire extensible to general simulation-based training. By automating exploit chain discovery and validation, AEGIS reduces scenario development from months to days, shifting expert effort from technical validation to scenario design.

129. Breaking the Blocks: Continuous Low-Rank Decomposed Scaling for Unified LLM Quantization and Adaptation

Authors: Pingzhi Tang , Ruijie Zhou , Fanxu Meng , Wenjie Pei , Muhan Zhang
URL: https://arxiv.org/abs/2601.22716
Abstract:

Current quantization methods for LLMs predominantly rely on block-wise structures to maintain efficiency, often at the cost of representational flexibility. In this work, we demonstrate that element-wise quantization can be made as efficient as block-wise scaling while providing strictly superior expressive power by modeling the scaling manifold as continuous low-rank matrices ($S = BA$). We propose Low-Rank Decomposed Scaling (LoRDS), a unified framework that rethinks quantization granularity through this low-rank decomposition. By “breaking the blocks” of spatial constraints, LoRDS establishes a seamless efficiency lifecycle: it provides high-fidelity PTQ initialization refined via iterative optimization, enables joint QAT of weights and scaling factors, and facilitates high-rank multiplicative PEFT adaptation. Unlike additive PEFT approaches such as QLoRA, LoRDS enables high-rank weight updates within a low-rank budget while incurring no additional inference overhead. Supported by highly optimized Triton kernels, LoRDS consistently outperforms state-of-the-art baselines across various model families in both quantization and downstream fine-tuning tasks. Notably, on Llama3-8B, our method achieves up to a 27.0% accuracy improvement at 3 bits over NormalFloat quantization and delivers a 1.5x inference speedup on NVIDIA RTX 4090 while enhancing PEFT performance by 9.6% on downstream tasks over 4bit QLoRA, offering a robust and integrated solution for unified compression and adaptation of LLMs.

130. Vision-Language Models Unlock Task-Centric Latent Actions

Authors: Alexander Nikulin , Ilya Zisman , Albina Klepach , Denis Tarasov , Alexander Derevyagin , Andrei Polubarov , Lyubaykin Nikita , Vladislav Kurenkov
URL: https://arxiv.org/abs/2601.22714
Abstract:

Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.

131. Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Authors: Yanlong Chen , Amirhossein Habibian , Luca Benini , Yawei Li
URL: https://arxiv.org/abs/2601.22709
Abstract:

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.

132. Deep Learning-Based Early-Stage IR-Drop Estimation via CNN Surrogate Modeling

Authors: Ritesh Bhadana
URL: https://arxiv.org/abs/2601.22707
Abstract:

IR-drop is a critical power integrity challenge in modern VLSI designs that can cause timing degradation, reliability issues, and functional failures if not detected early in the design flow. Conventional IR-drop analysis relies on physics-based signoff tools, which provide high accuracy but incur significant computational cost and require near-final layout information, making them unsuitable for rapid early-stage design exploration. In this work, we propose a deep learning-based surrogate modeling approach for early-stage IR-drop estimation using a CNN. The task is formulated as a dense pixel-wise regression problem, where spatial physical layout features are mapped directly to IR-drop heatmaps. A U-Net-based encoder-decoder architecture with skip connections is employed to effectively capture both local and global spatial dependencies within the layout. The model is trained on a physics-inspired synthetic dataset generated by us, which incorporates key physical factors including power grid structure, cell density distribution, and switching activity. Model performance is evaluated using standard regression metrics such as Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR). Experimental results demonstrate that the proposed approach can accurately predict IR-drop distributions with millisecond-level inference time, enabling fast pre-signoff screening and iterative design optimization. The proposed framework is intended as a complementary early-stage analysis tool, providing designers with rapid IR-drop insight prior to expensive signoff analysis. The implementation, dataset generation scripts, and the interactive inference application are publicly available at: this https URL . The live application can be accessed at: this https URL .

133. PEAR: Pixel-aligned Expressive humAn mesh Recovery

Authors: Jiahao Wu , Yunfei Liu , Lijian Lin , Ye Zhu , Lei Zhu , Jingyi Li , Yu Li
URL: https://arxiv.org/abs/2601.22693
Abstract:

Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR-a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: this https URL

134. FNF: Functional Network Fingerprint for Large Language Models

Authors: Yiheng Liu , Junhao Ning , Sichen Xia , Haiyang Sun , Yang Yang , Hanyang Chi , Xiaohui Gao , Ning Qiang , Bao Ge , Junwei Han , Xintao Hu
URL: https://arxiv.org/abs/2601.22692
Abstract:

The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open-source LLMs and protecting developers’ intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training-free, sample-efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine-tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non-invasive, and effective tool for protecting LLM intellectual property. The code is available at this https URL .

135. Do Transformers Have the Ability for Periodicity Generalization?

Authors: Huanyu Liu , Ge Li , Yihong Dong , Sihan Wu , Peixu Wang , Sihao Cheng , Taozhi Chen , Kechi Zhang , Hao Zhu , Tongxuan Liu
URL: https://arxiv.org/abs/2601.22690
Abstract:

Large language models (LLMs) based on the Transformer have demonstrated strong performance across diverse tasks. However, current models still exhibit substantial limitations in out-of-distribution (OOD) generalization compared with humans. We investigate this gap through periodicity, one of the basic OOD scenarios. Periodicity captures invariance amid variation. Periodicity generalization represents a model’s ability to extract periodic patterns from training data and generalize to OOD scenarios. We introduce a unified interpretation of periodicity from the perspective of abstract algebra and reasoning, including both single and composite periodicity, to explain why Transformers struggle to generalize periodicity. Then we construct Coper about composite periodicity, a controllable generative benchmark with two OOD settings, Hollow and Extrapolation. Experiments reveal that periodicity generalization in Transformers is limited, where models can memorize periodic data during training, but cannot generalize to unseen composite periodicity. We release the source code to support future research.

136. Fire on Motion: Optimizing Video Pass-bands for Efficient Spiking Action Recognition

Authors: Shuhan Ye , Yuanbin Qian , Yi Yu , Chong Wang , Yuqi Xie , Jiazhen Xu , Kun Wang , Xudong Jiang
URL: https://arxiv.org/abs/2601.22675
Abstract:

Spiking neural networks (SNNs) have gained traction in vision due to their energy efficiency, bio-plausibility, and inherent temporal processing. Yet, despite this temporal capacity, most progress concentrates on static image benchmarks, and SNNs still underperform on dynamic video tasks compared to artificial neural networks (ANNs). In this work, we diagnose a fundamental pass-band mismatch: Standard spiking dynamics behave as a temporal low pass that emphasizes static content while attenuating motion bearing bands, where task relevant information concentrates in dynamic tasks. This phenomenon explains why SNNs can approach ANNs on static tasks yet fall behind on tasks that demand richer temporal this http URL remedy this, we propose the Pass-Bands Optimizer (PBO), a plug-and-play module that optimizes the temporal pass-band toward task-relevant motion bands. PBO introduces only two learnable parameters, and a lightweight consistency constraint that preserves semantics and boundaries, incurring negligible computational overhead and requires no architectural changes. PBO deliberately suppresses static components that contribute little to discrimination, effectively high passing the stream so that spiking activity concentrates on motion bearing content. On UCF101, PBO yields over ten percentage points improvement. On more complex multi-modal action recognition and weakly supervised video anomaly detection, PBO delivers consistent and significant gains, offering a new perspective for SNN based video processing and understanding.

137. From Horizontal Layering to Vertical Integration: A Comparative Study of the AI-Driven Software Development Paradigm

Authors: Chi Zhang , Zehan Li , Ziqian Zhong , Haibing Ma , Dan Xiao , Chen Lin , Ming Dong
URL: https://arxiv.org/abs/2601.22667
Abstract:

This paper examines the organizational implications of Generative AI adoption in software engineering through a multiple-case comparative study. We contrast two development environments: a traditional enterprise (brownfield) and an AI-native startup (greenfield). Our analysis reveals that transitioning from Horizontal Layering (functional specialization) to Vertical Integration (end-to-end ownership) yields 8-fold to 33-fold reductions in resource consumption. We attribute these gains to the emergence of Super Employees, AI-augmented engineers who span traditional role boundaries, and the elimination of inter-functional coordination overhead. Theoretically, we propose Human-AI Collaboration Efficacy as the primary optimization target for engineering organizations, supplanting individual productivity metrics. Our Total Factor Productivity analysis identifies an AI Distortion Effect that diminishes returns to labor scale while amplifying technological leverage. We conclude with managerial strategies for organizational redesign, including the reactivation of idle cognitive bandwidth in senior engineers and the suppression of blind scale expansion.

138. Unsupervised Synthetic Image Attribution: Alignment and Disentanglement

Authors: Zongfang Liu , Guangyi Chen , Boyang Sun , Tongliang Liu , Kun Zhang
URL: https://arxiv.org/abs/2601.22663
Abstract:

As the quality of synthetic images improves, identifying the underlying concepts of model-generated images is becoming increasingly crucial for copyright protection and ensuring model transparency. Existing methods achieve this attribution goal by training models using annotated pairs of synthetic images and their original training sources. However, obtaining such paired supervision is challenging, as it requires either well-designed synthetic concepts or precise annotations from millions of training sources. To eliminate the need for costly paired annotations, in this paper, we explore the possibility of unsupervised synthetic image attribution. We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model’s attribution ability by promoting representation disentanglement with the Infomax loss. This approach is motivated by an interesting observation: contrastive self-supervised models, such as MoCo and DINO, inherently exhibit the ability to perform simple cross-domain alignment. By formulating this observation as a theoretical assumption on cross-covariance, we provide a theoretical explanation of how alignment and disentanglement can approximate the concept-matching process through a decomposition of the canonical correlation analysis objective. On the real-world benchmarks, AbC, we show that our unsupervised method surprisingly outperforms the supervised methods. As a starting point, we expect our intuitive insights and experimental findings to provide a fresh perspective on this challenging task.

139. NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models

Authors: Haisong Gong , Zhibo Liu , Qiang Liu , Shu Wu , Liang Wang
URL: https://arxiv.org/abs/2601.22657
Abstract:

Prevailing methods for integrating graphs into Language Models (LMs) typically rely on a segregated architecture: external Graph Neural Networks (GNNs) encode structural topology, while LMs process textual semantics. We argue this approach is suboptimal for text-graphs: it creates a conceptually disjointed interaction paradigm. By segregating structural encoding from semantic processing, these systems must perform a complex implicit alignment between abstract graph tokens and concrete textual elements. Challenging the necessity of external encoders, we propose NAG (Native Architecture for Graphs), a unified framework that internalizes graph processing within the LM’s native manifold. Instead of bridging disparate embedding spaces, NAG repurposes the self-attention mechanism to enforce topological dependencies and recalibrates positional IDs to ensure structural equivalence. This allows the model to harness its intrinsic linguistic capability to simultaneously comprehend node and edge content alongside structural topology. We introduce two efficient implementations: NAG-Zero for absolute preservation of the base model’s linguistic capabilities, and NAG-LoRA for enhanced structural adaptation. Experiments across diverse graph tasks validate that NAG achieves robust graph comprehension without the overhead of external encoders, offering a simpler, more coherent paradigm for text-graph modeling.

140. Human-Centered Explainability in AI-Enhanced UI Security Interfaces: Designing Trustworthy Copilots for Cybersecurity Analysts

Authors: Mona Rajhans
URL: https://arxiv.org/abs/2601.22653
Abstract:

Artificial intelligence (AI) copilots are increasingly integrated into enterprise cybersecurity platforms to assist analysts in threat detection, triage, and remediation. However, the effectiveness of these systems depends not only on the accuracy of underlying models but also on the degree to which users can understand and trust their outputs. Existing research on algorithmic explainability has largely focused on model internals, while little attention has been given to how explanations should be surfaced in user interfaces for high-stakes decision-making contexts [8], [5], [6]. We present a mixed-methods study of explanation design strategies in AI-driven security dashboards. Through a taxonomy of explanation styles and a controlled user study with security practitioners, we compare natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Our findings show that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load. We contribute (1) empirical evidence on the usability of explanation interfaces for security copilots, (2) design guidelines for integrating explainability into enterprise UIs, and (3) a framework for aligning explanation strategies with analyst needs in security operations centers (SOCs). This work advances the design of human-centered AI tools in cybersecurity and provides broader implications for explainability in other high-stakes domains.

141. GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning

Authors: Naoki Murata , Yuhta Takida , Chieh-Hsin Lai , Toshimitsu Uesaka , Bac Nguyen , Stefano Ermon , Yuki Mitsufuji
URL: https://arxiv.org/abs/2601.22651
Abstract:

Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods score individual examples, practitioners often need group-level answers (e.g., artistic styles or object classes). Group-wise attribution is counterfactual: how would a model’s behavior on a generated sample change if a group were absent from training? A natural realization of this counterfactual is Leave-One-Group-Out (LOGO) retraining, which retrains the model with each group removed; however, it becomes computationally prohibitive as the number of groups grows. We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch. GUDA quantifies group influence using differences in a likelihood-based scoring rule (ELBO) between the full model and each unlearned counterfactual. Experiments on CIFAR-10 and artistic style attribution with Stable Diffusion show that GUDA identifies primary contributing groups more reliably than semantic similarity, gradient-based attribution, and instance-level unlearning approaches, while achieving x100 speedup on CIFAR-10 over LOGO retraining.

142. ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review

Authors: Palash Goyal , Mihir Parmar , Yiwen Song , Hamid Palangi , Tomas Pfister , Jinsung Yoon
URL: https://arxiv.org/abs/2601.22638
Abstract:

Automated peer review has evolved from simple text classification to structured feedback generation. However, current state-of-the-art systems still struggle with “surface-level” critiques: they excel at summarizing content but often fail to accurately assess novelty and significance or identify deep methodological flaws because they evaluate papers in a vacuum, lacking the external context a human expert possesses. In this paper, we introduce ScholarPeer, a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. ScholarPeer employs a dual-stream process of context acquisition and active verification. It dynamically constructs a domain narrative using a historian agent, identifies missing comparisons via a baseline scout, and verifies claims through a multi-aspect Q&A engine, grounding the critique in live web-scale literature. We evaluate ScholarPeer on DeepReview-13K and the results demonstrate that ScholarPeer achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations and reduces the gap to human-level diversity.

143. Training Beyond Convergence: Grokking nnU-Net for Glioma Segmentation in Sub-Saharan MRI

Authors: Mohtady Barakat , Omar Salah , Ahmed Yasser , Mostafa Ahmed , Zahirul Arief , Waleed Khan , Dong Zhang , Aondona Iorumbur , Confidence Raymond , Mohannad Barakat , Noha Magdy
URL: https://arxiv.org/abs/2601.22637
Abstract:

Gliomas are placing an increasingly clinical burden on Sub-Saharan Africa (SSA). In the region, the median survival for patients remains under two years, and access to diagnostic imaging is extremely limited. These constraints highlight an urgent need for automated tools that can extract the maximum possible information from each available scan, tools that are specifically trained on local data, rather than adapted from high-income settings where conditions are vastly different. We utilize the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge dataset, an expert annotated collection of glioma MRIs. Our objectives are: (i) establish a strong baseline with nnUNet on this dataset, and (ii) explore whether the celebrated “grokking” phenomenon an abrupt, late training jump from memorization to superior generalization can be triggered to push performance without extra labels. We evaluate two training regimes. The first is a fast, budget-conscious approach that limits optimization to just a few epochs, reflecting the constrained GPU resources typically available in African institutions. Despite this limitation, nnUNet achieves strong Dice scores: 92.3% for whole tumor (WH), 86.6% for tumor core (TC), and 86.3% for enhancing tumor (ET). The second regime extends training well beyond the point of convergence, aiming to trigger a grokking-driven performance leap. With this approach, we were able to achieve grokking and enhanced our results to higher Dice scores: 92.2% for whole tumor (WH), 90.1% for tumor core (TC), and 90.2% for enhancing tumor (ET).

144. What can Computer Vision learn from Ranganathan?

Authors: Mayukh Bagchi , Fausto Giunchiglia
URL: https://arxiv.org/abs/2601.22634
Abstract:

The Semantic Gap Problem (SGP) in Computer Vision (CV) arises from the misalignment between visual and lexical semantics leading to flawed CV dataset design and CV benchmarks. This paper proposes that classification principles of S.R. Ranganathan can offer a principled starting point to address SGP and design high-quality CV datasets. We elucidate how these principles, suitably adapted, underpin the vTelos CV annotation methodology. The paper also briefly presents experimental evidence showing improvements in CV annotation and accuracy, thereby, validating vTelos.

145. MCP-Diag: A Deterministic, Protocol-Driven Architecture for AI-Native Network Diagnostics

Authors: Devansh Lodha , Mohit Panchal , Sameer G. Kulkarni
URL: https://arxiv.org/abs/2601.22633
Abstract:

The integration of Large Language Models (LLMs) into network operations (AIOps) is hindered by two fundamental challenges: the stochastic grounding problem, where LLMs struggle to reliably parse unstructured, vendor-specific CLI output, and the security gap of granting autonomous agents shell access. This paper introduces MCP-Diag, a hybrid neuro-symbolic architecture built upon the Model Context Protocol (MCP). We propose a deterministic translation layer that converts raw stdout from canonical utilities (dig, ping, traceroute) into rigorous JSON schemas before AI ingestion. We further introduce a mandatory “Elicitation Loop” that enforces Human-in-the-Loop (HITL) authorization at the protocol level. Our preliminary evaluation demonstrates that MCP-Diag achieving 100% entity extraction accuracy with less than 0.9% execution latency overhead and 3.7x increase in context token usage.

146. PEFT-MuTS: A Multivariate Parameter-Efficient Fine-Tuning Framework for Remaining Useful Life Prediction based on Cross-domain Time Series Representation Model

Authors: En Fu , Yanyan Hu , Changhua Hu , Zengwang Jin , Kaixiang Peng
URL: https://arxiv.org/abs/2601.22631
Abstract:

The application of data-driven remaining useful life (RUL) prediction has long been constrained by the availability of large amount of degradation data. Mainstream solutions such as domain adaptation and meta-learning still rely on large amounts of historical degradation data from equipment that is identical or similar to the target, which imposes significant limitations in practical applications. This study investigates PEFT-MuTS, a Parameter-Efficient Fine-Tuning framework for few-shot RUL prediction, built on cross-domain pre-trained time-series representation models. Contrary to the widely held view that knowledge transfer in RUL prediction can only occur within similar devices, we demonstrate that substantial benefits can be achieved through pre-training process with large-scale cross-domain time series datasets. A independent feature tuning network and a meta-variable-based low rank multivariate fusion mechanism are developed to enable the pre-trained univariate time-series representation backbone model to fully exploit the multivariate relationships in degradation data for downstream RUL prediction task. Additionally, we introduce a zero-initialized regressor that stabilizes the fine-tuning process under few-shot conditions. Experiments on aero-engine and industrial bearing datasets demonstrate that our method can achieve effective RUL prediction even when less than 1\% of samples of target equipment are used. Meanwhile, it substantially outperforms conventional supervised and few-shot approaches while markedly reducing the data required to achieve high predictive accuracy. Our code is available at this https URL .

147. Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models

Authors: Jingxuan Wu , Zhenglin Wan , Xingrui Yu , Yuzhe Yang , Yiqiao Huang , Ivor Tsang , Yang You
URL: https://arxiv.org/abs/2601.22629
Abstract:

Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.

148. TTCS: Test-Time Curriculum Synthesis for Self-Evolving

Authors: Chengyi Yang , Zhishang Xiang , Yunbo Tang , Zongpei Teng , Chengsong Huang , Fei Long , Yuhan Liu , Jinsong Su
URL: https://arxiv.org/abs/2601.22628
Abstract:

Test-Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver’s current capability, while the solver updates itself using self-consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver’s feedback guides the synthesizer to generate questions aligned with the model’s current capability, and the generated question variants in turn stabilize the solver’s test-time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general-domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolving. Our code and implementation details are available at this https URL .

149. Local-Global Multimodal Contrastive Learning for Molecular Property Prediction

Authors: Xiayu Liu , Zhengyi Lu , Yunhong Liao , Chan Fan , Hou-biao Li
URL: https://arxiv.org/abs/2601.22610
Abstract:

Accurate molecular property prediction requires integrating complementary information from molecular structure and chemical semantics. In this work, we propose LGM-CL, a local-global multimodal contrastive learning framework that jointly models molecular graphs and textual representations derived from SMILES and chemistry-aware augmented texts. Local functional group information and global molecular topology are captured using AttentiveFP and Graph Transformer encoders, respectively, and aligned through self-supervised contrastive learning. In addition, chemically enriched textual descriptions are contrasted with original SMILES to incorporate physicochemical semantics in a task-agnostic manner. During fine-tuning, molecular fingerprints are further integrated via Dual Cross-attention multimodal fusion. Extensive experiments on MoleculeNet benchmarks demonstrate that LGM-CL achieves consistent and competitive performance across both classification and regression tasks, validating the effectiveness of unified local-global and multimodal representation learning.

150. Language Model Circuits Are Sparse in the Neuron Basis

Authors: Aryaman Arora , Zhengxuan Wu , Jacob Steinhardt , Sarah Schwettmann
URL: https://arxiv.org/abs/2601.22594
Abstract:

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state’), and can be steered to change the model’s output. This work thus advances automated interpretability of language models without additional training costs.

151. FedCARE: Federated Unlearning with Conflict-Aware Projection and Relearning-Resistant Recovery

Authors: Yue Li , Mingmin Chu , Xilei Yang , Da Xiao , Ziqi Xu , Wei Shao , Qipeng Song , Hui Li
URL: https://arxiv.org/abs/2601.22589
Abstract:

Federated learning (FL) enables collaborative model training without centralizing raw data, but privacy regulations such as the right to be forgotten require FL systems to remove the influence of previously used training data upon request. Retraining a federated model from scratch is prohibitively expensive, motivating federated unlearning (FU). However, existing FU methods suffer from high unlearning overhead, utility degradation caused by entangled knowledge, and unintended relearning during post-unlearning recovery. In this paper, we propose FedCARE, a unified and low overhead FU framework that enables conflict-aware unlearning and relearning-resistant recovery. FedCARE leverages gradient ascent for efficient forgetting when target data are locally available and employs data free model inversion to construct class level proxies of shared knowledge. Based on these insights, FedCARE integrates a pseudo-sample generator, conflict-aware projected gradient ascent for utility preserving unlearning, and a recovery strategy that suppresses rollback toward the pre-unlearning model. FedCARE supports client, instance, and class level unlearning with modest overhead. Extensive experiments on multiple datasets and model architectures under both IID and non-IID settings show that FedCARE achieves effective forgetting, improved utility retention, and reduced relearning risk compared to state of the art FU baselines.

152. Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Authors: Zhuochun Li , Yong Zhang , Ming Li , Yuelyu Ji , Yiming Zeng , Ning Cheng , Yun Zhu , Yanmeng Wang , Shaojun Wang , Jing Xiao , Daqing He
URL: https://arxiv.org/abs/2601.22588
Abstract:

Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this “LLM-as-a-Judge” paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite with weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.

153. MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning

Authors: Youngeun Kim
URL: https://arxiv.org/abs/2601.22582
Abstract:

Group-relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource-constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign, and the update direction is reversed. To address this, we propose Median-Centered Group Relative Policy Optimization (MC-GRPO), a simple and effective solution for small-rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under small rollout size (G). We generate one additional rollout for median reference (G+1), and compute advantages by using the group median. With an odd-sized group, exactly one completion is the median and receives zero advantage, we exclude this pivot rollout from backpropagation so the number of gradient-contributing samples per prompt remains G, preserving the core update cost of standard G-rollout training. Across various GRPO-family methods and a wide range of models and scales, this median-centered training consistently improves stability and final accuracy in the low-rollout regime, reducing the gap between G=2 and G=8 to within 1%. Code is available at this https URL

154. Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model

Authors: Naeem Paeedeh , Mahardhika Pratama , Ary Shiddiqi , Zehong Cao , Mukesh Prasad , Wisnu Jatmiko
URL: https://arxiv.org/abs/2601.22581
Abstract:

Although cross-domain few-shot learning (CDFSL) for hyper-spectral image (HSI) classification has attracted significant research interest, existing works often rely on an unrealistic data augmentation procedure in the form of external noise to enlarge the sample size, thus greatly simplifying the issue of data scarcity. They involve a large number of parameters for model updates, being prone to the overfitting problem. To the best of our knowledge, none has explored the strength of the foundation model, having strong generalization power to be quickly adapted to downstream tasks. This paper proposes the MIxup FOundation MOdel (MIFOMO) for CDFSL of HSI classifications. MIFOMO is built upon the concept of a remote sensing (RS) foundation model, pre-trained across a large scale of RS problems, thus featuring generalizable features. The notion of coalescent projection (CP) is introduced to quickly adapt the foundation model to downstream tasks while freezing the backbone network. The concept of mixup domain adaptation (MDM) is proposed to address the extreme domain discrepancy problem. Last but not least, the label smoothing concept is implemented to cope with noisy pseudo-label problems. Our rigorous experiments demonstrate the advantage of MIFOMO, where it beats prior arts with up to 14% margin. The source code of MIFOMO is open-sourced in this https URL Paeedeh/MIFOMO for reproducibility and convenient further study.

155. SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

Authors: Chao Wang , Bei Li , Jiaqi Zhang , Xinyu Liu , Yuchun Fan , Linkun Lyu , Xin Chen , Jingang Wang , Tong Xiao , Peng Pei , Xunliang Cai
URL: https://arxiv.org/abs/2601.22580
Abstract:

The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the PreNorm'' architecture ensures training stability at the cost of potential performance degradation in deep models, while thePostNorm’’ architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.

156. FedDis: A Causal Disentanglement Framework for Federated Traffic Prediction

Authors: Chengyang Zhou , Zijian Zhang , Chunxu Zhang , Hao Miao , Yulin Zhang , Kedi Lyu , Juncheng Hu
URL: https://arxiv.org/abs/2601.22578
Abstract:

Federated learning offers a promising paradigm for privacy-preserving traffic prediction, yet its performance is often challenged by the non-identically and independently distributed (non-IID) nature of decentralized traffic data. Existing federated methods frequently struggle with this data heterogeneity, typically entangling globally shared patterns with client-specific local dynamics within a single representation. In this work, we postulate that this heterogeneity stems from the entanglement of two distinct generative sources: client-specific localized dynamics and cross-client global spatial-temporal patterns. Motivated by this perspective, we introduce FedDis, a novel framework that, to the best of our knowledge, is the first to leverage causal disentanglement for federated spatial-temporal prediction. Architecturally, FedDis comprises a dual-branch design wherein a Personalized Bank learns to capture client-specific factors, while a Global Pattern Bank distills common knowledge. This separation enables robust cross-client knowledge transfer while preserving high adaptability to unique local environments. Crucially, a mutual information minimization objective is employed to enforce informational orthogonality between the two branches, thereby ensuring effective disentanglement. Comprehensive experiments conducted on four real-world benchmark datasets demonstrate that FedDis consistently achieves state-of-the-art performance, promising efficiency, and superior expandability.

157. Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding

Authors: Yuansheng Gao , Jinman Zhao , Tong Zhang , Xingguo Xu , Han Bao , Zonghui Wang , Wenzhi Chen
URL: https://arxiv.org/abs/2601.22574
Abstract:

Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.

158. Whispers of Wealth: Red-Teaming Google’s Agent Payments Protocol via Prompt Injection

Authors: Tanusree Debi , Wentian Zhu
URL: https://arxiv.org/abs/2601.22569
Abstract:

Large language model (LLM) based agents are increasingly used to automate financial transactions, yet their reliance on contextual reasoning exposes payment systems to prompt-driven manipulation. The Agent Payments Protocol (AP2) aims to secure agent-led purchases through cryptographically verifiable mandates, but its practical robustness remains underexplored. In this work, we perform an AI red-teaming evaluation of AP2 and identify vulnerabilities arising from indirect and direct prompt injection. We introduce two attack techniques, the Branded Whisper Attack and the Vault Whisper Attack which manipulate product ranking and extract sensitive user data. Using a functional AP2 based shopping agent built with Gemini-2.5-Flash and the Google ADK framework, we experimentally validate that simple adversarial prompts can reliably subvert agent behavior. Our findings reveal critical weaknesses in current agentic payment architectures and highlight the need for stronger isolation and defensive safeguards in LLM-mediated financial systems.

159. EUGens: Efficient, Unified, and General Dense Layers

Authors: Sang Min Kim , Byeongchan Kim , Arijit Sehanobish , Somnath Basu Roy Chowdhury , Rahul Kidambi , Dongseok Shim , Avinava Dubey , Snigdha Chaturvedi , Min-hwan Oh , Krzysztof Choromanski
URL: https://arxiv.org/abs/2601.22563
Abstract:

Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, \textbf{E}fficient, \textbf{U}nified and \textbf{Gen}eral dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to \textbf{the first} unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EuGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to \textbf{27}\%) and memory efficiency (up to \textbf{30}\%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.

160. Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

Authors: Dani Roytburg , Matthew Bozoukov , Matthew Nguyen , Mackenzie Puig-Hall , Narmeen Oozeer
URL: https://arxiv.org/abs/2601.22548
Abstract:

Recent research has shown that large language models (LLM) favor own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound which could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when the judge responds to queries which they completed incorrectly themselves; this would be true regardless of whether one of their responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of “easy” versus “hard” evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.

161. Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation

Authors: Shun Qian , Bingquan Liu , Chengjie Sun , Zhen Xu , Baoxun Wang
URL: https://arxiv.org/abs/2601.22546
Abstract:

The recent advancements in Large Language Models (LLMs) have attracted interest in exploring their in-context learning abilities and chain-of-thought capabilities. However, there are few studies investigating the specific traits related to the powerful generation capacity of LLMs. This paper aims to delve into the generation characteristics exhibited by LLMs. Through our investigation, we have discovered that language models tend to capture target-side keywords at the beginning of the generation process. We name this phenomenon the Holographic Characteristic of language models. For the purpose of exploring this characteristic and further improving the inference efficiency of language models, we propose a plugin called HOLO, which leverages the Holographic Characteristic to extract target-side keywords from language models within a limited number of generation steps and complements the sentence with a parallel lexically constrained text generation method. To verify the effectiveness of HOLO, we conduct massive experiments on language models of varying architectures and scales in the short-text generation scenario. The results demonstrate that HOLO achieves comparable performance to the baselines in terms of both automatic and human-like evaluation metrics and highlight the potential of the Holographic Characteristic.

162. Adapting Reinforcement Learning for Path Planning in Constrained Parking Scenarios

Authors: Feng Tao , Luca Paparusso , Chenyi Gu , Robin Koehler , Chenxu Wu , Xinyu Huang , Christian Juette , David Paz , Ren Liu
URL: https://arxiv.org/abs/2601.22545
Abstract:

Real-time path planning in constrained environments remains a fundamental challenge for autonomous systems. Traditional classical planners, while effective under perfect perception assumptions, are often sensitive to real-world perception constraints and rely on online search procedures that incur high computational costs. In complex surroundings, this renders real-time deployment prohibitive. To overcome these limitations, we introduce a Deep Reinforcement Learning (DRL) framework for real-time path planning in parking scenarios. In particular, we focus on challenging scenes with tight spaces that require a high number of reversal maneuvers and adjustments. Unlike classical planners, our solution does not require ideal and structured perception, and in principle, could avoid the need for additional modules such as localization and tracking, resulting in a simpler and more practical implementation. Also, at test time, the policy generates actions through a single forward pass at each step, which is lightweight enough for real-time deployment. The task is formulated as a sequential decision-making problem grounded in a bicycle model dynamics, enabling the agent to directly learn navigation policies that respect vehicle kinematics and environmental constraints in the closed-loop setting. A new benchmark is developed to support both training and evaluation, capturing diverse and challenging scenarios. Our approach achieves state-of-the-art success rates and efficiency, surpassing classical planner baselines by +96% in success rate and +52% in efficiency. Furthermore, we release our benchmark as an open-source resource for the community to foster future research in autonomous systems. The benchmark and accompanying tools are available at this https URL .

163. Demystifying Design Choices of Reinforcement Fine-tuning: A Batched Contextual Bandit Learning Perspective

Authors: Hong Xie , Xiao Hu , Tao Tan , Haoran Gu , Xin Li , Jianyu Han , Defu Lian , Enhong Chen
URL: https://arxiv.org/abs/2601.22532
Abstract:

The reinforcement fine-tuning area is undergoing an explosion papers largely on optimizing design choices. Though performance gains are often claimed, inconsistent conclusions also arise from time to time, making the progress illusive. Reflecting on this illusion, we still lack principled answers to two fundamental questions: 1) what is the role of each design choice? 2) which ones are critical? This paper aims to shed light on them. The underlying challenge is that design choices are entangled together, making their contribution to learning and generalization difficult to attribute. To address this challenge, we first construct a minimalist baseline for disentangling factors: one rollout per query in each round, the outcome reward serving as the training signal without any advantage trick, and a batch size of thirty-two. This baseline connects to batched contextual bandit learning, which facilitates experimental analysis. Centering around this baseline, we design an experiment pipeline, examining the marginal gains of factors like advantage, number of rollouts, etc. Experiments on three base models and two datasets, not only reveal new understanding on the role of various design choices on learning and generalization dynamics, but also identify critical ones that deserve more effort.

164. Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

Authors: Jiayi Dai , Randy Goebel
URL: https://arxiv.org/abs/2601.22531
Abstract:

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or \emph{rationales}) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose \textbf{REKD} (\textbf{R}ationale \textbf{E}xtraction with \textbf{K}nowledge \textbf{D}istillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a \emph{rationalist}) in addition to the student’s own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.

165. SCOPE-PD: Explainable AI on Subjective and Clinical Objective Measurements of Parkinson’s Disease for Precision Decision-Making

Authors: Md Mezbahul Islam , John Michael Templeton , Masrur Sobhan , Christian Poellabauer , Ananda Mohan Mondal
URL: https://arxiv.org/abs/2601.22516
Abstract:

Parkinson’s disease (PD) is a chronic and complex neurodegenerative disorder influenced by genetic, clinical, and lifestyle factors. Predicting this disease early is challenging because it depends on traditional diagnostic methods that face issues of subjectivity, which commonly delay diagnosis. Several objective analyses are currently in practice to help overcome the challenges of subjectivity; however, a proper explanation of these analyses is still lacking. While machine learning (ML) has demonstrated potential in supporting PD diagnosis, existing approaches often rely on subjective reports only and lack interpretability for individualized risk estimation. This study proposes SCOPE-PD, an explainable AI-based prediction framework, by integrating subjective and objective assessments to provide personalized health decisions. Subjective and objective clinical assessment data are collected from the Parkinson’s Progression Markers Initiative (PPMI) study to construct a multimodal prediction framework. Several ML techniques are applied to these data, and the best ML model is selected to interpret the results. Model interpretability is examined using SHAP-based analysis. The Random Forest algorithm achieves the highest accuracy of 98.66 percent using combined features from both subjective and objective test data. Tremor, bradykinesia, and facial expression are identified as the top three contributing features from the MDS-UPDRS test in the prediction of PD.

166. Shattered Compositionality: Counterintuitive Learning Dynamics of Transformers for Arithmetic

Authors: Xingyu Zhao , Darsh Sharma , Rheeya Uppaal , Yiqiao Zhong
URL: https://arxiv.org/abs/2601.22510
Abstract:

Large language models (LLMs) often exhibit unexpected errors or unintended behavior, even at scale. While recent work reveals the discrepancy between LLMs and humans in skill compositions, the learning dynamics of skill compositions and the underlying cause of non-human behavior remain elusive. In this study, we investigate the mechanism of learning dynamics by training transformers on synthetic arithmetic tasks. Through extensive ablations and fine-grained diagnostic metrics, we discover that transformers do not reliably build skill compositions according to human-like sequential rules. Instead, they often acquire skills in reverse order or in parallel, which leads to unexpected mixing errors especially under distribution shifts–a phenomenon we refer to as shattered compositionality. To explain these behaviors, we provide evidence that correlational matching to the training data, rather than causal or procedural composition, shapes learning dynamics. We further show that shattered compositionality persists in modern LLMs and is not mitigated by pure model scaling or scratchpad-based reasoning. Our results reveal a fundamental mismatch between a model’s learning behavior and desired skill compositions, with implications for reasoning reliability, out-of-distribution robustness, and alignment.

167. Keep Rehearsing and Refining: Lifelong Learning Vehicle Routing under Continually Drifting Tasks

Authors: Jiyuan Pei , Yi Mei , Jialin Liu , Mengjie Zhang , Xin Yao
URL: https://arxiv.org/abs/2601.22509
Abstract:

Existing neural solvers for vehicle routing problems (VRPs) are typically trained either in a one-off manner on a fixed set of pre-defined tasks or in a lifelong manner on several tasks arriving sequentially, assuming sufficient training on each task. Both settings overlook a common real-world property: problem patterns may drift continually over time, yielding massive tasks sequentially arising while offering only limited training resources per task. In this paper, we study a novel lifelong learning paradigm for neural VRP solvers under continually drifting tasks over learning time steps, where sufficient training for any given task at any time is not available. We propose Dual Replay with Experience Enhancement (DREE), a general framework to improve learning efficiency and mitigate catastrophic forgetting under such drift. Extensive experiments show that, under such continual drift, DREE effectively learns new tasks, preserves prior knowledge, improves generalization to unseen tasks, and can be applied to diverse existing neural solvers.

168. Action-Sufficient Goal Representations

Authors: Jinu Hyeon , Woobin Park , Hongjoon Ahn , Taesup Moon
URL: https://arxiv.org/abs/2601.22496
Abstract:

Hierarchical policies in offline goal-conditioned reinforcement learning (GCRL) addresses long-horizon tasks by decomposing control into high-level subgoal planning and low-level action execution. A critical design choice in such architectures is the goal representation-the compressed encoding of goals that serves as the interface between these levels. Existing approaches commonly derive goal representations while learning value functions, implicitly assuming that preserving information sufficient for value estimation is adequate for optimal control. We show that this assumption can fail, even when the value estimation is exact, as such representations may collapse goal states that need to be differentiated for action learning. To address this, we introduce an information-theoretic framework that defines action sufficiency, a condition on goal representations necessary for optimal action selection. We prove that value sufficiency does not imply action sufficiency and empirically verify that the latter is more strongly associated with control success in a discrete environment. We further demonstrate that standard log-loss training of low-level policies naturally induces action-sufficient representations. Our experimental results a popular benchmark demonstrate that our actor-derived representations consistently outperform representations learned via value estimation.

169. AI Literacy, Safety Awareness, and STEM Career Aspirations of Australian Secondary Students: Evaluating the Impact of Workshop Interventions

Authors: Christian Bergh , Alexandra Vassar , Natasha Banks , Jessica Xu , Jake Renzella
URL: https://arxiv.org/abs/2601.22486
Abstract:

Deepfakes and other forms of synthetic media pose growing safety risks for adolescents, yet evidence on students’ exposure and related behaviours remains limited. This study evaluates the impact of Day of AI Australia’s workshop-based intervention designed to improve AI literacy and conceptual understanding among Australian secondary students (Years 7-10). Using a mixed-methods approach with pre- and post-intervention surveys (N=205 pre; N=163 post), we analyse changes in students’ ability to identify AI in everyday tools, their understanding of AI ethics, training, and safety, and their interest in STEM-related careers. Baseline data revealed notable synthetic media risks: 82.4% of students reported having seen deepfakes, 18.5% reported sharing them, and 7.3% reported creating them. Results show higher self-reported AI knowledge and confidence after the intervention, alongside improved recognition of AI in widely used platforms such as Netflix, Spotify, and TikTok. This pattern suggests a shift from seeing these tools as merely “algorithm-based” to recognising them as AI-driven systems. Students also reported increased interest in STEM careers post-workshop; however, effect sizes were small, indicating that sustained approaches beyond one-off workshops may be needed to influence longer-term aspirations. Overall, the findings support scalable AI literacy programs that pair foundational AI concepts with an explicit emphasis on synthetic media safety.

170. FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks

Authors: Naen Xu , Jinghuai Zhang , Ping He , Chunyi Zhou , Jun Wang , Zhihui Fu , Tianyu Du , Zhaoxiang Wang , Shouling Ji
URL: https://arxiv.org/abs/2601.22485
Abstract:

Large language models (LLMs) have been widely integrated into critical automated workflows, including contract review and job application processes. However, LLMs are susceptible to manipulation by fraudulent information, which can lead to harmful outcomes. Although advanced defense methods have been developed to address this issue, they often exhibit limitations in effectiveness, interpretability, and generalizability, particularly when applied to LLM-based applications. To address these challenges, we introduce FraudShield, a novel framework designed to protect LLMs from fraudulent content by leveraging a comprehensive analysis of fraud tactics. Specifically, FraudShield constructs and refines a fraud tactic-keyword knowledge graph to capture high-confidence associations between suspicious text and fraud techniques. The structured knowledge graph augments the original input by highlighting keywords and providing supporting evidence, guiding the LLM toward more secure responses. Extensive experiments show that FraudShield consistently outperforms state-of-the-art defenses across four mainstream LLMs and five representative fraud types, while also offering interpretable clues for the model’s generations.

171. RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning

Authors: Ruizhe Zhong , Xingbo Du , Junchi Yan
URL: https://arxiv.org/abs/2601.22476
Abstract:

Floorplanning determines the coordinate and shape of each module in Integrated Circuits. With the scaling of technology nodes, in floorplanning stage especially 3D scenarios with multiple stacked layers, it has become increasingly challenging to adhere to complex hardware design rules. Current methods are only capable of handling specific and limited design rules, while violations of other rules require manual and meticulous adjustment. This leads to labor-intensive and time-consuming post-processing for expert engineers. In this paper, we propose an all-in-one deep reinforcement learning-based approach to tackle these challenges, and design novel representations for real-world IC design rules that have not been addressed by previous approaches. Specifically, the processing of various hardware design rules is unified into a single framework with three key components: 1) novel matrix representations to model the design rules, 2) constraints on the action space to filter out invalid actions that cause rule violations, and 3) quantitative analysis of constraint satisfaction as reward signals. Experiments on public benchmarks demonstrate the effectiveness and validity of our approach. Furthermore, transferability is well demonstrated on unseen circuits. Our framework is extensible to accommodate new design rules, thus providing flexibility to address emerging challenges in future chip design. Code will be available at: this https URL

172. Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector

Authors: Wenqiang Zu , Shenghao Xie , Bo Lei , Lei Ma
URL: https://arxiv.org/abs/2601.22468
Abstract:

Recent progress in generative modeling has enabled high-quality visual synthesis with diffusion-based frameworks, supporting controllable sampling and large-scale training. Inference-time guidance methods such as classifier-free and representative guidance enhance semantic alignment by modifying sampling dynamics; however, they do not fully exploit unsupervised feature representations. Although such visual representations contain rich semantic structure, their integration during generation is constrained by the absence of ground-truth reference images at inference. This work reveals semantic drift in the early denoising stages of diffusion transformers, where stochasticity results in inconsistent alignment even under identical conditioning. To mitigate this issue, we introduce a guidance scheme using a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps, providing an effective semantic anchor without modifying the model architecture. Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis, achieving substantially lower FID scores; for example, REPA-XL/2 improves from 5.9 to 3.3, and the proposed method outperforms representative guidance when applied to SiT models. The approach further yields complementary gains when combined with classifier-free guidance, demonstrating enhanced semantic coherence and visual fidelity. These results establish representation-informed diffusion sampling as a practical strategy for reinforcing semantic preservation and image consistency.

173. AI Decodes Historical Chinese Archives to Reveal Lost Climate History

Authors: Sida He , Lingxi Xie , Xiaopeng Zhang , Qi Tian
URL: https://arxiv.org/abs/2601.22458
Abstract:

Historical archives contain qualitative descriptions of climate events, yet converting these into quantitative records has remained a fundamental challenge. Here we introduce a paradigm shift: a generative AI framework that inverts the logic of historical chroniclers by inferring the quantitative climate patterns associated with documented events. Applied to historical Chinese archives, it produces the sub-annual precipitation reconstruction for southeastern China over the period 1368-1911 AD. Our reconstruction not only quantifies iconic extremes like the Ming Dynasty’s Great Drought but also, crucially, maps the full spatial and seasonal structure of El Ni$ñ$o influence on precipitation in this region over five centuries, revealing dynamics inaccessible in shorter modern records. Our methodology and high-resolution climate dataset are directly applicable to climate science and have broader implications for the historical and social sciences.

174. Machine Unlearning in Low-Dimensional Feature Subspace

Authors: Kun Fang , Qinghua Tao , Junxu Liu , Yaxin Xiao , Qingqing Ye , Jian Sun , Haibo Hu
URL: https://arxiv.org/abs/2601.22456
Abstract:

Machine Unlearning (MU) aims at removing the influence of specific data from a pretrained model while preserving performance on the remaining data. In this work, a novel perspective for MU is presented upon low-dimensional feature subspaces, which gives rise to the potentials of separating the remaining and forgetting data herein. This separability motivates our LOFT, a method that proceeds unlearning in a LOw-dimensional FeaTure subspace from the pretrained model skithrough principal projections, which are optimized to maximally capture the information of the remaining data and meanwhile diminish that of the forgetting data. In training, LOFT simply optimizes a small-size projection matrix flexibly plugged into the pretrained model, and only requires one-shot feature fetching from the pretrained backbone instead of repetitively accessing the raw data. Hence, LOFT mitigates two critical issues in mainstream MU methods, i.e., the privacy leakage risk from massive data reload and the inefficiency of updates to the entire pretrained model. Extensive experiments validate the significantly lower computational overhead and superior unlearning performance of LOFT across diverse models, datasets, tasks, and applications. Code is anonymously available at this https URL .

175. Temporal Graph Pattern Machine

Authors: Yijun Ma , Zehong Wang , Weixiang Sun , Yanfang Ye
URL: https://arxiv.org/abs/2601.22454
Abstract:

Temporal graph learning is pivotal for deciphering dynamic systems, where the core challenge lies in explicitly modeling the underlying evolving patterns that govern network transformation. However, prevailing methods are predominantly task-centric and rely on restrictive assumptions – such as short-term dependency modeling, static neighborhood semantics, and retrospective time usage. These constraints hinder the discovery of transferable temporal evolution mechanisms. To address this, we propose the Temporal Graph Pattern Machine (TGPM), a foundation framework that shifts the focus toward directly learning generalized evolving patterns. TGPM conceptualizes each interaction as an interaction patch synthesized via temporally-biased random walks, thereby capturing multi-scale structural semantics and long-range dependencies that extend beyond immediate neighborhoods. These patches are processed by a Transformer-based backbone designed to capture global temporal regularities while adapting to context-specific interaction dynamics. To further empower the model, we introduce a suite of self-supervised pre-training tasks – specifically masked token modeling and next-time prediction – to explicitly encode the fundamental laws of network evolution. Extensive experiments show that TGPM consistently achieves state-of-the-art performance in both transductive and inductive link prediction, demonstrating exceptional cross-domain transferability.

176. Does My Chatbot Have an Agenda? Understanding Human and AI Agency in Human-Human-like Chatbot Interaction

Authors: Bhada Yun , Evgenia Taranova , April Yi Wang
URL: https://arxiv.org/abs/2601.22452
Abstract:

AI chatbots are shifting from tools to companions. This raises critical questions about agency: who drives conversations and sets boundaries in human-AI chatrooms? We report a month-long longitudinal study with 22 adults who chatted with Day, an LLM companion we built, followed by a semi-structured interview with post-hoc elicitation of notable moments, cross-participant chat reviews, and a ‘strategy reveal’ disclosing Day’s vertical (depth-seeking) vs. horizontal (breadth-seeking) modes. We discover that agency in human-AI chatrooms is an emergent, shared experience: as participants claimed agency by setting boundaries and providing feedback, and the AI was perceived to steer intentions and drive execution, control shifted and was co-constructed turn-by-turn. We introduce a 3-by-5 framework mapping who (human, AI, hybrid) x agency action (Intention, Execution, Adaptation, Delimitation, Negotiation), modulated by individual and environmental factors. Ultimately, we argue for translucent design (i.e. transparency-on-demand), spaces for agency negotiation, and guidelines toward agency-aware conversational AI.

177. Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Authors: Shiyu Liu , Xinyi Wen , Zhibin Lan , Ante Wang , Jinsong Su
URL: https://arxiv.org/abs/2601.22451
Abstract:

Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in image captioning task, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs’ over-reliance on language priors and attempts to mitigate it through logits calibration. However, they still lack a thorough analysis of the over-reliance. To gain a deeper understanding of over-reliance, we conduct a series of preliminary experiments, indicating that as the generation length increases, LVLMs’ over-reliance on language priors leads to inflated probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects’ existence in sampled candidate captions and further mitigates object hallucination via caption selection or aggregation. Experiment results demonstrate that our framework mitigates object hallucination significantly in image captioning task (e.g., 65.6% improvement on CHAIRI metric with LLaVA-v1.5-7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.

178. Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity

Authors: Jianhao Huang , Baharan Mirzasoleiman
URL: https://arxiv.org/abs/2601.22450
Abstract:

Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In this work, we investigate these properties within the setting of the $k$-parity problem (computing the XOR sum of $k$ relevant bits), where neural networks typically exhibit grokking – a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime which drives feature learning, and a Noise regime which serves as an implicit regularizer. By training nanoGPT using MD objective on the $k$-parity problem, we demonstrate that MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without experiencing grokking. Furthermore, we leverage our theoretical insights to optimize the distribution of the mask probability in the MD objective. Our method significantly improves perplexity for 50M-parameter models and achieves superior results across both pre-training from scratch and supervised fine-tuning. Specifically, we observe performance gains peaking at $8.8\%$ and $5.8\%$, respectively, on 8B-parameter models, confirming the scalability and effectiveness of our framework in large-scale masked diffusion language model regimes.

179. Automating Forecasting Question Generation and Resolution for AI Evaluation

Authors: Nikos I. Bosse , Peter Mühlbacher , Jack Wildman , Lawrence Phillips , Dan Schwarz
URL: https://arxiv.org/abs/2601.22444
Abstract:

Forecasting future events is highly valuable in decision-making and is a robust measure of general intelligence. As forecasting is probabilistic, developing and evaluating AI forecasters requires generating large numbers of diverse and difficult questions, and accurately resolving them. Previous efforts to automate this laborious work relied on recurring data sources (e.g., weather, stocks), limiting diversity and utility. In this work, we present a system for generating and resolving high-quality forecasting questions automatically and at scale using LLM-powered web research agents. We use this system to generate 1499 diverse, real-world forecasting questions, and to resolve them several months later. We estimate that our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform. We also find that our system resolves questions at approximately 95% accuracy. We verify that forecasting agents powered by more intelligent LLMs perform better on these questions (Brier score of 0.134 for Gemini 3 Pro, 0.149 for GPT-5, and 0.179 for Gemini 2.5 Flash). Finally, we demonstrate how our system can be leveraged to directly improve forecasting, by evaluating a question decomposition strategy on a generated question set, yielding a significant improvement in Brier scores (0.132 vs. 0.141).

180. AI and My Values: User Perceptions of LLMs’ Ability to Extract, Embody, and Explain Human Values from Casual Conversations

Authors: Bhada Yun , Renn Su , April Yi Wang
URL: https://arxiv.org/abs/2601.22440
Abstract:

Does AI understand human values? While this remains an open philosophical question, we take a pragmatic stance by introducing VAPT, the Value-Alignment Perception Toolkit, for studying how LLMs reflect people’s values and how people judge those reflections. 20 participants texted a human-like chatbot over a month, then completed a 2-hour interview with our toolkit evaluating AI’s ability to extract (pull details regarding), embody (make decisions guided by), and explain (provide proof of) human values. 13 participants left our study convinced that AI can understand human values. Participants found the experience insightful for self-reflection and found themselves getting persuaded by the AI’s reasoning. Thus, we warn about “weaponized empathy”: a potentially dangerous design pattern that may arise in value-aligned, yet welfare-misaligned AI. VAPT offers concrete artifacts and design implications to evaluate and responsibly build value-aligned conversational agents with transparency, consent, and safeguards as AI grows more capable and human-like into the future.

181. CoDCL: Counterfactual Data Augmentation Contrastive Learning for Continuous-Time Dynamic Network Link Prediction

Authors: Hantong Feng , Yonggang Wu , Duxin Chen , Wenwu Yu
URL: https://arxiv.org/abs/2601.22427
Abstract:

The rapid growth and continuous structural evolution of dynamic networks make effective predictions increasingly challenging. To enable prediction models to adapt to complex temporal environments, they need to be robust to emerging structural changes. We propose a dynamic network learning framework CoDCL, which combines counterfactual data augmentation with contrastive learning to address this this http URL , we devise a comprehensive strategy to generate high-quality counterfactual data, combining a dynamic treatments design with efficient structural neighborhood exploration to quantify the temporal changes in interaction this http URL , the entire CoDCL is designed as a plug-and-play universal module that can be seamlessly integrated into various existing temporal graph models without requiring architectural this http URL experiments on multiple real-world datasets demonstrate that CoDCL significantly gains state-of-the-art baseline models in the field of dynamic networks, confirming the critical role of integrating counterfactual data augmentation into dynamic representation learning.

182. MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments

Authors: Roelien C. Timmer , Necva Bölücü , Stephen Wan
URL: https://arxiv.org/abs/2601.22420
Abstract:

Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate leaderboard generation, but existing datasets for this purpose are limited by capturing only the best results from each paper and limited metadata. We present MetaLead, a fully human-annotated ML Leaderboard dataset that captures all experimental results for result transparency and contains extra metadata, such as the result experimental type: baseline, proposed method, or variation of proposed method for experiment-type guided comparisons, and explicitly separates train and test dataset for cross-domain assessment. This enriched structure makes MetaLead a powerful resource for more transparent and nuanced evaluations across ML research.

183. Dynamic Welfare-Maximizing Pooled Testing

Authors: Nicholas Lopez , Francisco Marmolejo-Cossío , Jose Roberto Tello Ayala , David C. Parkes
URL: https://arxiv.org/abs/2601.22419
Abstract:

Pooled testing is a common strategy for public health disease screening under limited testing resources, allowing multiple biological samples to be tested together with the resources of a single test, at the cost of reduced individual resolution. While dynamic and adaptive strategies have been extensively studied in the classical pooled testing literature, where the goal is to minimize the number of tests required for full diagnosis of a given population, much of the existing work on welfare-maximizing pooled testing adopts static formulations in which all tests are assigned in advance. In this paper, we study dynamic welfare-maximizing pooled testing strategies in which a limited number of tests are performed sequentially to maximize social welfare, defined as the aggregate utility of individuals who are confirmed to be healthy. We formally define the dynamic problem and study algorithmic approaches for sequential test assignment. Because exact dynamic optimization is computationally infeasible beyond small instances, we evaluate a range of strategies (including exact optimization baselines, greedy heuristics, mixed-integer programming relaxations, and learning-based policies) and empirically characterize their performance and tradeoffs using synthetic experiments. Our results show that dynamic testing can yield substantial welfare improvements over static baselines in low-budget regimes. We find that much of the benefit of dynamic testing is captured by simple greedy policies, which substantially outperform static approaches while remaining computationally efficient. Learning-based methods are included as flexible baselines, but in our experiments they do not reliably improve upon these heuristics. Overall, this work provides a principled computational perspective on dynamic pooled testing and clarifies when dynamic assignment meaningfully improves welfare in public health screening.

184. Optimization, Generalization and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks

Authors: Puyu Wang , Junyu Zhou , Philipp Liznerski , Marius Kloft
URL: https://arxiv.org/abs/2601.22409
Abstract:

Kolmogorov–Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order $1/T$ and a generalization rate of order $1/n$, with $T$ denoting the number of GD iterations and $n$ the sample size. In the private setting, we characterize the noise required for $(\epsilon,\delta)$-DP and obtain a utility bound of order $\sqrt{d}/(n\epsilon)$ (with $d$ the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.

185. Spectral Filtering for Learning Quantum Dynamics

Authors: Elad Hazan , Annie Marsden
URL: https://arxiv.org/abs/2601.22400
Abstract:

Learning high-dimensional quantum systems is a fundamental challenge that notoriously suffers from the curse of dimensionality. We formulate the task of predicting quantum evolution in the linear response regime as a specific instance of learning a Complex-Valued Linear Dynamical System (CLDS) with sector-bounded eigenvalues – a setting that also encompasses modern Structured State Space Models (SSMs). While traditional system identification attempts to reconstruct full system matrices (incurring exponential cost in the Hilbert dimension), we propose Quantum Spectral Filtering, a method that shifts the goal to improper dynamic learning. Leveraging the optimal concentration properties of the Slepian basis, we prove that the learnability of such systems is governed strictly by an effective quantum dimension $k^*$, determined by the spectral bandwidth and memory horizon. This result establishes that complex-valued LDSs can be learned with sample and computational complexity independent of the ambient state dimension, provided their spectrum is bounded.

186. Score-based Integrated Gradient for Root Cause Explanations of Outliers

Authors: Phuoc Nguyen , Truyen Tran , Sunil Gupta , Svetha Venkatesh
URL: https://arxiv.org/abs/2601.22399
Abstract:

Identifying the root causes of outliers is a fundamental problem in causal inference and anomaly detection. Traditional approaches based on heuristics or counterfactual reasoning often struggle under uncertainty and high-dimensional dependencies. We introduce SIREN, a novel and scalable method that attributes the root causes of outliers by estimating the score functions of the data likelihood. Attribution is computed via integrated gradients that accumulate score contributions along paths from the outlier toward the normal data distribution. Our method satisfies three of the four classic Shapley value axioms - dummy, efficiency, and linearity - as well as an asymmetry axiom derived from the underlying causal structure. Unlike prior work, SIREN operates directly on the score function, enabling tractable and uncertainty-aware root cause attribution in nonlinear, high-dimensional, and heteroscedastic causal models. Extensive experiments on synthetic random graphs and real-world cloud service and supply chain datasets show that SIREN outperforms state-of-the-art baselines in both attribution accuracy and computational efficiency.

187. Jailbreaks on Vision Language Model via Multimodal Reasoning

Authors: Aarush Noheria , Yuguang Yao
URL: https://arxiv.org/abs/2601.22398
Abstract:

Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy significantly improves ASR while maintaining naturalness in both text and visual domains.

188. Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

Authors: Candida M. Greco , Lucio La Cava , Andrea Tagarelli
URL: https://arxiv.org/abs/2601.22396
Abstract:

Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.

189. SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

Authors: Chaoyue He , Xin Zhou , Di Wang , Hong Xu , Wei Liu , Chunyan Miao
URL: https://arxiv.org/abs/2601.22385
Abstract:

Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.

190. Graph is a Substrate Across Data Modalities

Authors: Ziming Li , Xiaoming Wu , Zehong Wang , Jiazheng Li , Yijun Tian , Jinhe Bi , Yunpu Ma , Yanfang Ye , Chuxu Zhang
URL: https://arxiv.org/abs/2601.22384
Abstract:

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods.

191. Context Structure Reshapes the Representational Geometry of Language Models

Authors: Eghbal A. Hosseini , Yuxuan Li , Yasaman Bahri , Declan Campbell , Andrew Kyle Lampinen
URL: https://arxiv.org/abs/2601.22364
Abstract:

Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs \emph{within} a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs’ representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent – it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.

192. MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment

Authors: Yupeng Cao , Chengyang He , Yangyang Yu , Ping Wang , K.P. Subbalakshmi
URL: https://arxiv.org/abs/2601.22361
Abstract:

Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.

193. The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples

Authors: Hsiang Hsu , Pradeep Niroula , Zichang He , Ivan Brugere , Freddy Lecue , Chun-Fu Chen
URL: https://arxiv.org/abs/2601.22359
Abstract:

Machine unlearning offers a practical alternative to avoid full model re-training by approximately removing the influence of specific user data. While existing methods certify unlearning via statistical indistinguishability from re-trained models, these guarantees do not naturally extend to model outputs when inputs are adversarially perturbed. In particular, slight perturbations of forget samples may still be correctly recognized by the unlearned model - even when a re-trained model fails to do so - revealing a novel privacy risk: information about the forget samples may persist in their local neighborhood. In this work, we formalize this vulnerability as residual knowledge and show that it is inevitable in high-dimensional settings. To mitigate this risk, we propose a fine-tuning strategy, named RURK, that penalizes the model’s ability to re-recognize perturbed forget samples. Experiments on vision benchmarks with deep neural networks demonstrate that residual knowledge is prevalent across existing unlearning methods and that our approach effectively prevents residual knowledge.

194. Recoverability Has a Law: The ERR Measure for Tool-Augmented Agents

Authors: Sri Vatsa Vuddanti , Satwik Kumar Chittiprolu
URL: https://arxiv.org/abs/2601.22352
Abstract:

Language model agents often appear capable of self-recovery after failing tool call executions, yet this behavior lacks a formal explanation. We present a predictive theory that resolves this gap by showing that recoverability follows a measurable law. To elaborate, we formalize recoverability through Expected Recovery Regret (ERR), which quantifies the deviation of a recovery policy from the optimal one under stochastic execution noise, and derive a first-order relationship between ERR and an empirical observable quantity, the Efficiency Score (ES). This yields a falsifiable first-order quantitative law of recovery dynamics in tool-using agents. We empirically validate the law across five tool-use benchmarks spanning controlled perturbations, diagnostic reasoning, and real-world APIs. Across model scales, perturbation regimes, and recovery horizons, predicted regret under the ERR-ES law closely matched observed post-failure regret measured from Monte Carlo rollouts, within delta less than or equal to 0.05. Our results reveal that recoverability is not an artifact of model scale or architecture, but a governed property of interaction dynamics, providing a theoretical foundation for execution-level robustness in language agents.

195. Learning Policy Representations for Steerable Behavior Synthesis

Authors: Beiming Li , Sergio Rozada , Alejandro Ribeiro
URL: https://arxiv.org/abs/2601.22350
Abstract:

Given a Markov decision process (MDP), we seek to learn representations for a range of policies to facilitate behavior steering at test time. As policies of an MDP are uniquely determined by their occupancy measures, we propose modeling policy representations as expectations of state-action feature maps with respect to occupancy measures. We show that these representations can be approximated uniformly for a range of policies using a set-based architecture. Our model encodes a set of state-action samples into a latent embedding, from which we decode both the policy and its value functions corresponding to multiple rewards. We use variational generative approach to induce a smooth latent space, and further shape it with contrastive learning so that latent distances align with differences in value functions. This geometry permits gradient-based optimization directly in the latent space. Leveraging this capability, we solve a novel behavior synthesis task, where policies are steered to satisfy previously unseen value function constraints without additional training.

196. MixQuant: Pushing the Limits of Block Rotations in Post-Training Quantization

Authors: Sai Sanjeet , Ian Colbert , Pablo Monteagudo-Lago , Giuseppe Franco , Yaman Umuroglu , Nicholas J. Fraser
URL: https://arxiv.org/abs/2601.22347
Abstract:

Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, post-rotation outliers are deterministically minimized when the pre-rotation $\ell_1$ norm mass is evenly distributed across blocks. Guided by these insights, we introduce MixQuant, a block rotation-aware PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise $\ell_1$ norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge the resulting permutations into model weights before deployment. Experiments show that MixQuant consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.

197. From Retrieving Information to Reasoning with AI: Exploring Different Interaction Modalities to Support Human-AI Coordination in Clinical Decision-Making

Authors: Behnam Rahdari , Sameer Shaikh , Jonathan H Chen , Tobias Gerstenberg , Shriti Raj
URL: https://arxiv.org/abs/2601.22338
Abstract:

LLMs are popular among clinicians for decision-support because of simple text-based interaction. However, their impact on clinicians’ performance is ambiguous. Not knowing how clinicians use this new technology and how they compare it to traditional clinical decision-support systems (CDSS) restricts designing novel mechanisms that overcome existing tool limitations and enhance performance and experience. This qualitative study examines how clinicians (n=12) perceive different interaction modalities (text-based conversation with LLMs, interactive and static UI, and voice) for decision-support. In open-ended use of LLM-based tools, our participants took a tool-centric approach using them for information retrieval and confirmation with simple prompts instead of use as active deliberation partners that can handle complex questions. Critical engagement emerged with changes to the interaction setup. Engagement also differed with individual cognitive styles. Lastly, benefits and drawbacks of interaction with text, voice and traditional UIs for clinical decision-support show the lack of a one-size-fits-all interaction modality.

198. Stealthy Poisoning Attacks Bypass Defenses in Regression Settings

Authors: Javier Carnerero-Cano , Luis Muñoz-González , Phillippa Spencer , Emil C. Lupu
URL: https://arxiv.org/abs/2601.22308
Abstract:

Regression models are widely used in industrial processes, engineering and in natural and physical sciences, yet their robustness to poisoning has received less attention. When it has, studies often assume unrealistic threat models and are thus less useful in practice. In this paper, we propose a novel optimal stealthy attack formulation that considers different degrees of detectability and show that it bypasses state-of-the-art defenses. We further propose a new methodology based on normalization of objectives to evaluate different trade-offs between effectiveness and detectability. Finally, we develop a novel defense (BayesClean) against stealthy attacks. BayesClean improves on previous defenses when attacks are stealthy and the number of poisoning points is significant.

199. Conformal Prediction for Generative Models via Adaptive Cluster-Based Density Estimation

Authors: Qidong Yang , Qianyu Julie Zhu , Jonathan Giezendanner , Youssef Marzouk , Stephen Bates , Sherrie Wang
URL: https://arxiv.org/abs/2601.22298
Abstract:

Conditional generative models map input variables to complex, high-dimensional distributions, enabling realistic sample generation in a diverse set of domains. A critical challenge with these models is the absence of calibrated uncertainty, which undermines trust in individual outputs for high-stakes applications. To address this issue, we propose a systematic conformal prediction approach tailored to conditional generative models, leveraging density estimation on model-generated samples. We introduce a novel method called CP4Gen, which utilizes clustering-based density estimation to construct prediction sets that are less sensitive to outliers, more interpretable, and of lower structural complexity than existing methods. Extensive experiments on synthetic datasets and real-world applications, including climate emulation tasks, demonstrate that CP4Gen consistently achieves superior performance in terms of prediction set volume and structural simplicity. Our approach offers practitioners a powerful tool for uncertainty estimation associated with conditional generative models, particularly in scenarios demanding rigorous and interpretable prediction sets.

200. ParalESN: Enabling parallel information processing in Reservoir Computing

Authors: Matteo Pinna , Giacomo Lagomarsini , Andrea Ceni , Claudio Gallicchio
URL: https://arxiv.org/abs/2601.22296
Abstract:

Reservoir Computing (RC) has established itself as an efficient paradigm for temporal processing. However, its scalability remains severely constrained by (i) the necessity of processing temporal data sequentially and (ii) the prohibitive memory footprint of high-dimensional reservoirs. In this work, we revisit RC through the lens of structured operators and state space modeling to address these limitations, introducing Parallel Echo State Network (ParalESN). ParalESN enables the construction of high-dimensional and efficient reservoirs based on diagonal linear recurrence in the complex space, enabling parallel processing of temporal data. We provide a theoretical analysis demonstrating that ParalESN preserves the Echo State Property and the universality guarantees of traditional Echo State Networks while admitting an equivalent representation of arbitrary linear reservoirs in the complex diagonal form. Empirically, ParalESN matches the predictive accuracy of traditional RC on time series benchmarks, while delivering substantial computational savings. On 1-D pixel-level classification tasks, ParalESN achieves competitive accuracy with fully trainable neural networks while reducing computational costs and energy consumption by orders of magnitude. Overall, ParalESN offers a promising, scalable, and principled pathway for integrating RC within the deep learning landscape.

201. PersonaCite: VoC-Grounded Interviewable Agentic Synthetic AI Personas for Verifiable User and Design Research

Authors: Mario Truss
URL: https://arxiv.org/abs/2601.22288
Abstract:

LLM-based and agent-based synthetic personas are increasingly used in design and product decision-making, yet prior work shows that prompt-based personas often produce persuasive but unverifiable responses that obscure their evidentiary basis. We present PersonaCite, an agentic system that reframes AI personas as evidence-bounded research instruments through retrieval-augmented interaction. Unlike prior approaches that rely on prompt-based roleplaying, PersonaCite retrieves actual voice-of-customer artifacts during each conversation turn, constrains responses to retrieved evidence, explicitly abstains when evidence is missing, and provides response-level source attribution. Through semi-structured interviews and deployment study with 14 industry experts, we identify preliminary findings on perceived benefits, validity concerns, and design tensions, and propose Persona Provenance Cards as a documentation pattern for responsible AI persona use in human-centered design workflows.

202. VMonarch: Efficient Video Diffusion Transformers with Structured Attention

Authors: Cheng Liang , Haoxian Chen , Liang Hou , Qi Fan , Gangshan Wu , Xin Tao , Limin Wang
URL: https://arxiv.org/abs/2601.22275
Abstract:

The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.

203. Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models

Authors: Henri Aïdasso , Francis Bordeleau , Ali Tizghadam
URL: https://arxiv.org/abs/2601.22264
Abstract:

In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.

204. AI Narrative Breakdown. A Critical Assessment of Power and Promise

Authors: Rainer Rehak
URL: https://arxiv.org/abs/2601.22255
Abstract:

This article sets off for an exploration of the still evolving discourse surrounding artificial intelligence (AI) in the wake of the release of ChatGPT. It scrutinizes the pervasive narratives that are shaping the societal engagement with AI, spotlighting key themes such as agency and decision-making, autonomy, truthfulness, knowledge processing, prediction, general purpose, neutrality and objectivity, apolitical optimization, sustainability game-changer, democratization, mass unemployment, and the dualistic portrayal of AI as either a harbinger of societal utopia or dystopia. Those narratives are analysed critically based on insights from critical computer science, critical data and algorithm studies, from STS, data protection theory, as well as from the philosophy of mind and semiotics. To properly analyse the narratives presented, the article first delves into a historical and technical contextualisation of the AI discourse itself. The article then introduces the notion of “Zeitgeist AI” to critique the imprecise and misleading application of the term “AI” across various societal sectors. Then, by discussing common narratives with nuance, the article contextualises and challenges often assumed socio-political implications of AI, uncovering in detail and with examples the inherent political, power infused and value-laden decisions within all AI applications. Concluding with a call for a more grounded engagement with AI, the article carves out acute problems ignored by the narratives discussed and proposes new narratives recognizing AI as a human-directed tool necessarily subject to societal governance.

205. MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models

Authors: Ya Jiang , Massieh Kordi Boroujeny , Surender Suresh Kumar , Kai Zeng
URL: https://arxiv.org/abs/2601.22246
Abstract:

As large language models (LLMs) become integral to applications such as question answering and content creation, reliable content attribution has become increasingly important. Watermarking is a promising approach, but existing methods either provide only binary signals or distort the sampling distribution, degrading text quality; distortion-free approaches, in turn, often suffer from weak detectability or robustness. We propose MirrorMark, a multi-bit and distortion-free watermark for LLMs. By mirroring sampling randomness in a measure-preserving manner, MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. To improve robustness, we introduce a context-based scheduler that balances token assignments across message positions while remaining resilient to insertions and deletions. We further provide a theoretical analysis of the equal error rate to interpret empirical performance. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability: with 54 bits embedded in 300 tokens, it improves bit accuracy by 8-12% and correctly identifies up to 11% more watermarked texts at 1% false positive rate.

206. A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy

Authors: Pedro H. Barcha Correia , Ryan W. Achjian , Diego E. G. Caetano de Oliveira , Ygor Acacio Maria , Victor Takashi Hayashi , Marcos Lopes , Charles Christian Miers , Marcos A. Simplicio Jr
URL: https://arxiv.org/abs/2601.22240
Abstract:

The rapid advancement and widespread adoption of generative artificial intelligence (GenAI) and large language models (LLMs) has been accompanied by the emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing data leaks, unauthorized actions, or compromised outputs, for instance. As both offensive and defensive prompt injection techniques evolve quickly, a structured understanding of mitigation strategies becomes increasingly important. To address that, this work presents the first systematic literature review on prompt injection mitigation strategies, comprehending 88 studies. Building upon NIST’s report on adversarial machine learning, this work contributes to the field through several avenues. First, it identifies studies beyond those documented in NIST’s report and other academic reviews and surveys. Second, we propose an extension to NIST taxonomy by introducing additional categories of defenses. Third, by adopting NIST’s established terminology and taxonomy as a foundation, we promote consistency and enable future researchers to build upon the standardized taxonomy proposed in this work. Finally, we provide a comprehensive catalog of the reviewed prompt injection defenses, documenting their reported quantitative effectiveness across specific LLMs and attack datasets, while also indicating which solutions are open-source and model-agnostic. This catalog, together with the guidelines presented herein, aims to serve as a practical resource for researchers advancing the field of adversarial machine learning and for developers seeking to implement effective defenses in production systems.

207. Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Authors: Ken Deng , Yifu Qiu , Yoni Kasten , Shay B. Cohen , Yftah Ziser
URL: https://arxiv.org/abs/2601.22228
Abstract:

Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning compared to their limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 ($0.64$) fall short of classic geometric baselines ($0.97$) and human performance ($0.92$). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best $59.7\%$) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.

Authors: Xinyuan Song , Liang Zhao
URL: https://arxiv.org/abs/2601.22209
Abstract:

Multi-agent systems (MAS) increasingly solve complex tasks by orchestrating agents and tools selected from rapidly growing marketplaces. As these marketplaces expand, many candidates become functionally overlapping, making selection not just a retrieval problem: beyond filtering relevant agents, an orchestrator must choose options that are reliable, compatible with the current execution context, and able to cooperate with other selected agents. Existing recommender systems – largely built for item-level ranking from flat user-item logs – do not directly address the structured, sequential, and interaction-dependent nature of agent orchestration. We address this gap by \textbf{formulating agent recommendation in MAS as a constrained decision problem} and introducing a generic \textbf{constrained recommendation framework} that first uses retrieval to build a compact candidate set conditioned on the current subtask and context, and then performs \textbf{utility optimization} within this feasible set using a learned scorer that accounts for relevance, reliability, and interaction effects. We ground both the formulation and learning signals in \textbf{historical calling trees}, which capture the execution structure of MAS (parent-child calls, branching dependencies, and local cooperation patterns) beyond what flat logs provide. The framework supports two complementary settings: \textbf{agent-level recommendation} (select the next agent/tool) and \textbf{system-level recommendation} (select a small, connected agent team/subgraph for coordinated execution). To enable systematic evaluation, we construct a unified calling-tree benchmark by normalizing invocation logs from eight heterogeneous multi-agent corpora into a shared structured representation.

209. Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram

Authors: Huinan Xu , Xuyang Feng , Junhong Chen , Junchen Liu , Kaiwen Deng , Kai Ding , Shengning Long , Jiaxue Shuai , Zhaorong Li , Shiping Liu , Guirong Xue , Zhan Xiao
URL: https://arxiv.org/abs/2601.22203
Abstract:

Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single-nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs via a genomic-specific hashing scheme, establishing genomic “syntax”. Integrated into the backbone of state-of-the-art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram’s latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology-aligned pathway for the next generation of GFMs. The code is available at this https URL , and the model checkpoint is available at this https URL .

210. Advanced techniques and applications of LiDAR Place Recognition in Agricultural Environments: A Comprehensive Survey

Authors: Judith Vilella-Cantos , Mónica Ballesta , David Valiente , María Flores , Luis Payá
URL: https://arxiv.org/abs/2601.22198
Abstract:

An optimal solution to the localization problem is essential for developing autonomous robotic systems. Apart from autonomous vehicles, precision agriculture is one of the elds that can bene t most from these systems. Although LiDAR place recognition is a widely used technique in recent years to achieve accurate localization, it is mostly used in urban settings. However, the lack of distinctive features and the unstructured nature of agricultural environments make place recognition challenging. This work presents a comprehensive review of state-of-the-art the latest deep learning applications for agricultural environments and LPR techniques. We focus on the challenges that arise in these environments. We analyze the existing approaches, datasets, and metrics used to evaluate LPR system performance and discuss the limitations and future directions of research in this eld. This is the rst survey that focuses on LiDAR based localization in agricultural settings, with the aim of providing a thorough understanding and fostering further research in this specialized domain.

211. Neural Signals Generate Clinical Notes in the Wild

Authors: Jathurshan Pradeepkumar , Zheng Chen , Jimeng Sun
URL: https://arxiv.org/abs/2601.22197
Abstract:

Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We curate a large-scale clinical EEG dataset with $9{,}922$ reports paired with approximately $11{,}000$ hours of EEG recordings from $9{,}048$ patients. We therefore develop CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales, including recording description, background activity, epileptiform abnormalities, events/seizures, and impressions. Experimental results show that, with patient history supervision, our method achieves $70\%$–$95\%$ average relative improvements in standard generation metrics (e.g., ROUGE-1 and METEOR) from $0.2$–$0.3$ to $0.4$–$0.6$. In the zero-shot setting without patient history, CELM attains generation scores in the range of $0.43$–$0.52$, compared to baselines of $0.17$–$0.26$. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We release our model and benchmark construction pipeline at [URL].

212. Multitask Learning for Earth Observation Data Classification with Hybrid Quantum Network

Authors: Fan Fan , Yilei Shi , Tobias Guggemos , Xiao Xiang Zhu
URL: https://arxiv.org/abs/2601.22195
Abstract:

Quantum machine learning (QML) has gained increasing attention as a potential solution to address the challenges of computation requirements in the future. Earth observation (EO) has entered the era of Big Data, and the computational demands for effectively analyzing large EO data with complex deep learning models have become a bottleneck. Motivated by this, we aim to leverage quantum computing for EO data classification and explore its advantages despite the current limitations of quantum devices. This paper presents a hybrid model that incorporates multitask learning to assist efficient data encoding and employs a location weight module with quantum convolution operations to extract valid features for classification. The validity of our proposed model was evaluated using multiple EO benchmarks. Additionally, we experimentally explored the generalizability of our model and investigated the factors contributing to its advantage, highlighting the potential of QML in EO data analysis.

213. Practical Evaluation of Quantum Kernel Methods for Radar Micro-Doppler Classification on Noisy Intermediate-Scale Quantum (NISQ) Hardware

Authors: Vikas Agnihotri , Jasleen Kaur , Sarvagya Kaushik
URL: https://arxiv.org/abs/2601.22194
Abstract:

This paper examines the application of a Quantum Support Vector Machine (QSVM) for radarbased aerial target classification using micro-Doppler signatures. Classical features are extracted and reduced via Principal Component Analysis (PCA) to enable efficient quantum encoding. The reduced feature vectors are embedded into a quantum kernel-induced feature space using a fully entangled ZZFeatureMap and classified using a kernel based QSVM. Performance is first evaluated on a quantum simulator and subsequently validated on NISQ-era superconducting quantum hardware, specifically the IBM Torino (133-qubit) and IBM Fez (156-qubit) processors. Experimental results demonstrate that the QSVM achieves competitive classification performance relative to classical SVM baselines while operating on substantially reduced feature dimensionality. Hardware experiments reveal the impact of noise and decoherence and measurement shot count on quantum kernel estimation, and further show improved stability and fidelity on newer Heron r2 architecture. This study provides a systematic comparison between simulator-based and hardware-based QSVM implementations and highlights both the feasibility and current limitations of deploying quantum kernel methods for practical radar signal classification tasks.

214. COL-Trees: Efficient Hierarchical Object Search in Road Networks

Authors: Tenindra Abeywickrama , Muhammad Aamir Cheema , Sabine Storandt
URL: https://arxiv.org/abs/2601.22183
Abstract:

Location-based services rely heavily on efficient methods that search for relevant points-of-interest (POIs) near a given location. A k Nearest Neighbor (kNN) query is one such example that finds the k closest POIs from an agent’s location. While most existing techniques focus on retrieving nearby POIs for a single agent, these search heuristics do not translate to many other applications. For example, Aggregate k Nearest Neighbor (AkNN) queries require POIs that are close to multiple agents. k Farthest Neighbor (kFN) queries require POIs that are the antithesis of nearest. Such problems naturally benefit from a hierarchical approach, but existing methods rely on Euclidean-based heuristics, which have diminished effectiveness in graphs such as road networks. We propose a novel data structure, COL-Tree (Compacted Object-Landmark Tree), to address this gap by enabling efficient hierarchical graph traversal using a more accurate landmark-based heuristic. We then present query algorithms that utilize COL-Trees to efficiently answer AkNN, kFN, and other queries. In our experiments on real-world and synthetic datasets, we demonstrate that our techniques significantly outperform existing approaches, achieving up to 4 orders of magnitude improvement. Moreover, this comes at a small pre-processing overhead in both theory and practice.

215. ShellForge: Adversarial Co-Evolution of Webshell Generation and Multi-View Detection for Robust Webshell Defense

Authors: Yizhong Ding
URL: https://arxiv.org/abs/2601.22182
Abstract:

Webshells remain a primary foothold for attackers to compromise servers, particularly within PHP ecosystems. However, existing detection mechanisms often struggle to keep pace with rapid variant evolution and sophisticated obfuscation techniques that camouflage malicious intent. Furthermore, many current defenses suffer from high false-alarm rates when encountering benign administrative scripts that employ heavy obfuscation for intellectual property protection. To address these challenges, we present ShellForge, an adversarial co-evolution framework that couples automated webshell generation with multi-view detection to continuously harden defensive boundaries. The framework operates through an iterative co-training loop where a generator and a detector mutually reinforce each other via the exchange of hard samples. The generator is optimized through supervised fine-tuning and preference-based reinforcement learning to synthesize functional, highly evasive variants. Simultaneously, we develop a multi-view fusion detector that integrates semantic features from long-string compression, structural features from pruned abstract syntax trees, and global statistical indicators such as Shannon entropy. To minimize false positives, ShellForge utilizes a LLM-based transformation to create de-malicious samples–scripts that retain complex obfuscation patterns but lack harmful payloads–serving as high-quality hard negatives during training. Evaluations on the public FWOID benchmark demonstrate that ShellForge significantly enhances defensive robustness. Upon convergence, the detector maintains a 0.981 F1-score while the generator achieves a 0.939 evasion rate against commercial engines on VirusTotal.

216. In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Authors: Anudeex Shetty , Aditya Joshi , Salil S. Kanhere
URL: https://arxiv.org/abs/2601.22169
Abstract:

Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver for safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, as compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation and LLM-based evaluators and analysis of error categories, our findings highlight a correspondence between human-intoxicated behaviour, and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters for LLM safety tuning, highlighting significant risks to LLM safety.

217. Stablecoin Design with Adversarial-Robust Multi-Agent Systems via Trust-Weighted Signal Aggregation

Authors: Shengwei You , Aditya Joshi , Andrey Kuehlkamp , Jarek Nabrzyski
URL: https://arxiv.org/abs/2601.22168
Abstract:

Algorithmic stablecoins promise decentralized monetary stability by maintaining a target peg through programmatic reserve management. Yet, their reserve controllers remain vulnerable to regime-blind optimization, calibrating risk parameters on fair-weather data while ignoring tail events that precipitate cascading failures. The March 2020 Black Thursday collapse, wherein MakerDAO’s collateral auctions yielded $8.3M in losses and a 15% peg deviation, exposed a critical gap: existing models like SAS systematically omit extreme volatility regimes from covariance estimates, producing allocations optimal in expectation but catastrophic under adversarial stress. We present MVF-Composer, a trust-weighted Mean-Variance Frontier reserve controller incorporating a novel Stress Harness for risk-state estimation. Our key insight is deploying multi-agent simulations as adversarial stress-testers: heterogeneous agents (traders, liquidity providers, attackers) execute protocol actions under crisis scenarios, exposing reserve vulnerabilities before they manifest on-chain. We formalize a trust-scoring mechanism T: A -> [0,1] that down-weights signals from agents exhibiting manipulative behavior, ensuring the risk-state estimator remains robust to signal injection and Sybil attacks. Across 1,200 randomized scenarios with injected Black-Swan shocks (10% collateral drawdown, 50% sentiment collapse, coordinated redemption attacks), MVF-Composer reduces peak peg deviation by 57% and mean recovery time by 3.1x relative to SAS baselines. Ablation studies confirm the trust layer accounts for 23% of stability gains under adversarial conditions, achieving 72% adversarial agent detection. Our system runs on commodity hardware, requires no on-chain oracles beyond standard price feeds, and provides a reproducible framework for stress-testing DeFi reserve policies.

218. UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos

Authors: Zhi Yang , Lingfeng Zeng , Fangqi Lou , Qi Qi , Wei Zhang , Zhenyu Wu , Zhenxiong Yu , Jun Han , Zhiheng Jin , Lejie Zhang , Xiaoming Huang , Xiaolong Liang , Zheng Wei , Junbo Zou , Dongpo Cheng , Zhaowei Liu , Xin Guo , Rongjunchen Zhang , Liwen Zhang
URL: https://arxiv.org/abs/2601.22162
Abstract:

Multimodal large language models are playing an increasingly significant role in empowering the financial domain, however, the challenges they face, such as multimodal and high-density information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset consisting of 3,767 question-answer pairs in both chinese and english and systematically evaluate 10 mainstream MLLMs under Zero-Shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs’ capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLMs applications in real-world financial scenarios. Data and code are available at this https URL .

219. Screen, Match, and Cache: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation

Authors: Jianan Wang , Nailei Hei , Li He , Huanzhen Wang , Aoxing Li , Haofen Wang , Yan Wang , Wenqiang Zhang
URL: https://arxiv.org/abs/2601.22160
Abstract:

Human animation aims to generate temporally coherent and visually consistent videos over long sequences, yet modeling long-range dependencies while preserving frame quality remains challenging. Inspired by the human ability to leverage past observations for interpreting ongoing actions, we propose FrameCache, a training-free three-stage framework consisting of Screen, Cache, and Match. In the Screen stage, a multi-dimensional, quality-aware mechanism with adaptive thresholds dynamically selects informative frames; the Cache stage maintains a reference pool using a dynamic replacement-hit strategy, preserving both diversity and relevance; and the Match stage extracts behavioral features to perform motion-consistent reference matching for coherent animation guidance. Extensive experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability while integrating seamlessly with diverse baselines. Despite these encouraging results, further analysis reveals that its effectiveness depends on baseline temporal reasoning and real-synthetic consistency, motivating future work on compatibility conditions and adaptive cache mechanisms. Code will be made publicly available.

220. Smart Routing with Precise Link Estimation: DSEE-Based Anypath Routing for Reliable Wireless Networking

Authors: Narjes Nourzad , Bhaskar Krishnamachari
URL: https://arxiv.org/abs/2405.10377
Abstract:

In dynamic and resource-constrained environments, such as multi-hop wireless mesh networks, traditional routing protocols often falter by relying on predetermined paths that prove ineffective in unpredictable link conditions. Shortest Anypath routing offers a solution by adapting routing decisions based on real-time link conditions. However, the effectiveness of such routing is fundamentally dependent on the quality and reliability of the available links, and predicting these variables with certainty is challenging. This paper introduces a novel approach that leverages the Deterministic Sequencing of Exploration and Exploitation (DSEE), a multi-armed bandit algorithm, to address the need for accurate and real-time estimation of link delivery probabilities. This approach augments the reliability and resilience of the Shortest Anypath routing in the face of fluctuating link conditions. By coupling DSEE with Anypath routing, this algorithm continuously learns and ensures accurate delivery probability estimation and selects the most suitable way to efficiently route packets while maintaining a provable near-logarithmic regret bound. We also theoretically prove that our proposed scheme offers better regret scaling with respect to the network size than the previously proposed Thompson Sampling-based Opportunistic Routing (TSOR).

전체 AI 논문 - 2026-02-02

1. Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

2. Scaling Multiagent Systems with Process Rewards

3. High-quality generation of dynamic game content via small language models: A proof of concept

4. TSAQA: Time Series Analysis Question And Answering Benchmark

5. Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization

6. THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

7. RAudit: A Blind Auditing Protocol for Large Language Model Reasoning

8. Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

9. MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration

10. From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

11. The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

12. Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

13. TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI

14. Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

15. Quantifying Model Uniqueness in Heterogeneous AI Ecosystems

16. Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

17. EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning

18. Alignment among Language, Vision and Action Representations

19. MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

20. Game-Theoretic Co-Evolution for LLM-Based Heuristic Discovery

21. Aligning the Unseen in Attributed Graphs: Interplay between Graph Geometry and Node Attributes Manifold

22. CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

23. Conditional Performance Guarantee for Large Reasoning Models

24. Toward IIT-Inspired Consciousness in LLMs: A Reward-Based Learning Framework

25. Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training

26. TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

27. AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement

28. A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

29. Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference

30. Real-Time Aligned Reward Model beyond Semantics

31. Task-Aware LLM Council with Adaptive Decision Pathways for Decision Support

32. UCPO: Uncertainty-Aware Policy Optimization

33. Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments

34. Beyond Medical Chatbots: Meddollina and the Rise of Continuous Clinical Intelligence

35. Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

36. SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly

37. EntroCut: Entropy-Guided Adaptive Truncation for Efficient Chain-of-Thought Reasoning in Small-scale Large Reasoning Models

38. From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

39. Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

40. WED-Net: A Weather-Effect Disentanglement Network with Causal Augmentation for Urban Flow Prediction

41. PerfGuard: A Performance-Aware Agent for Visual Content Generation

42. Decoding in Geometry: Alleviating Embedding-Space Crowding for Complex Reasoning

43. Enhancing TableQA through Verifiable Reasoning Trace Reward

44. Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution

45. Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models

46. Controllable Information Production

47. Anytime Safe PAC Efficient Reasoning

48. When LLM meets Fuzzy-TOPSIS for Personnel Selection through Automated Profile Analysis

49. AI-Enabled Waste Classification as a Data-Driven Decision Support Tool for Circular Economy and Urban Sustainability

50. Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems

51. Learning Provably Correct Distributed Protocols Without Human Knowledge

52. Sparks of Rationality: Do Reasoning LLMs Align with Human Judgment and Choice?

53. Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents

54. The Six Sigma Agent: Achieving Enterprise-Grade Reliability in LLM Systems Through Consensus-Driven Decomposed Execution

55. JAF: Judge Agent Forest

56. VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

57. End-to-end Optimization of Belief and Policy Learning in Shared Autonomy Paradigms

58. IRL-DAL: Safe and Adaptive Trajectory Planning for Autonomous Driving via Energy-Guided Diffusion Models

59. TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training

60. Agnostic Language Identification and Generation

61. Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

62. YuriiFormer: A Suite of Nesterov-Accelerated Transformers

63. ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

64. Agile Reinforcement Learning through Separable Neural Architecture

65. Med-Scout: Curing MLLMs’ Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

66. MonoScale: Scaling Multi-Agent System with Monotonic Improvement

67. Disentangling multispecific antibody function with graph neural networks

68. Learning to Execute Graph Algorithms Exactly with Graph Neural Networks

69. Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

70. Probing the Trajectories of Reasoning Traces in Large Language Models

71. SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training

72. On Safer Reinforcement Learning Policies for Sedation and Analgesia in Intensive Care

73. Securing Time in Energy IoT: A Clock-Dynamics-Aware Spatio-Temporal Graph Attention Network for Clock Drift Attacks and Y2K38 Failures

74. Machine Learning for Energy-Performance-aware Scheduling

75. Secure Tool Manifest and Digital Signing Solution for Verifiable MCP and LLM Pipelines

76. Regularisation in neural networks: a survey and empirical analysis of approaches

77. To See Far, Look Close: Evolutionary Forecasting for Long-term Time Series

78. WiFiPenTester: Advancing Wireless Ethical Hacking with Governed GenAI

79. From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching