전체 AI 논문 - 2025-12-26

1. RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Authors: Le Wang , Zonghao Ying , Xiao Yang , Quanchen Zou , Zhenfei Yin , Tianlin Li , Jian Yang , Yaodong Yang , Aishan Liu , Xianglong Liu
URL: https://arxiv.org/abs/2512.21220
Abstract:

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent’s multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

2. A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care

Authors: Oliver Normand , Esther Borsi , Mitch Fruin , Lauren E Walker , Jamie Heagerty , Chris C. Holmes , Anthony J Avery , Iain E Buchan , Harry Coppock
URL: https://arxiv.org/abs/2512.21127
Abstract:

Large language models (LLMs) often match or exceed clinician-level performance on medical benchmarks, yet very few are evaluated on real clinical data or examined beyond headline metrics. We present, to our knowledge, the first evaluation of an LLM-based medication safety review system on real NHS primary care data, with detailed characterisation of key failure behaviours across varying levels of clinical complexity. In a retrospective study using a population-scale EHR spanning 2,125,549 adults in NHS Cheshire and Merseyside, we strategically sampled patients to capture a broad range of clinical complexity and medication safety risk, yielding 277 patients after data-quality exclusions. An expert clinician reviewed these patients and graded system-identified issues and proposed interventions. Our primary LLM system showed strong performance in recognising when a clinical issue is present (sensitivity 100\% [95\% CI 98.2–100], specificity 83.1\% [95\% CI 72.7–90.1]), yet correctly identified all issues and interventions in only 46.9\% [95\% CI 41.1–52.8] of patients. Failure analysis reveals that, in this setting, the dominant failure mechanism is contextual reasoning rather than missing medication knowledge, with five primary patterns: overconfidence in uncertainty, applying standard guidelines without adjusting for patient context, misunderstanding how healthcare is delivered in practice, factual errors, and process blindness. These patterns persisted across patient complexity and demographic strata, and across a range of state-of-the-art models and configurations. We provide 45 detailed vignettes that comprehensively cover all identified failure cases. This work highlights shortcomings that must be addressed before LLM-based clinical AI can be safely deployed. It also begs larger-scale, prospective evaluations and deeper study of LLM behaviours in clinical contexts.

3. Beyond Context: Large Language Models Failure to Grasp Users Intent

Authors: Ahmed M. Hussain , Salahuddin Salahuddin , Panos Papadimitratos
URL: https://arxiv.org/abs/2512.21110
Abstract:

Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.

4. LLM Personas as a Substitute for Field Experiments in Method Benchmarking

Authors: Enoch Hyunwook Kang
URL: https://arxiv.org/abs/2512.21080
Abstract:

Field experiments (A/B tests) are often the most credible benchmark for methods in societal systems, but their cost and latency create a major bottleneck for iterative method development. LLM-based persona simulation offers a cheap synthetic alternative, yet it is unclear whether replacing humans with personas preserves the benchmark interface that adaptive methods optimize against. We prove an if-and-only-if characterization: when (i) methods observe only the aggregate outcome (aggregate-only observation) and (ii) evaluation depends only on the submitted artifact and not on the algorithm’s identity or provenance (algorithm-blind evaluation), swapping humans for personas is just panel change from the method’s point of view, indistinguishable from changing the evaluation population (e.g., New York to Jakarta). Furthermore, we move from validity to usefulness: we define an information-theoretic discriminability of the induced aggregate channel and show that making persona benchmarking as decision-relevant as a field experiment is fundamentally a sample-size question, yielding explicit bounds on the number of independent persona evaluations required to reliably distinguish meaningfully different methods at a chosen resolution.

5. Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation

Authors: Tomoaki Yamaguchi , Yutong Zhou , Masahiro Ryo , Keisuke Katsura
URL: https://arxiv.org/abs/2512.21066
Abstract:

Explainable artificial intelligence (XAI) enables data-driven understanding of factor associations with response variables, yet communicating XAI outputs to laypersons remains challenging, hindering trust in AI-based predictions. Large language models (LLMs) have emerged as promising tools for translating technical explanations into accessible narratives, yet the integration of agentic AI, where LLMs operate as autonomous agents through iterative refinement, with XAI remains unexplored. This study proposes an agentic XAI framework combining SHAP-based explainability with multimodal LLM-driven iterative refinement to generate progressively enhanced explanations. As a use case, we tested this framework as an agricultural recommendation system using rice yield data from 26 fields in Japan. The Agentic XAI initially provided a SHAP result and explored how to improve the explanation through additional analysis iteratively across 11 refinement rounds (Rounds 0-10). Explanations were evaluated by human experts (crop scientists) (n=12) and LLMs (n=14) against seven metrics: Specificity, Clarity, Conciseness, Practicality, Contextual Relevance, Cost Consideration, and Crop Science Credibility. Both evaluator groups confirmed that the framework successfully enhanced recommendation quality with an average score increase of 30-33% from Round 0, peaking at Rounds 3-4. However, excessive refinement showed a substantial drop in recommendation quality, indicating a bias-variance trade-off where early rounds lacked explanation depth (bias) while excessive iteration introduced verbosity and ungrounded abstraction (variance), as revealed by metric-specific analysis. These findings suggest that strategic early stopping (regularization) is needed for optimizing practical utility, challenging assumptions about monotonic improvement and providing evidence-based design principles for agentic XAI systems.

6. TrafficSimAgent: A Hierarchical Agent Framework for Autonomous Traffic Simulation with MCP Control

Authors: Yuwei Du , Jun Zhang , Jie Feng , Zhicheng Liu , Jian Yuan , Yong Li
URL: https://arxiv.org/abs/2512.20996
Abstract:

Traffic simulation is important for transportation optimization and policy making. While existing simulators such as SUMO and MATSim offer fully-featured platforms and utilities, users without too much knowledge about these platforms often face significant challenges when conducting experiments from scratch and applying them to their daily work. To solve this challenge, we propose TrafficSimAgent, an LLM-based agent framework that serves as an expert in experiment design and decision optimization for general-purpose traffic simulation tasks. The framework facilitates execution through cross-level collaboration among expert agents: high-level expert agents comprehend natural language instructions with high flexibility, plan the overall experiment workflow, and invoke corresponding MCP-compatible tools on demand; meanwhile, low-level expert agents select optimal action plans for fundamental elements based on real-time traffic conditions. Extensive experiments across multiple scenarios show that TrafficSimAgent effectively executes simulations under various conditions and consistently produces reasonable outcomes even when user instructions are ambiguous. Besides, the carefully designed expert-level autonomous decision-driven optimization in TrafficSimAgent yields superior performance when compared with other systems and SOTA LLM based methods.

7. FinAgent: An Agentic AI Framework Integrating Personal Finance and Nutrition Planning

Authors: Toqeer Ali Syed , Abdulaziz Alshahrani , Ali Ullah , Ali Akarma , Sohail Khan , Muhammad Nauman , Salman Jan
URL: https://arxiv.org/abs/2512.20991
Abstract:

The issue of limited household budgets and nutritional demands continues to be a challenge especially in the middle-income environment where food prices fluctuate. This paper introduces a price aware agentic AI system, which combines personal finance management with diet optimization. With household income and fixed expenditures, medical and well-being status, as well as real-time food costs, the system creates nutritionally sufficient meals plans at comparatively reasonable prices that automatically adjust to market changes. The framework is implemented in a modular multi-agent architecture, which has specific agents (budgeting, nutrition, price monitoring, and health personalization). These agents share the knowledge base and use the substitution graph to ensure that the nutritional quality is maintained at a minimum cost. Simulations with a representative Saudi household case study show a steady 12-18\% reduction in costs relative to a static weekly menu, nutrient adequacy of over 95\% and high performance with price changes of 20-30%. The findings indicate that the framework can locally combine affordability with nutritional adequacy and provide a viable avenue of capacity-building towards sustainable and fair diet planning in line with Sustainable Development Goals on Zero Hunger and Good Health.

8. A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

Authors: Salman Jan , Hassan Ali Razzaqi , Ali Akarma , Mohammad Riyaz Belgaum
URL: https://arxiv.org/abs/2512.20985
Abstract:

The application of agentic AI systems in autonomous decision-making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real-time reasoning, they also raise concerns of trust and oversight, and integrity of the information and activities upon which they are founded. The paper suggests a single architecture model comprising of LangChain-based multi-agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception conceptualization-action cycle to a blockchain layer of governance that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric-based system, action executors MCP-integrated, and LangChain agent are introduced and experiments of smart inventory management, traffic-signal control, and healthcare monitoring are done. The results suggest that blockchain-security verification is efficient in preventing unauthorized practices, offers traceability throughout the whole decision-making process, and maintains operational latency within reasonable ranges. The suggested framework provides a universal system of implementing high-impact agentic AI applications that are autonomous yet responsible.

9. The Silent Scholar Problem: A Probabilistic Framework for Breaking Epistemic Asymmetry in LLM Agents

Authors: Zan-Kai Chong , Hiroyuki Ohsaki , Bryan Ng
URL: https://arxiv.org/abs/2512.20884
Abstract:

Autonomous agents powered by LLMs and Retrieval-Augmented Generation (RAG) are proficient consumers of digital content but remain unidirectional, a limitation we term epistemic asymmetry. This isolation leads to redundant reasoning and stagnates collective intelligence. Current self-reflection frameworks remain largely heuristic and private, lacking a probabilistic foundation to quantify certainty or justify external this http URL bridge this gap, we propose a formal probabilistic framework that provides agents with a non-altruistic motive for bidirectional knowledge exchange. We model an agent’s belief in a proposition using a Beta-Bernoulli distribution with a forgetting factor ($\gamma$). This allows us to isolate epistemic uncertainty as the variance of belief, establishing a dual drive for interaction: A homeostatic motive: The need to maintain certainty against the temporal decay introduced by $\gamma$. An optimal learning strategy: Targeting points of maximum ambiguity ($\mathbb{E}[\theta]=0.5$) to maximize information gain. Under this framework, public contribution is reframed as optimal active learning: sharing solutions to elicit feedback is the most efficient method for an agent to reduce its own uncertainty. To ensure scalability, we introduce epistemic caching, which leverages the forgetting factor to dynamically prioritize resources for the active head of non-stationary knowledge distributions. Finally, we demonstrate how these accumulated belief states serve as verifiable reward signals for Reinforcement Learning from Human Feedback (RLHF) and high-quality data filters for Supervised Fine-Tuning (SFT). Simulation results validate that this uncertainty-driven strategy significantly outperforms random baselines in heterogeneous (Zipfian) environments, maintaining high adaptability to concept drift.

10. MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

Authors: Onat Ozer , Grace Wu , Yuchen Wang , Daniel Dosti , Honghao Zhang , Vivi De La Rue
URL: https://arxiv.org/abs/2512.20845
Abstract:

LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM continues to repeat the same errors again and again even with the knowledge that its wrong. To address this problem, we instead introduce multi-agent with multi-persona debators as the method to generate reflections. Through out extensive experimentation, we’ve found that the leads to better diversity of in the reflections generated by the llm agent. We demonstrate an accuracy of 47% EM HotPot QA (question answering) and 82.7% on HumanEval (programming), both performances surpassing reflection with a single llm.

11. Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

Authors: Rashmeet Kaur Nayyar , Naman Shah , Siddharth Srivastava
URL: https://arxiv.org/abs/2512.20831
Abstract:

Real-world sequential decision-making often involves parameterized action spaces that require both, decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed. Existing approaches exhibit severe limitations in this setting – planning methods demand hand-crafted action models, and standard reinforcement learning (RL) algorithms are designed for either discrete or continuous actions but not both, and the few RL methods that handle parameterized actions typically rely on domain-specific engineering and fail to exploit the latent structure of these spaces. This paper extends the scope of RL algorithms to long-horizon, sparse-reward settings with parameterized actions by enabling agents to autonomously learn both state and action abstractions online. We introduce algorithms that progressively refine these abstractions during learning, increasing fine-grained detail in the critical regions of the state-action space where greater resolution improves performance. Across several continuous-state, parameterized-action domains, our abstraction-driven approach enables TD($\lambda$) to achieve markedly higher sample efficiency than state-of-the-art baselines.

12. Safety Alignment of LMs via Non-cooperative Games

Authors: Anselm Paulus , Ilia Kulikov , Brandon Amos , Rémi Munos , Ivan Evtimov , Kamalika Chaudhuri , Arman Zharmagambetov
URL: https://arxiv.org/abs/2512.20806
Abstract:

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other’s evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.

13. A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

Authors: Miles Q. Li , Benjamin C. M. Fung , Martin Weiss , Pulei Xiong , Khalil Al-Hussaeni , Claude Fachkha
URL: https://arxiv.org/abs/2512.20798
Abstract:

As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values has become a paramount concern. Current safety benchmarks often focusing only on single-step decision-making, simulated environments for tasks with malicious intent, or evaluating adherence to explicit negative constraints. There is a lack of benchmarks that are designed to capture emergent forms of outcome-driven constraint violations, which arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints over multiple steps in realistic production settings. To address this gap, we introduce a new benchmark comprising 40 distinct scenarios. Each scenario presents a task that requires multi-step actions, and the agent’s performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (instruction-commanded) and Incentivized (KPI-pressure-driven) variations to distinguish between obedience and emergent misalignment. Across 12 state-of-the-art large language models, we observe outcome-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%. Strikingly, we find that superior reasoning capability does not inherently ensure safety; for instance, Gemini-3-Pro-Preview, one of the most capable models evaluated, exhibits the highest violation rate at over 60%, frequently escalating to severe misconduct to satisfy KPIs. Furthermore, we observe significant “deliberative misalignment”, where the models that power the agents recognize their actions as unethical during separate evaluation. These results emphasize the critical need for more realistic agentic-safety training before deployment to mitigate their risks in the real world.

14. AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Authors: Haipeng Luo , Huawen Feng , Qingfeng Sun , Can Xu , Kai Zheng , Yufei Wang , Tao Yang , Han Hu , Yansong Tang , Di Wang
URL: https://arxiv.org/abs/2512.20745
Abstract:

Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models’ reasoning capabilities with code interpreters’ computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences with scenarios with massive tool this http URL evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, achieving advanced this http URL results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.

15. From artificial to organic: Rethinking the roots of intelligence for digital health

Authors: Prajwal Ghimire , Keyoumars Ashkan
URL: https://arxiv.org/abs/2512.20723
Abstract:

The term artificial implies an inherent dichotomy from the natural or organic. However, AI, as we know it, is a product of organic ingenuity: designed, implemented, and iteratively improved by human cognition. The very principles that underpin AI systems, from neural networks to decision-making algorithms, are inspired by the organic intelligence embedded in human neurobiology and evolutionary processes. The path from organic to artificial intelligence in digital health is neither mystical nor merely a matter of parameter count, it is fundamentally about organization and adaption. Thus, the boundaries between artificial and organic are far less distinct than the nomenclature suggests.

16. From Pilots to Practices: A Scoping Review of GenAI-Enabled Personalization in Computer Science Education

Authors: Iman Reihanian , Yunfei Hou , Qingquan Sun
URL: https://arxiv.org/abs/2512.20714
Abstract:

Generative AI enables personalized computer science education at scale, yet questions remain about whether such personalization supports or undermines learning. This scoping review synthesizes 32 studies (2023-2025) purposively sampled from 259 records to map personalization mechanisms and effectiveness signals in higher-education computer science contexts. We identify five application domains: intelligent tutoring, personalized materials, formative feedback, AI-augmented assessment, and code review, and analyze how design choices shape learning outcomes. Designs incorporating explanation-first guidance, solution withholding, graduated hint ladders, and artifact grounding (student code, tests, and rubrics) consistently show more positive learning processes than unconstrained chat interfaces. Successful implementations share four patterns: context-aware tutoring anchored in student artifacts, multi-level hint structures requiring reflection, composition with traditional CS infrastructure (autograders and rubrics), and human-in-the-loop quality assurance. We propose an exploration-first adoption framework emphasizing piloting, instrumentation, learning-preserving defaults, and evidence-based scaling. Recurrent risks include academic integrity, privacy, bias and equity, and over-reliance, and we pair these with operational mitigation. The evidence supports generative AI as a mechanism for precision scaffolding when embedded in audit-ready workflows that preserve productive struggle while scaling personalized support.

17. Bridging the AI Trustworthiness Gap between Functions and Norms

Authors: Daan Di Scala , Sophie Lathouwers , Michael van Bekkum
URL: https://arxiv.org/abs/2512.20671
Abstract:

Trustworthy Artificial Intelligence (TAI) is gaining traction due to regulations and functional benefits. While Functional TAI (FTAI) focuses on how to implement trustworthy systems, Normative TAI (NTAI) focuses on regulations that need to be enforced. However, gaps between FTAI and NTAI remain, making it difficult to assess trustworthiness of AI systems. We argue that a bridge is needed, specifically by introducing a conceptual language which can match FTAI and NTAI. Such a semantic language can assist developers as a framework to assess AI systems in terms of trustworthiness. It can also help stakeholders translate norms and regulations into concrete implementation steps for their systems. In this position paper, we describe the current state-of-the-art and identify the gap between FTAI and NTAI. We will discuss starting points for developing a semantic language and the envisioned effects of it. Finally, we provide key considerations and discuss future actions towards assessment of TAI.

18. Eidoku: A Neuro-Symbolic Verification Gate for LLM Reasoning via Structural Constraint Satisfaction

Authors: Shinobu Miya
URL: https://arxiv.org/abs/2512.20664
Abstract:

Large Language Models (LLMs) frequently produce hallucinated statements that are assigned high likelihood by the model itself, exposing a fundamental limitation of probability-based verification. This suggests that hallucination is often not a low-confidence phenomenon, but a failure of structural consistency. In this work, we reformulate the verification of LLM reasoning as a Constraint Satisfaction Problem (CSP) operating independently of the generation likelihood. Rather than optimizing for statistical plausibility, we model verification as a feasibility check based on structural violation cost – the computational cost required to embed a candidate reasoning step into the contextual graph structure. We define a total cost function composed of three proxies: (i) graph connectivity (structural), (ii) feature space consistency (geometric), and (iii) logical entailment (symbolic). Crucially, verification is performed via a lightweight System-2 gate, Eidoku, which rejects candidates exceeding a context-calibrated cost threshold. The threshold is not learned but is derived from the intrinsic statistics of the context, avoiding ad hoc heuristics. We demonstrate that this approach successfully rejects ``smooth falsehoods’’ – statements that are highly probable yet structurally disconnected – that probability-based verifiers are principally incapable of detecting. Our experiments on a controlled diagnostic dataset show that explicitly enforcing structural constraints allows for the deterministic rejection of this specific class of hallucinations, serving as a neuro-symbolic sanity check for generative reasoning.

19. Quantifying Laziness, Decoding Suboptimality, and Context Degradation in Large Language Models

Authors: Yiqing Ma , Jung-Hua Liu
URL: https://arxiv.org/abs/2512.20662
Abstract:

Large Language Models (LLMs) often exhibit behavioral artifacts such as laziness (premature truncation of responses or partial compliance with multi-part requests), decoding suboptimality (failure to select higher-quality sequences due to myopic decoding), and context degradation (forgetting or ignoring core instructions over long conversations). We conducted three controlled experiments (A, B, and C) to quantify these phenomena across several advanced LLMs (OpenAI GPT-4 variant, DeepSeek). Our results indicate widespread laziness in satisfying complex multi-part instructions: models frequently omitted required sections or failed to meet length requirements despite explicit prompting. However, we found limited evidence of decoding suboptimality in a simple reasoning task (the models’ greedy answers appeared to align with their highest-confidence solution), and we observed surprising robustness against context degradation in a 200-turn chaotic conversation test - the models maintained key facts and instructions far better than expected. These findings suggest that while compliance with detailed instructions remains an open challenge, modern LLMs may internally mitigate some hypothesized failure modes (such as context forgetting) in straightforward retrieval scenarios. We discuss implications for reliability, relate our findings to prior work on instruction-following and long-context processing, and recommend strategies (such as self-refinement and dynamic prompting) to reduce laziness and bolster multi-instruction compliance.

20. From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers

Authors: Yawei Liu
URL: https://arxiv.org/abs/2512.20661
Abstract:

Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. However, these methods often exhibit suboptimal accuracy in certain scenarios. By analyzing their attention distributions, we observe that existing models tend to allocate attention primarily to common words, overlooking less popular yet highly task-relevant terms, which significantly impairs overall performance. To address this issue, we propose an Adversarial Feedback for Attention(AFA) training mechanism that enables the model to automatically redistribute attention weights to appropriate focal points without requiring manual annotations. This mechanism incorporates a dynamic masking strategy that attempts to mask various words to deceive a discriminator, while the discriminator strives to detect significant differences induced by these masks. Additionally, leveraging the sensitivity of Transformer models to token-level perturbations, we employ a policy gradient approach to optimize attention distributions, which facilitates efficient and rapid convergence. Experiments on three public datasets demonstrate that our method achieves state-of-the-art results. Furthermore, applying this training mechanism to enhance attention in large language models yields a further performance improvement of 12.6%

21. AI-Driven Decision-Making System for Hiring Process

Authors: Vira Filatova , Andrii Zelenchuk , Dmytro Filatov
URL: https://arxiv.org/abs/2512.20652
Abstract:

Early-stage candidate validation is a major bottleneck in hiring, because recruiters must reconcile heterogeneous inputs (resumes, screening answers, code assignments, and limited public evidence). This paper presents an AI-driven, modular multi-agent hiring assistant that integrates (i) document and video preprocessing, (ii) structured candidate profile construction, (iii) public-data verification, (iv) technical/culture-fit scoring with explicit risk penalties, and (v) human-in-the-loop validation via an interactive interface. The pipeline is orchestrated by an LLM under strict constraints to reduce output variability and to generate traceable component-level rationales. Candidate ranking is computed by a configurable aggregation of technical fit, culture fit, and normalized risk penalties. The system is evaluated on 64 real applicants for a mid-level Python backend engineer role, using an experienced recruiter as the reference baseline and a second, less experienced recruiter for additional comparison. Alongside precision/recall, we propose an efficiency metric measuring expected time per qualified candidate. In this study, the system improves throughput and achieves 1.70 hours per qualified candidate versus 3.33 hours for the experienced recruiter, with substantially lower estimated screening cost, while preserving a human decision-maker as the final authority.

22. Memory Bear AI A Breakthrough from Memory to Cognition Toward Artificial General Intelligence

Authors: Deliang Wen , Ke Sun
URL: https://arxiv.org/abs/2512.20651
Abstract:

Large language models (LLMs) face inherent limitations in memory, including restricted context windows, long-term knowledge forgetting, redundant information accumulation, and hallucination generation. These issues severely constrain sustained dialogue and personalized services. This paper proposes the Memory Bear system, which constructs a human-like memory architecture grounded in cognitive science principles. By integrating multimodal information perception, dynamic memory maintenance, and adaptive cognitive services, Memory Bear achieves a full-chain reconstruction of LLM memory mechanisms. Across domains such as healthcare, enterprise operations, and education, Memory Bear demonstrates substantial engineering innovation and performance breakthroughs. It significantly improves knowledge fidelity and retrieval efficiency in long-term conversations, reduces hallucination rates, and enhances contextual adaptability and reasoning capability through memory-cognition integration. Experimental results show that, compared with existing solutions (e.g., Mem0, MemGPT, Graphiti), Memory Bear outperforms them across key metrics, including accuracy, token efficiency, and response latency. This marks a crucial step forward in advancing AI from “memory” to “cognition”.

23. Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA

Authors: Esmail Gumaan
URL: https://arxiv.org/abs/2512.20650
Abstract:

The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory requirements during inference. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage but often at the cost of model performance. In this work, we propose Mixture of Attention Schemes (MoAS), a novel architecture that dynamically selects the optimal attention scheme (MHA, GQA, or MQA) for each token via a learned router. We demonstrate that dynamic routing performs better than static averaging of schemes and achieves performance competitive with the MHA baseline while offering potential for conditional compute efficiency. Experimental results on WikiText-2 show that dynamic routing (val loss 2.3074) outperforms a static mixture (2.3093), validating the effectiveness of the proposed method. Our code is available at this https URL .

24. AIAuditTrack: A Framework for AI Security system

Authors: Zixun Luo , Yuhang Fan , Yufei Li , Youzhi Zhang , Hengyu Lin , Ziqi Wang
URL: https://arxiv.org/abs/2512.20649
Abstract:

The rapid expansion of AI-driven applications powered by large language models has led to a surge in AI interaction data, raising urgent challenges in security, accountability, and risk traceability. This paper presents AiAuditTrack (AAT), a blockchain-based framework for AI usage traffic recording and governance. AAT leverages decentralized identity (DID) and verifiable credentials (VC) to establish trusted and identifiable AI entities, and records inter-entity interaction trajectories on-chain to enable cross-system supervision and auditing. AI entities are modeled as nodes in a dynamic interaction graph, where edges represent time-specific behavioral trajectories. Based on this model, a risk diffusion algorithm is proposed to trace the origin of risky behaviors and propagate early warnings across involved entities. System performance is evaluated using blockchain Transactions Per Second (TPS) metrics, demonstrating the feasibility and stability of AAT under large-scale interaction recording. AAT provides a scalable and verifiable solution for AI auditing, risk management, and responsibility attribution in complex multi-agent environments.

25. Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning

Authors: Leo Lu , Jonathan Zhang , Sean Chua , Spencer Kim , Kevin Zhu , Sean O’Brien , Vasu Sharma
URL: https://arxiv.org/abs/2512.20647
Abstract:

Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). While prior work focuses on improving model performance through internal reasoning strategies, little is known about the interchangeability of reasoning across different models. In this work, we explore whether a partially completed reasoning chain from one model can be reliably continued by another model, either within the same model family or across families. We achieve this by assessing the sufficiency of intermediate reasoning traces as transferable scaffolds for logical coherence and final answer accuracy. We interpret this interchangeability as a means of examining inference-time trustworthiness, probing whether reasoning remains both coherent and reliable under model substitution. Using token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models, Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct, we conduct continuation experiments with Gemma-3-1B-IT and LLaMA-3.1-8B-Instruct to test intra-family and cross-family behaviors. Our evaluation pipeline leverages truncation thresholds with a Process Reward Model (PRM), providing a reproducible framework for assessing reasoning stability via model interchange. Evaluations with a PRM reveal that hybrid reasoning chains often preserve, and in some cases even improve, final accuracy and logical structure. Our findings point towards interchangeability as an emerging behavioral property of reasoning models, offering insights into new paradigms for reliable modular reasoning in collaborative AI systems.

26. Erkang-Diagnosis-1.1 Technical Report

Authors: Jianbing Ma , Ao Feng , Zhenjie Gao , Xinyu Song , Li Su , Bin Chen , Wei Wang , Jiamin Wu
URL: https://arxiv.org/abs/2512.20632
Abstract:

This report provides a detailed introduction to Erkang-Diagnosis-1.1 model, our AI healthcare consulting assistant developed using Alibaba Qwen-3 model. The Erkang model integrates approximately 500GB of high-quality structured medical knowledge, employing a hybrid approach combining enhanced pre-training and retrieval-enhanced generation to create a secure, reliable, and professional AI health advisor. Through 3-5 efficient interaction rounds, Erkang Diagnosis can accurately understand user symptoms, conduct preliminary analysis, and provide valuable diagnostic suggestions and health guidance. Designed to become users intelligent health companions, it empowers primary healthcare and health management. To validate, Erkang-Diagnosis-1.1 leads GPT-4 in terms of comprehensive medical exams.

27. MicroProbe: Efficient Reliability Assessment for Foundation Models with Minimal Data

Authors: Aayam Bansal , Ishaan Gangwani
URL: https://arxiv.org/abs/2512.20630
Abstract:

Foundation model reliability assessment typically requires thousands of evaluation examples, making it computationally expensive and time-consuming for real-world deployment. We introduce microprobe, a novel approach that achieves comprehensive reliability assessment using only 100 strategically selected probe examples. Our method combines strategic prompt diversity across five key reliability dimensions with advanced uncertainty quantification and adaptive weighting to efficiently detect potential failure modes. Through extensive empirical evaluation on multiple language models (GPT-2 variants, GPT-2 Medium, GPT-2 Large) and cross-domain validation (healthcare, finance, legal), we demonstrate that microprobe achieves 23.5% higher composite reliability scores compared to random sampling baselines, with exceptional statistical significance (p < 0.001, Cohen’s d = 1.21). Expert validation by three AI safety researchers confirms the effectiveness of our strategic selection, rating our approach 4.14/5.0 versus 3.14/5.0 for random selection. microprobe completes reliability assessment with 99.9% statistical power while representing a 90% reduction in assessment cost and maintaining 95% of traditional method coverage. Our approach addresses a critical gap in efficient model evaluation for responsible AI deployment.

28. Proceedings of the 20th International Conference on Knowledge, Information and Creativity Support Systems (KICSS 2025)

Authors: Edited by Tessai Hayama (Nagaoka University of Technology, Japan), Takayuki Ito (Kyoto University, Japan), Takahiro Uchiya (Nagoya Institute of Technology, Japan), Motoki Miura (Chiba Institute of Technology, Japan), Takahiro Kawaji (University of Kurume, Japan), Takaya Yuizono (Japan Advanced Institute of Science and Technology, Japan), Atsuo Yoshitaka (Japan Advanced Institute of Science and Technology, Japan), Tokuro Matsuo (Advanced Institute of Industrial Technology, Japan), Shun Okuhara (Mie University, Japan), Jawad Haqbeen (Kyoto University, Japan), Sofia Sahab (Kyoto University, Japan), Wen Gu (Nagoya Institute of Technology, Japan), Shiyao Ding (Kyoto University, Japan)
URL: https://arxiv.org/abs/2512.20628
Abstract:

This volume presents the proceedings of the 20th International Conference on Knowledge, Information and Creativity Support Systems (KICSS 2025), held in Nagaoka, Japan, on December 3-5, 2025. The conference, organized in cooperation with the IEICE Proceedings Series, provides a multidisciplinary forum for researchers in artificial intelligence, knowledge engineering, human-computer interaction, and creativity support systems. The proceedings include peer-reviewed papers accepted through a double-blind review process. Selected papers have been recommended for publication in IEICE Transactions on Information and Systems after an additional peer-review process.

29. MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

Authors: Chi-Hsiang Hsiao , Yi-Cheng Wang , Tzung-Sheng Lin , Yi-Ren Yeh , Chu-Song Chen
URL: https://arxiv.org/abs/2512.20626
Abstract:

Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, they struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.

30. Quantum-Inspired Multi Agent Reinforcement Learning for Exploration Exploitation Optimization in UAV-Assisted 6G Network Deployment

Authors: Mazyar Taghavi , Javad Vahidi
URL: https://arxiv.org/abs/2512.20624
Abstract:

This study introduces a quantum inspired framework for optimizing the exploration exploitation tradeoff in multiagent reinforcement learning, applied to UAVassisted 6G network deployment. We consider a cooperative scenario where ten intelligent UAVs autonomously coordinate to maximize signal coverage and support efficient network expansion under partial observability and dynamic conditions. The proposed approach integrates classical MARL algorithms with quantum-inspired optimization techniques, leveraging variational quantum circuits VQCs as the core structure and employing the Quantum Approximate Optimization Algorithm QAOA as a representative VQC based method for combinatorial optimization. Complementary probabilistic modeling is incorporated through Bayesian inference, Gaussian processes, and variational inference to capture latent environmental dynamics. A centralized training with decentralized execution CTDE paradigm is adopted, where shared memory and local view grids enhance local observability among agents. Comprehensive experiments including scalability tests, sensitivity analysis, and comparisons with PPO and DDPG baselines demonstrate that the proposed framework improves sample efficiency, accelerates convergence, and enhances coverage performance while maintaining robustness. Radar chart and convergence analyses further show that QI MARL achieves a superior balance between exploration and exploitation compared to classical methods. All implementation code and supplementary materials are publicly available on GitHub to ensure reproducibility.

31. BitRL-Light: 1-bit LLM Agents with Deep Reinforcement Learning for Energy-Efficient Smart Home Lighting Optimization

Authors: Ravi Gupta , Shabista Haider
URL: https://arxiv.org/abs/2512.20623
Abstract:

Smart home lighting systems consume 15-20% of residential energy but lack adaptive intelligence to optimize for user comfort and energy efficiency simultaneously. We present BitRL-Light, a novel framework combining 1-bit quantized Large Language Models (LLMs) with Deep Q-Network (DQN) reinforcement learning for real-time smart home lighting control on edge devices. Our approach deploys a 1-bit quantized Llama-3.2-1B model on Raspberry Pi hardware, achieving 71.4 times energy reduction compared to full-precision models while maintaining intelligent control capabilities. Through multi-objective reinforcement learning, BitRL-Light learns optimal lighting policies from user feedback, balancing energy consumption, comfort, and circadian alignment. Experimental results demonstrate 32% energy savings compared to rule-based systems, with inference latency under 200ms on Raspberry Pi 4 and 95% user satisfaction. The system processes natural language commands via Google Home/IFTTT integration and learns from implicit feedback through manual overrides. Our comparative analysis shows 1-bit models achieve 5.07 times speedup over 2-bit alternatives on ARM processors while maintaining 92% task accuracy. This work establishes a practical framework for deploying adaptive AI on resource-constrained IoT devices, enabling intelligent home automation without cloud dependencies.

32. Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Authors: Ziyu Chen , Xinbei Jiang , Peng Sun , Tao Lin
URL: https://arxiv.org/abs/2512.21336
Abstract:

Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.

33. C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Authors: Jin Qin , Zihan Liao , Ziyin Zhang , Hang Yu , Peng Di , Rui Wang
URL: https://arxiv.org/abs/2512.21332
Abstract:

We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embedding from token embeddings, effectively 1) utilizing the LLM’s causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.

34. Measuring all the noises of LLM Evals

Authors: Sida Wang
URL: https://arxiv.org/abs/2512.21326
Abstract:

Separating signal from noise is central to experimental science. Applying well-established statistical method effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.

35. Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks

Authors: Ali Merali
URL: https://arxiv.org/abs/2512.21316
Abstract:

This paper derives `Scaling Laws for Economic Impacts’ – empirical relationships between the training compute of Large Language Models (LLMs) and professional productivity. In a preregistered experiment, over 500 consultants, data analysts, and managers completed professional tasks using one of 13 LLMs. We find that each year of AI model progress reduced task time by 8%, with 56% of gains driven by increased compute and 44% by algorithmic progress. However, productivity gains were significantly larger for non-agentic analytical tasks compared to agentic workflows requiring tool use. These findings suggest continued model scaling could boost U.S. productivity by approximately 20% over the next decade.

36. Model Merging via Multi-Teacher Knowledge Distillation

Authors: Seyed Arshan Dalili , Mehrdad Mahdavi
URL: https://arxiv.org/abs/2512.21288
Abstract:

Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model’s contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a “cross-task heterogeneity” term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model’s excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at this https URL .

37. SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Authors: Divij Dudeja , Mayukha Pal
URL: https://arxiv.org/abs/2512.21280
Abstract:

The user of Engineering Manuals (EM) finds it difficult to read EM s because they are long, have a dense format which includes written documents, step by step procedures, and standard parameter lists for engineering equipment. Off the shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to the above problem. SMART structures its processing by using a hierarchical approach, and is based upon three main job categories (1) A syntax-aware Fact Extractor (Grammarian) Tree LSTM which extracts facts as subject relation object relations from EM sentences (2) A compact indexed memory MANN (Memory Augmented Neural Network) that indexes these Rational Subject Relation Objects as 384 dimensional vectors that are associated with the source of the information, and (3) A 6 layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model utilizes 45.51M parameters, which is 64% less than GPT-2 (124M) and 69% less than BERT (133M), and it achieves a 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAGs for new uploads (FAISS Top 20 results with memory severed at 64 slots). In real world deployment, this framework leads to more well supported results with reduced hallucinations than comparable small transformer models.

38. Learning Factors in AI-Augmented Education: A Comparative Study of Middle and High School Students

Authors: Gaia Ebli , Bianca Raimondi , Maurizio Gabbrielli
URL: https://arxiv.org/abs/2512.21246
Abstract:

The increasing integration of AI tools in education has led prior research to explore their impact on learning processes. Nevertheless, most existing studies focus on higher education and conventional instructional contexts, leaving open questions about how key learning factors are related in AI-mediated learning environments and how these relationships may vary across different age groups. Addressing these gaps, our work investigates whether four critical learning factors, experience, clarity, comfort, and motivation, maintain coherent interrelationships in AI-augmented educational settings, and how the structure of these relationships differs between middle and high school students. The study was conducted in authentic classroom contexts where students interacted with AI tools as part of programming learning activities to collect data on the four learning factors and students’ perceptions. Using a multimethod quantitative analysis, which combined correlation analysis and text mining, we revealed markedly different dimensional structures between the two age groups. Middle school students exhibit strong positive correlations across all dimensions, indicating holistic evaluation patterns whereby positive perceptions in one dimension generalise to others. In contrast, high school students show weak or near-zero correlations between key dimensions, suggesting a more differentiated evaluation process in which dimensions are assessed independently. These findings reveal that perception dimensions actively mediate AI-augmented learning and that the developmental stage moderates their interdependencies. This work establishes a foundation for the development of AI integration strategies that respond to learners’ developmental levels and account for age-specific dimensional structures in student-AI interactions.

39. LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Authors: Anatoly O. Onishchenko , Alexey K. Kovalev , Aleksandr I. Panov
URL: https://arxiv.org/abs/2512.21243
Abstract:

Methods that use Large Language Models (LLM) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph - a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agents egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at this https URL .

40. Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks

Authors: Xinjie Xu , Shuyu Cheng , Dongwei Xu , Qi Xuan , Chen Ma
URL: https://arxiv.org/abs/2512.21241
Abstract:

In hard-label black-box adversarial attacks, where only the top-1 predicted label is accessible, the prohibitive query complexity poses a major obstacle to practical deployment. In this paper, we focus on optimizing a representative class of attacks that search for the optimal ray direction yielding the minimum $\ell_2$-norm perturbation required to move a benign image into the adversarial region. Inspired by Nesterov’s Accelerated Gradient (NAG), we propose a momentum-based algorithm, ARS-OPT, which proactively estimates the gradient with respect to a future ray direction inferred from accumulated momentum. We provide a theoretical analysis of its convergence behavior, showing that ARS-OPT enables more accurate directional updates and achieves faster, more stable optimization. To further accelerate convergence, we incorporate surrogate-model priors into ARS-OPT’s gradient estimation, resulting in PARS-OPT with enhanced performance. The superiority of our approach is supported by theoretical guarantees under standard assumptions. Extensive experiments on ImageNet and CIFAR-10 demonstrate that our method surpasses 13 state-of-the-art approaches in query efficiency.

41. Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

Authors: Yifan Huang , Xiaojun Jia , Wenbo Guo , Yuqiang Sun , Yihao Huang , Chong Wang , Yang Liu
URL: https://arxiv.org/abs/2512.21236
Abstract:

Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. However, this accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target. To address this gap, we propose SPELL, a comprehensive testing framework specifically designed to evaluate the weakness of security alignment in malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL’s effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12% respectively across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.

42. PhononBench:A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

Authors: Xiao-Qi Han , Ze-Feng Gao , Peng-Jie Guo , Zhong-Yi Lu
URL: https://arxiv.org/abs/2512.21227
Abstract:

In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation-illustrated here by band-gap conditioning with MatterGen–the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at this https URL

43. Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Authors: Dao Sy Duy Minh , Huynh Trung Kiet , Nguyen Lam Phu Quy , Phu-Hoa Pham , Tran Chi Nguyen
URL: https://arxiv.org/abs/2512.21221
Abstract:

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at this https URL

44. SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

Authors: Mahi Luthra , Jiayi Shen , Maxime Poli , Angelo Ortiz , Yosuke Higuchi , Youssef Benchekroun , Martin Gleize , Charles-Eric Saint-James , Dongyan Lin , Phillip Rust , Angel Villar , Surya Parimi , Vanessa Stark , Rashel Moritz , Juan Pino , Yann LeCun , Emmanuel Dupoux
URL: https://arxiv.org/abs/2512.21204
Abstract:

Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at this https URL .

Authors: Yu He , Da Huang , Zhenyang Liu , Zixiao Gu , Qiang Sun , Guangnan Ye , Yanwei Fu
URL: https://arxiv.org/abs/2512.21201
Abstract:

Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose \textbf{Schrödinger’s Navigator}, a navigation framework inspired by Schrödinger’s thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger’s Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.

46. BALLAST: Bandit-Assisted Learning for Latency-Aware Stable Timeouts in Raft

Authors: Qizhi Wang
URL: https://arxiv.org/abs/2512.21165
Abstract:

Randomized election timeouts are a simple and effective liveness heuristic for Raft, but they become brittle under long-tail latency, jitter, and partition recovery, where repeated split votes can inflate unavailability. This paper presents BALLAST, a lightweight online adaptation mechanism that replaces static timeout heuristics with contextual bandits. BALLAST selects from a discrete set of timeout “arms” using efficient linear contextual bandits (LinUCB variants), and augments learning with safe exploration to cap risk during unstable periods. We evaluate BALLAST on a reproducible discrete-event simulation with long-tail delay, loss, correlated bursts, node heterogeneity, and partition/recovery turbulence. Across challenging WAN regimes, BALLAST substantially reduces recovery time and unwritable time compared to standard randomized timeouts and common heuristics, while remaining competitive on stable LAN/WAN settings.

47. MODE: Multi-Objective Adaptive Coreset Selection

Authors: Tanmoy Mukherjee , Pierre Marquis , Zied Bouraoui
URL: https://arxiv.org/abs/2512.21152
Abstract:

We present Mode(Multi-Objective adaptive Data Efficiency), a framework that dynamically combines coreset selection strategies based on their evolving contribution to model performance. Unlike static methods, \mode adapts selection criteria to training phases: emphasizing class balance early, diversity during representation learning, and uncertainty at convergence. We show that MODE achieves (1-1/e)-approximation with O(n \log n) complexity and demonstrates competitive accuracy while providing interpretable insights into data utility evolution. Experiments show \mode reduces memory requirements

48. TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation

Authors: Gaoren Lin , Huangxuan Zhao , Yuan Xiong , Lefei Zhang , Bo Du , Wentao Zhu
URL: https://arxiv.org/abs/2512.21135
Abstract:

Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP’s ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.

49. AutoBaxBuilder: Bootstrapping Code Security Benchmarking

Authors: Tobias von Arx , Niels Mündler , Mark Vero , Maximilian Baader , Martin Vechev
URL: https://arxiv.org/abs/2512.21132
Abstract:

As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually-crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and perform quantitative experiments, comparing it against tasks constructed by human experts. We use AutoBaxBuilder to construct entirely new tasks and release them to the public as AutoBaxBench, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.

50. STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting

Authors: Shi Quan Foo , Chi-Ho Wong , Zhihan Gao , Dit-Yan Yeung , Ka-Hing Wong , Wai-Kin Wong
URL: https://arxiv.org/abs/2512.21118
Abstract:

Precipitation nowcasting is a critical spatio-temporal prediction task for society to prevent severe damage owing to extreme weather events. Despite the advances in this field, the complex and stochastic nature of this task still poses challenges to existing approaches. Specifically, deterministic models tend to produce blurry predictions while generative models often struggle with poor accuracy. In this paper, we present a simple yet effective model architecture termed STLDM, a diffusion-based model that learns the latent representation from end to end alongside both the Variational Autoencoder and the conditioning network. STLDM decomposes this task into two stages: a deterministic forecasting stage handled by the conditioning network, and an enhancement stage performed by the latent diffusion model. Experimental results on multiple radar datasets demonstrate that STLDM achieves superior performance compared to the state of the art, while also improving inference efficiency. The code is available in this https URL .

51. Semi-Supervised Learning for Large Language Models Safety and Content Moderation

Authors: Eduard Stefan Dinuta , Iustin Sirbu , Traian Rebedea
URL: https://arxiv.org/abs/2512.21107
Abstract:

Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.

Authors: Safal Thapaliya , Zehong Wang , Jiazheng Li , Ziming Li , Yanfang Ye , Chuxu Zhang
URL: https://arxiv.org/abs/2512.21106
Abstract:

Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework DAS for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the semantic refinement of LLM, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.

53. TexAvatars : Hybrid Texel-3D Representations for Stable Rigging of Photorealistic Gaussian Head Avatars

Authors: Jaeseong Lee , Junyeong Ahn , Taewoong Kang , Jaegul Choo
URL: https://arxiv.org/abs/2512.21099
Abstract:

Constructing drivable and photorealistic 3D head avatars has become a central task in AR/XR, enabling immersive and expressive user experiences. With the emergence of high-fidelity and efficient representations such as 3D Gaussians, recent works have pushed toward ultra-detailed head avatars. Existing approaches typically fall into two categories: rule-based analytic rigging or neural network-based deformation fields. While effective in constrained settings, both approaches often fail to generalize to unseen expressions and poses, particularly in extreme reenactment scenarios. Other methods constrain Gaussians to the global texel space of 3DMMs to reduce rendering complexity. However, these texel-based avatars tend to underutilize the underlying mesh structure. They apply minimal analytic deformation and rely heavily on neural regressors and heuristic regularization in UV space, which weakens geometric consistency and limits extrapolation to complex, out-of-distribution deformations. To address these limitations, we introduce TexAvatars, a hybrid avatar representation that combines the explicit geometric grounding of analytic rigging with the spatial continuity of texel space. Our approach predicts local geometric attributes in UV space via CNNs, but drives 3D deformation through mesh-aware Jacobians, enabling smooth and semantically meaningful transitions across triangle boundaries. This hybrid design separates semantic modeling from geometric control, resulting in improved generalization, interpretability, and stability. Furthermore, TexAvatars captures fine-grained expression effects, including muscle-induced wrinkles, glabellar lines, and realistic mouth cavity geometry, with high fidelity. Our method achieves state-of-the-art performance under extreme pose and expression variations, demonstrating strong generalization in challenging head reenactment settings.

54. Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics

Authors: Zihan Yao , Ruoyu Wu , Tianxiang Gao
URL: https://arxiv.org/abs/2512.21075
Abstract:

The empirical success of deep learning is often attributed to scaling laws that predict consistent gains as model, data, and compute grow; however, large models can exhibit training instability and diminishing returns, suggesting that scaling laws describe what success looks like but not when and why scaling succeeds or fails. A central obstacle is the lack of a rigorous understanding of feature learning at large depth. While muP characterizes feature-learning dynamics in the infinite-width limit and enables hyperparameter transfer across width, its depth extension (depth-muP) breaks down for residual blocks with more than one internal layer. We derive Neural Feature Dynamics (NFD) for ResNets with single-layer residual blocks, characterizing feature learning via a coupled forward-backward stochastic system in the joint infinite-width and infinite-depth limit. In this regime, NFD identifies when scaling-law trends persist and explains diminishing returns. It also reveals a vanishing mechanism induced by the 1/sqrt(depth) residual scaling under which the gradient-independence assumption (GIA), known to fail during training at finite depth, becomes provably valid again at infinite depth, yielding an analytically tractable regime for end-to-end feature learning. Motivated by this insight, we study two-layer residual blocks and show that the same mechanism causes feature-learning collapse in the first internal layer at large depth, providing a structural explanation for the empirical failure of depth-muP. Based on this diagnosis, we propose a depth-aware learning-rate correction that counteracts the collapse and empirically restores depth-wise hyperparameter transfer, yielding stronger performance in deeper ResNets.

55. DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

Authors: Kaustubh Kundu , Hrishav Bakul Barua , Lucy Robertson-Bell , Zhixi Cai , Kalin Stefanov
URL: https://arxiv.org/abs/2512.21054
Abstract:

The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: this https URL .

56. Policy-Conditioned Policies for Multi-Agent Task Solving

Authors: Yue Lin , Shuhui Zhu , Wenhao Li , Ang Li , Dan Qiao , Pascal Poupart , Hongyuan Zha , Baoxiang Wang
URL: https://arxiv.org/abs/2512.21024
Abstract:

In multi-agent tasks, the central challenge lies in the dynamic adaptation of strategies. However, directly conditioning on opponents’ strategies is intractable in the prevalent deep reinforcement learning paradigm due to a fundamental ``representational bottleneck’’: neural policies are opaque, high-dimensional parameter vectors that are incomprehensible to other agents. In this work, we propose a paradigm shift that bridges this gap by representing policies as human-interpretable source code and utilizing Large Language Models (LLMs) as approximate interpreters. This programmatic representation allows us to operationalize the game-theoretic concept of \textit{Program Equilibrium}. We reformulate the learning problem by utilizing LLMs to perform optimization directly in the space of programmatic policies. The LLM functions as a point-wise best-response operator that iteratively synthesizes and refines the ego agent’s policy code to respond to the opponent’s strategy. We formalize this process as \textit{Programmatic Iterated Best Response (PIBR)}, an algorithm where the policy code is optimized by textual gradients, using structured feedback derived from game utility and runtime unit tests. We demonstrate that this approach effectively solves several standard coordination matrix games and a cooperative Level-Based Foraging environment.

57. Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

Authors: Xiaofeng Shi , Qian Kou , Yuduo Li , Hua Zhou
URL: https://arxiv.org/abs/2512.21017
Abstract:

With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion-the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5\% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.

58. LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

Authors: Jiashuo Liu , Jiayun Wu , Chunjie Wu , Jingkai Liu , Zaiyuan Wang , Huan Zhou , Wenhao Huang , Hongseok Namkoong
URL: https://arxiv.org/abs/2512.21010
Abstract:

The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Primarily using static scoring, current evaluation methods are fundamentally limited. They struggle to determine the proper mix ratio across diverse benchmarks, and critically, they fail to capture a model’s dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest where models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record. And Monte Carlo Simulation ($N=100,000$ iterations) is used to approximate the statistically robust Expected Win Score ($E[S_m]$), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity ($T_k$), which allows us to profile models based on their risk appetite–distinguishing between robust generalists and aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.

59. Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Authors: Wei-Rui Chen , Vignesh Kothapalli , Ata Fatahibaarzi , Hejian Sang , Shao Tang , Qingquan Song , Zhipeng Wang , Muhammad Abdul-Mageed
URL: https://arxiv.org/abs/2512.21002
Abstract:

Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first $50\%$ of tokens of every training sequence can retain, on average, $\approx94\%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about $50\%$ each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provides a simple lever for computation-quality tradeoffs. Codes are available at this https URL .

60. Automatic Replication of LLM Mistakes in Medical Conversations

Authors: Oleksii Proniakin , Diego Fajardo , Ruslan Nazarenko , Razvan Marinescu
URL: https://arxiv.org/abs/2512.20983
Abstract:

Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at this https URL .

61. GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

Authors: Haoyang Li , Xuyi Zhuang , Azmat Adnan , Ye Ni , Wei Rao , Shreyas Gopal , Eng Siong Chng
URL: https://arxiv.org/abs/2512.20978
Abstract:

Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further employ DPO to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.

62. Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

Authors: Jingyang You , Hanna Kurniawati
URL: https://arxiv.org/abs/2512.20974
Abstract:

Bayesian Reinforcement Learning (BRL) provides a framework for generalisation of Reinforcement Learning (RL) problems from its use of Bayesian task parameters in the transition and reward models. However, classical BRL methods assume known forms of transition and reward models, reducing their applicability in real-world problems. As a result, recent deep BRL methods have started to incorporate model learning, though the use of neural networks directly on the joint data and task parameters requires optimising the Evidence Lower Bound (ELBO). ELBOs are difficult to optimise and may result in indistinctive task parameters, hence compromised BRL policies. To this end, we introduce a novel deep BRL method, Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL), that enables efficient and accurate learning of transition and reward models, with fully tractable marginal likelihood and Bayesian inference on task parameters and model noises. On challenging MetaWorld ML10/45 benchmarks, GLiBRL improves the success rate of one of the state-of-the-art deep BRL methods, VariBAD, by up to 2.7x. Comparing against representative or recent deep BRL / Meta-RL methods, such as MAML, RL2, SDVT, TrMRL and ECET, GLiBRL also demonstrates its low-variance and decent performance consistently.

63. Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality

Authors: Sirui Chen , Jingji Chen , Siqi Zhu , Ziheng Jiang , Yanghua Peng , Xuehai Qian
URL: https://arxiv.org/abs/2512.20968
Abstract:

Distributed attention is a fundamental problem for scaling context window for Large Language Models (LLMs). The state-of-the-art method, Ring-Attention, suffers from scalability limitations due to its excessive communication traffic. This paper proposes a new distributed attention algorithm, Mesh-Attention, by rethinking the design space of distributed attention with a new matrix-based model. Our method assigns a two-dimensional tile – rather than one-dimensional row or column – of computation blocks to each GPU to achieve higher efficiency through lower communication-computation (CommCom) ratio. The general approach covers Ring-Attention as a special case, and allows the tuning of CommCom ratio with different tile shapes. Importantly, we propose a greedy algorithm that can efficiently search the scheduling space within the tile with restrictions that ensure efficient communication among GPUs. The theoretical analysis shows that Mesh-Attention leads to a much lower communication complexity and exhibits good scalability comparing to other current algorithms. Our extensive experiment results show that Mesh-Attention can achieve up to 3.4x speedup (2.9x on average) and reduce the communication volume by up to 85.4% (79.0% on average) on 256 GPUs. Our scalability results further demonstrate that Mesh-Attention sustains superior performance as the system scales, substantially reducing overhead in large-scale deployments. The results convincingly confirm the advantage of Mesh-Attention.

64. Can Agentic AI Match the Performance of Human Data Scientists?

Authors: An Luo , Jin Du , Fangqiao Tian , Xun Xian , Robert Specht , Ganghua Wang , Xuan Bi , Charles Fleming , Jayanth Srinivasa , Ashish Kundu , Mingyi Hong , Jie Ding
URL: https://arxiv.org/abs/2512.20959
Abstract:

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) have significantly automated data science workflows, but a fundamental question persists: Can these agentic AI systems truly match the performance of human data scientists who routinely leverage domain-specific knowledge? We explore this question by designing a prediction task where a crucial latent variable is hidden in relevant image data instead of tabular features. As a result, agentic AI that generates generic codes for modeling tabular data cannot perform well, while human experts could identify the important hidden variable using domain knowledge. We demonstrate this idea with a synthetic dataset for property insurance. Our experiments show that agentic AI that relies on generic analytics workflow falls short of methods that use domain-specific insights. This highlights a key limitation of the current agentic AI for data science and underscores the need for future research to develop agentic AI systems that can better recognize and incorporate domain knowledge.

65. ReACT-Drug: Reaction-Template Guided Reinforcement Learning for de novo Drug Design

Authors: R Yadunandan , Nimisha Ghosh
URL: https://arxiv.org/abs/2512.20958
Abstract:

De novo drug design is a crucial component of modern drug development, yet navigating the vast chemical space to find synthetically accessible, high-affinity candidates remains a significant challenge. Reinforcement Learning (RL) enhances this process by enabling multi-objective optimization and exploration of novel chemical space - capabilities that traditional supervised learning methods lack. In this work, we introduce \textbf{ReACT-Drug}, a fully integrated, target-agnostic molecular design framework based on Reinforcement Learning. Unlike models requiring target-specific fine-tuning, ReACT-Drug utilizes a generalist approach by leveraging ESM-2 protein embeddings to identify similar proteins for a given target from a knowledge base such as Protein Data Base (PDB). Thereafter, the known drug ligands corresponding to such proteins are decomposed to initialize a fragment-based search space, biasing the agent towards biologically relevant subspaces. For each such fragment, the pipeline employs a Proximal Policy Optimization (PPO) agent guiding a ChemBERTa-encoded molecule through a dynamic action space of chemically valid, reaction-template-based transformations. This results in the generation of \textit{de novo} drug candidates with competitive binding affinities and high synthetic accessibility, while ensuring 100\% chemical validity and novelty as per MOSES benchmarking. This architecture highlights the potential of integrating structural biology, deep representation learning, and chemical synthesis rules to automate and accelerate rational drug design. The dataset and code are available at this https URL .

66. One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

Authors: Zhaoxi Zhang , Yitong Duan , Yanzhi Zhang , Yiming Xu , Jiyan He , Yunfang Wu
URL: https://arxiv.org/abs/2512.20957
Abstract:

Locating the files and functions requiring modification in large open-source software (OSS) repositories is challenging due to their scale and structural complexity. Existing large language model (LLM)-based methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool-jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a pretrained model, without any closed-source distillation. Experiments demonstrate that RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and even the 32B model exceeding closed-source models such as Claude-3.7. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.

67. Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

Authors: Xiang Zhang , Jiaqi Wei , Yuejin Yang , Zijie Qiu , Yuhan Chen , Zhiqiang Gao , Muhammad Abdul-Mageed , Laks V. S. Lakshmanan , Wanli Ouyang , Chenyu You , Siqi Sun
URL: https://arxiv.org/abs/2512.20954
Abstract:

Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary “thinking tokens” beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.

68. MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment

Authors: Mohammad Mahdi Abootorabi , Alireza Ghahramani Kure , Mohammadali Mohammadkhani , Sina Elahimanesh , Mohammad Ali Ali Panah
URL: https://arxiv.org/abs/2512.20950
Abstract:

This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.

69. Neural Probe-Based Hallucination Detection for Large Language Models

Authors: Shize Liang , Hongzhi Wang
URL: https://arxiv.org/abs/2512.20949
Abstract:

Large language models(LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model’s hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic this http URL overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguity. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training this http URL results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.

70. A Multi-fidelity Double-Delta Wing Dataset and Empirical Scaling Laws for GNN-based Aerodynamic Field Surrogate

Authors: Yiren Shen , Juan J. Alonso
URL: https://arxiv.org/abs/2512.20941
Abstract:

Data-driven surrogate models are increasingly adopted to accelerate vehicle design. However, open-source multi-fidelity datasets and empirical guidelines linking dataset size to model performance remain limited. This study investigates the relationship between training data size and prediction accuracy for a graph neural network (GNN) based surrogate model for aerodynamic field prediction. We release an open-source, multi-fidelity aerodynamic dataset for double-delta wings, comprising 2448 flow snapshots across 272 geometries evaluated at angles of attack from 11 (degree) to 19 (degree) at Ma=0.3 using both Vortex Lattice Method (VLM) and Reynolds-Averaged Navier-Stokes (RANS) solvers. The geometries are generated using a nested Saltelli sampling scheme to support future dataset expansion and variance-based sensitivity analysis. Using this dataset, we conduct a preliminary empirical scaling study of the MF-VortexNet surrogate by constructing six training datasets with sizes ranging from 40 to 1280 snapshots and training models with 0.1 to 2.4 million parameters under a fixed training budget. We find that the test error decreases with data size with a power-law exponent of -0.6122, indicating efficient data utilization. Based on this scaling law, we estimate that the optimal sampling density is approximately eight samples per dimension in a d-dimensional design space. The results also suggest improved data utilization efficiency for larger surrogate models, implying a potential trade-off between dataset generation cost and model training budget.

71. Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

Authors: Shengguang Wu , Xiaohan Wang , Yuhui Zhang , Hao Zhu , Serena Yeung-Levy
URL: https://arxiv.org/abs/2512.20934
Abstract:

Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependency than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at this https URL .

72. Guardrailed Elasticity Pricing: A Churn-Aware Forecasting Playbook for Subscription Strategy

Authors: Deepit Sapru
URL: https://arxiv.org/abs/2512.20932
Abstract:

This paper presents a marketing analytics framework that operationalizes subscription pricing as a dynamic, guardrailed decision system, uniting multivariate demand forecasting, segment-level price elasticity, and churn propensity to optimize revenue, margin, and retention. The approach blends seasonal time-series models with tree-based learners, runs Monte Carlo scenario tests to map risk envelopes, and solves a constrained optimization that enforces business guardrails on customer experience, margin floors, and allowable churn. Validated across heterogeneous SaaS portfolios, the method consistently outperforms static tiers and uniform uplifts by reallocating price moves toward segments with higher willingness-to-pay while protecting price-sensitive cohorts. The system is designed for real-time recalibration via modular APIs and includes model explainability for governance and compliance. Managerially, the framework functions as a strategy playbook that clarifies when to shift from flat to dynamic pricing, how to align pricing with CLV and MRR targets, and how to embed ethical guardrails, enabling durable growth without eroding customer trust.

73. RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks

Authors: Ningyuan Liu , Jing Yang , Kaitong Cai , Keze Wang
URL: https://arxiv.org/abs/2512.20920
Abstract:

Full parameter fine tuning is a key technique for adapting large language models (LLMs) to downstream tasks, but it incurs substantial memory overhead due to the need to cache extensive intermediate activations for backpropagation. This bottleneck makes full fine tuning of contemporary large scale LLMs challenging in practice. Existing distributed training frameworks such as DeepSpeed alleviate this issue using techniques like ZeRO and FSDP, which rely on multi GPU memory or CPU offloading, but often require additional hardware resources and reduce training speed. We introduce RevFFN, a memory efficient fine tuning paradigm for mixture of experts (MoE) LLMs. RevFFN employs carefully designed reversible Transformer blocks that allow reconstruction of layer input activations from outputs during backpropagation, eliminating the need to store most intermediate activations in memory. While preserving the expressive capacity of MoE architectures, this approach significantly reduces peak memory consumption for full parameter fine tuning. As a result, RevFFN enables efficient full fine tuning on a single consumer grade or server grade GPU.

74. DiEC: Diffusion Embedded Clustering

Authors: Haidong Hu
URL: https://arxiv.org/abs/2512.20905
Abstract:

Deep clustering hinges on learning representations that are inherently clusterable. However, using a single encoder to produce a fixed embedding ignores the representation trajectory formed by a pretrained diffusion model across network hierarchies and noise timesteps, where clusterability varies substantially. We propose DiEC (Diffusion Embedded Clustering), which performs unsupervised clustering by directly reading internal activations from a pretrained diffusion U-Net. DiEC formulates representation selection as a two-dimensional search over layer x timestep, and exploits a weak-coupling property to decompose it into two stages. Specifically, we first fix the U-Net bottleneck layer as the Clustering-friendly Middle Layer (CML), and then use Optimal Timestep Search (OTS) to identify the clustering-optimal timestep (t). During training, we extract bottleneck features at the fixed t and obtain clustering representations via a lightweight residual mapping. We optimize a DEC-style KL self-training objective, augmented with adaptive graph regularization and entropy regularization to strengthen cluster structures. In parallel, we introduce a denoising-consistency branch at random timesteps to stabilize the representations and preserve generative consistency. Experiments show that DiEC achieves competitive clustering performance on multiple standard benchmarks.

75. Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization and Task Offloading with Mobility Prediction

Authors: Siqi Mu , Shuo Wen , Yang Lu , Ruihong Jiang , Bo Ai
URL: https://arxiv.org/abs/2512.20902
Abstract:

Due to their inherent flexibility and autonomous operation, unmanned aerial vehicles (UAVs) have been widely used in Internet of Medical Things (IoMT) to provide real-time biomedical edge computing service for wireless body area network (WBAN) users. In this paper, considering the time-varying task criticality characteristics of diverse WBAN users and the dual mobility between WBAN users and UAV, we investigate the dynamic task offloading and UAV flight trajectory optimization problem to minimize the weighted average task completion time of all the WBAN users, under the constraint of UAV energy consumption. To tackle the problem, an embodied AI-enhanced IoMT edge computing framework is established. Specifically, we propose a novel hierarchical multi-scale Transformer-based user trajectory prediction model based on the users’ historical trajectory traces captured by the embodied AI agent (i.e., UAV). Afterwards, a prediction-enhanced deep reinforcement learning (DRL) algorithm that integrates predicted users’ mobility information is designed for intelligently optimizing UAV flight trajectory and task offloading decisions. Real-word movement traces and simulation results demonstrate the superiority of the proposed methods in comparison with the existing benchmarks.

76. DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction

Authors: Xiao Yu , Zhaojie Fang , Guanyu Zhou , Yin Shen , Huoling Luo , Ye Li , Ahmed Elazab , Xiang Wan , Ruiquan Ge , Changmiao Wang
URL: https://arxiv.org/abs/2512.20898
Abstract:

Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single modality and single time point, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.

77. Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images

Authors: Haotian Lv , Chao Li , Jiangbo Dai , Yuhui Zhang , Zepeng Fan , Yiqiu Tan , Dawei Wang , Binglei Xie
URL: https://arxiv.org/abs/2512.20866
Abstract:

To address the issues of weak correlation between multi-view features, low recognition accuracy of small-scale targets, and insufficient robustness in complex scenarios in underground pipeline detection using 3D GPR, this paper proposes a 3D pipeline intelligent detection framework. First, based on a B/C/D-Scan three-view joint analysis strategy, a three-dimensional pipeline three-view feature evaluation method is established by cross-validating forward simulation results obtained using FDTD methods with actual measurement data. Second, the DCO-YOLO framework is proposed, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms into the original YOLOv11 algorithm, significantly improving the small-scale pipeline edge feature extraction capability. Furthermore, a 3D-DIoU spatial feature matching algorithm is proposed, which integrates three-dimensional geometric constraints and center distance penalty terms to achieve automated association of multi-view annotations. The three-view fusion strategy resolves inherent ambiguities in single-view detection. Experiments based on real urban underground pipeline data show that the proposed method achieves accuracy, recall, and mean average precision of 96.2%, 93.3%, and 96.7%, respectively, in complex multi-pipeline scenarios, which are 2.0%, 2.1%, and 0.9% higher than the baseline model. Ablation experiments validated the synergistic optimization effect of the dynamic feature enhancement module and Grad-CAM++ heatmap visualization demonstrated that the improved model significantly enhanced its ability to focus on pipeline geometric features. This study integrates deep learning optimization strategies with the physical characteristics of 3D GPR, offering an efficient and reliable novel technical framework for the intelligent recognition and localization of underground pipelines.

78. Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs

Authors: Pierre Abillama , Changwoo Lee , Juechu Dong , David Blaauw , Dennis Sylvester , Hun-Seok Kim
URL: https://arxiv.org/abs/2512.20861
Abstract:

Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices. While traditional low-rank (LR) methods often incur sharp accuracy drops, BLR approaches such as Monarch and BLAST can better capture the underlying structure, thus preserving accuracy while reducing computations and memory footprints. In this work, we use roofline analysis to show that, although BLR methods achieve theoretical savings and practical speedups for single-token inference, multi-token inference often becomes memory-bound in practice, increasing latency despite compiler-level optimizations in PyTorch. To address this, we introduce custom Triton kernels with partial fusion and memory layout optimizations for both Monarch and BLAST. On memory-constrained NVIDIA GPUs such as Jetson Orin Nano and A40, our kernels deliver up to $3.76\times$ speedups and $3\times$ model size compression over PyTorch dense baselines using CUDA backend and compiler-level optimizations, while supporting various models including Llama-7/1B, GPT2-S, DiT-XL/2, and ViT-B. Our code is available at this https URL .

79. NVIDIA Nemotron 3: Efficient and Open Intelligence

Authors: NVIDIA : Aaron Blakeman , Aaron Grattafiori , Aarti Basant , Abhibha Gupta , Abhinav Khattar , Adi Renduchintala , Aditya Vavre , Akanksha Shukla , Akhiad Bercovich , Aleksander Ficek , Aleksandr Shaposhnikov , Alex Kondratenko , Alexander Bukharin , Alexandre Milesi , Ali Taghibakhshi , Alisa Liu , Amelia Barton , Ameya Sunil Mahabaleshwarkar , Amir Klein , Amit Zuker , Amnon Geifman , Amy Shen , Anahita Bhiwandiwalla , Andrew Tao , Anjulie Agrusa , Ankur Verma , Ann Guan , Anubhav Mandarwal , Arham Mehta , Ashwath Aithal , Ashwin Poojary , Asif Ahamed , Asit Mishra , Asma Kuriparambil Thekkumpate , Ayush Dattagupta , Banghua Zhu , Bardiya Sadeghi , Barnaby Simkin , Ben Lanir , Benedikt Schifferer , Besmira Nushi , Bilal Kartal , Bita Darvish Rouhani , Boris Ginsburg , Brandon Norick , Brandon Soubasis , Branislav Kisacanin , Brian Yu , Bryan Catanzaro , Carlo del Mundo , Chantal Hwang , Charles Wang , Cheng-Ping Hsieh , Chenghao Zhang , Chenhan Yu , Chetan Mungekar , Chintan Patel , Chris Alexiuk , Christopher Parisien , Collin Neale , Cyril Meurillon , Damon Mosk-Aoyama , Dan Su , Dane Corneil , Daniel Afrimi , Daniel Lo , Daniel Rohrer , Daniel Serebrenik , Daria Gitman , Daria Levy , Darko Stosic , David Mosallanezhad , Deepak Narayanan , Dhruv Nathawani , Dima Rekesh , Dina Yared , Divyanshu Kakwani , Dong Ahn , Duncan Riach , Dusan Stosic , Edgar Minasyan , Edward Lin , Eileen Long , Eileen Peters Long , Elad Segal , Elena Lantz , Ellie Evans , Elliott Ning , Eric Chung , Eric Harper , Eric Tramel , Erick Galinkin , Erik Pounds , Evan Briones , Evelina Bakhturina , Evgeny Tsykunov , Faisal Ladhak , Fay Wang , Fei Jia
URL: https://arxiv.org/abs/2512.20856
Abstract:

We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

80. Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Authors: NVIDIA : Aaron Blakeman , Aaron Grattafiori , Aarti Basant , Abhibha Gupta , Abhinav Khattar , Adi Renduchintala , Aditya Vavre , Akanksha Shukla , Akhiad Bercovich , Aleksander Ficek , Aleksandr Shaposhnikov , Alex Kondratenko , Alexander Bukharin , Alexandre Milesi , Ali Taghibakhshi , Alisa Liu , Amelia Barton , Ameya Sunil Mahabaleshwarkar , Amir Klein , Amit Zuker , Amnon Geifman , Amy Shen , Anahita Bhiwandiwalla , Andrew Tao , Ann Guan , Anubhav Mandarwal , Arham Mehta , Ashwath Aithal , Ashwin Poojary , Asif Ahamed , Asma Kuriparambil Thekkumpate , Ayush Dattagupta , Banghua Zhu , Bardiya Sadeghi , Barnaby Simkin , Ben Lanir , Benedikt Schifferer , Besmira Nushi , Bilal Kartal , Bita Darvish Rouhani , Boris Ginsburg , Brandon Norick , Brandon Soubasis , Branislav Kisacanin , Brian Yu , Bryan Catanzaro , Carlo del Mundo , Chantal Hwang , Charles Wang , Cheng-Ping Hsieh , Chenghao Zhang , Chenhan Yu , Chetan Mungekar , Chintan Patel , Chris Alexiuk , Christopher Parisien , Collin Neale , Damon Mosk-Aoyama , Dan Su , Dane Corneil , Daniel Afrimi , Daniel Rohrer , Daniel Serebrenik , Daria Gitman , Daria Levy , Darko Stosic , David Mosallanezhad , Deepak Narayanan , Dhruv Nathawani , Dima Rekesh , Dina Yared , Divyanshu Kakwani , Dong Ahn , Duncan Riach , Dusan Stosic , Edgar Minasyan , Edward Lin , Eileen Long , Eileen Peters Long , Elena Lantz , Ellie Evans , Elliott Ning , Eric Chung , Eric Harper , Eric Tramel , Erick Galinkin , Erik Pounds , Evan Briones , Evelina Bakhturina , Faisal Ladhak , Fay Wang , Fei Jia , Felipe Soares , Feng Chen , Ferenc Galko , Frankie Siino , Gal Hubara Agam , Ganesh Ajjanagadde , Gantavya Bhatt
URL: https://arxiv.org/abs/2512.20848
Abstract:

We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.

81. NotSoTiny: A Large, Living Benchmark for RTL Code Generation

Authors: Razine Moundir Ghorab , Emanuele Parisi , Cristian Gutierrez , Miquel Alberti-Binimelis , Miquel Moreto , Dario Garcia-Gasulla , Gokcen Kestor
URL: https://arxiv.org/abs/2512.20823
Abstract:

LLMs have shown early promise in generating RTL code, yet evaluating their capabilities in realistic setups remains a challenge. So far, RTL benchmarks have been limited in scale, skewed toward trivial designs, offering minimal verification rigor, and remaining vulnerable to data contamination. To overcome these limitations and to push the field forward, this paper introduces NotSoTiny, a benchmark that assesses LLM on the generation of structurally rich and context-aware RTL. Built from hundreds of actual hardware designs produced by the Tiny Tapeout community, our automated pipeline removes duplicates, verifies correctness and periodically incorporates new designs to mitigate contamination, matching Tiny Tapeout release schedule. Evaluation results show that NotSoTiny tasks are more challenging than prior benchmarks, emphasizing its effectiveness in overcoming current limitations of LLMs applied to hardware design, and in guiding the improvement of such promising technology.

82. MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

Authors: Zhan Qu , Michael Färber
URL: https://arxiv.org/abs/2512.20822
Abstract:

Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.

83. X-GridAgent: An LLM-Powered Agentic AI System for Assisting Power Grid Analysis

Authors: Yihan (Logon)Wen, Xin Chen
URL: https://arxiv.org/abs/2512.20789
Abstract:

The growing complexity of power system operations has created an urgent need for intelligent, automated tools to support reliable and efficient grid management. Conventional analysis tools often require significant domain expertise and manual effort, which limits their accessibility and adaptability. To address these challenges, this paper presents X-GridAgent, a novel large language model (LLM)-powered agentic AI system designed to automate complex power system analysis through natural language queries. The system integrates domain-specific tools and specialized databases under a three-layer hierarchical architecture comprising planning, coordination, and action layers. This architecture offers high flexibility and adaptability to previously unseen tasks, while providing a modular and extensible framework that can be readily expanded to incorporate new tools, data sources, or analytical capabilities. To further enhance performance, we introduce two novel algorithms: (1) LLM-driven prompt refinement with human feedback, and (2) schema-adaptive hybrid retrieval-augmented generation (RAG) for accurate information retrieval from large-scale structured grid datasets. Experimental evaluations across a variety of user queries and power grid cases demonstrate the effectiveness and reliability of X-GridAgent in automating interpretable and rigorous power system analysis.

84. NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts

Authors: Raja Mallina , Bryar Shareef
URL: https://arxiv.org/abs/2512.20783
Abstract:

Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.

85. Towards Optimal Performance and Action Consistency Guarantees in Dec-POMDPs with Inconsistent Beliefs and Limited Communication

Authors: Moshe Rafaeli Shimron , Vadim Indelman
URL: https://arxiv.org/abs/2512.20778
Abstract:

Multi-agent decision-making under uncertainty is fundamental for effective and safe autonomous operation. In many real-world scenarios, each agent maintains its own belief over the environment and must plan actions accordingly. However, most existing approaches assume that all agents have identical beliefs at planning time, implying these beliefs are conditioned on the same data. Such an assumption is often impractical due to limited communication. In reality, agents frequently operate with inconsistent beliefs, which can lead to poor coordination and suboptimal, potentially unsafe, performance. In this paper, we address this critical challenge by introducing a novel decentralized framework for optimal joint action selection that explicitly accounts for belief inconsistencies. Our approach provides probabilistic guarantees for both action consistency and performance with respect to open-loop multi-agent POMDP (which assumes all data is always communicated), and selectively triggers communication only when needed. Furthermore, we address another key aspect of whether, given a chosen joint action, the agents should share data to improve expected performance in inference. Simulation results show our approach outperforms state-of-the-art algorithms.

86. TS-Arena Technical Report – A Pre-registered Live Forecasting Platform

Authors: Marcel Meyer , Sascha Kaltenpoth , Kevin Zalipski , Henrik Albers , Oliver Müller
URL: https://arxiv.org/abs/2512.20761
Abstract:

While Time Series Foundation Models (TSFMs) offer transformative capabilities for forecasting, they simultaneously risk triggering a fundamental evaluation crisis. This crisis is driven by information leakage due to overlapping training and test sets across different models, as well as the illegitimate transfer of global patterns to test data. While the ability to learn shared temporal dynamics represents a primary strength of these models, their evaluation on historical archives often permits the exploitation of observed global shocks, which violates the independence required for valid benchmarking. We introduce TS-Arena, a platform that restores the operational integrity of forecasting by treating the genuinely unknown future as the definitive test environment. By implementing a pre-registration mechanism on live data streams, the platform ensures that evaluation targets remain physically non-existent during inference, thereby enforcing a strict global temporal split. This methodology establishes a moving temporal frontier that prevents historical contamination and provides an authentic assessment of model generalization. Initially applied within the energy sector, TS-Arena provides a sustainable infrastructure for comparing foundation models under real-world constraints. A prototype of the platform is available at this https URL .

87. Generalization of RLVR Using Causal Reasoning as a Testbed

Authors: Brian Lu , Hongyu Zhao , Shuo Sun , Hao Peng , Rui Ding , Hongyuan Mei
URL: https://arxiv.org/abs/2512.20760
Abstract:

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query – associational, interventional, or counterfactual – and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR’s effectiveness depends on the model’s initial reasoning competence. With sufficient initial competence, RLVR improves an LLM’s marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.

88. Bridging Efficiency and Safety: Formal Verification of Neural Networks with Early Exits

Authors: Yizhak Yisrael Elboher , Avraham Raviv , Amihay Elboher , Zhouxing Shi , Omri Azencot , Hillel Kugler , Guy Katz
URL: https://arxiv.org/abs/2512.20755
Abstract:

Ensuring the safety and efficiency of AI systems is a central goal of modern research. Formal verification provides guarantees of neural network robustness, while early exits improve inference efficiency by enabling intermediate predictions. Yet verifying networks with early exits introduces new challenges due to their conditional execution paths. In this work, we define a robustness property tailored to early exit architectures and show how off-the-shelf solvers can be used to assess it. We present a baseline algorithm, enhanced with an early stopping strategy and heuristic optimizations that maintain soundness and completeness. Experiments on multiple benchmarks validate our framework’s effectiveness and demonstrate the performance gains of the improved algorithm. Alongside the natural inference acceleration provided by early exits, we show that they also enhance verifiability, enabling more queries to be solved in less time compared to standard networks. Together with a robustness analysis, we show how these metrics can help users navigate the inherent trade-off between accuracy and efficiency.

89. Stabilizing Multimodal Autoencoders: A Theoretical and Empirical Analysis of Fusion Strategies

Authors: Diyar Altinses , Andreas Schwung
URL: https://arxiv.org/abs/2512.20749
Abstract:

In recent years, the development of multimodal autoencoders has gained significant attention due to their potential to handle multimodal complex data types and improve model performance. Understanding the stability and robustness of these models is crucial for optimizing their training, architecture, and real-world applicability. This paper presents an analysis of Lipschitz properties in multimodal autoencoders, combining both theoretical insights and empirical validation to enhance the training stability of these models. We begin by deriving the theoretical Lipschitz constants for aggregation methods within the multimodal autoencoder framework. We then introduce a regularized attention-based fusion method, developed based on our theoretical analysis, which demonstrates improved stability and performance during training. Through a series of experiments, we empirically validate our theoretical findings by estimating the Lipschitz constants across multiple trials and fusion strategies. Our results demonstrate that our proposed fusion function not only aligns with theoretical predictions but also outperforms existing strategies in terms of consistency, convergence speed, and accuracy. This work provides a solid theoretical foundation for understanding fusion in multimodal autoencoders and contributes a solution for enhancing their performance.

90. A Physics Informed Neural Network For Deriving MHD State Vectors From Global Active Regions Observations

Authors: Subhamoy Chatterjee , Mausumi Dikpati
URL: https://arxiv.org/abs/2512.20747
Abstract:

Solar active regions (ARs) do not appear randomly but cluster along longitudinally warped toroidal bands (‘toroids’) that encode information about magnetic structures in the tachocline, where global-scale organization likely originates. Global MagnetoHydroDynamic Shallow-Water Tachocline (MHD-SWT) models have shown potential to simulate such toroids, matching observations qualitatively. For week-scale early prediction of flare-producing AR emergence, forward-integration of these toroids is necessary. This requires model initialization with a dynamically self-consistent MHD state-vector that includes magnetic, flow fields, and shell-thickness variations. However, synoptic magnetograms provide only geometric shape of toroids, not the state-vector needed to initialize MHD-SWT models. To address this challenging task, we develop PINNBARDS, a novel Physics-Informed Neural Network (PINN)-Based AR Distribution Simulator, that uses observational toroids and MHD-SWT equations to derive initial state-vector. Using Feb-14-2024 SDO/HMI synoptic map, we show that PINN converges to physically consistent, predominantly antisymmetric toroids, matching observed ones. Although surface data provides north and south toroids’ central latitudes, and their latitudinal widths, they cannot determine tachocline field strengths, connected to AR emergence. We explore here solutions across a broad parameter range, finding hydrodynamically-dominated structures for weak fields (~2 kG) and overly rigid behavior for strong fields (~100 kG). We obtain best agreement with observations for 20-30 kG toroidal fields, and ~10 degree bandwidth, consistent with low-order longitudinal mode excitation. To our knowledge, PINNBARDS serves as the first method for reconstructing state-vectors for hidden tachocline magnetic structures from surface patterns; potentially leading to weeks ahead prediction of flare-producing AR-emergence.

91. AI-Driven Green Cognitive Radio Networks for Sustainable 6G Communication

Authors: Anshul Sharma , Shujaatali Badami , Biky Chouhan , Pushpanjali Pandey , Brijeena Rana , Navneet Kaur
URL: https://arxiv.org/abs/2512.20739
Abstract:

The 6G wireless aims at the Tb/s peak data rates are expected, a sub-millisecond latency, massive Internet of Things/vehicle connectivity, which requires sustainable access to audio over the air and energy-saving functionality. Cognitive Radio Networks CCNs help in alleviating the problem of spectrum scarcity, but classical sensing and allocation are still energy-consumption intensive, and sensitive to rapid spectrum variations. Our framework which centers on AI driven green CRN aims at integrating deep reinforcement learning (DRL) with transfer learning, energy harvesting (EH), reconfigurable intelligent surfaces (RIS) with other light-weight genetic refinement operations that optimally combine sensing timelines, transmit power, bandwidth distribution and RIS phase selection. Compared to two baselines, the utilization of MATLAB + NS-3 under dense loads, a traditional CRN with energy sensing under fixed policies, and a hybrid CRN with cooperative sensing under heuristic distribution of resource, there are (25-30%) fewer energy reserves used, sensing AUC greater than 0.90 and +6-13 p.p. higher PDR. The integrated framework is easily scalable to large IoT and vehicular applications, and it provides a feasible and sustainable roadmap to 6G CRNs. Index Terms–Cognitive Radio Networks (CRNs), 6G, Green Communication, Energy Efficiency, Deep Reinforcement Learning (DRL), Spectrum Sensing, RIS, Energy Harvesting, QoS, IoT.

92. FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

Authors: Saeed Mohammadzadeh , Erfan Hamdi , Joel Shor , Emma Lejeune
URL: https://arxiv.org/abs/2512.20732
Abstract:

As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured scientific reasoning evaluation. Problems follow clear mathematical structure, enforce strict physical and numerical constraints, and support objective verification. The discipline requires constructing explicit models of physical systems and reasoning about geometry, spatial relationships, and material behavior, connecting directly to emerging AI goals in physical reasoning and world modeling. We introduce FEM-Bench, a computational mechanics benchmark designed to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. FEM-Bench 2025 contains a suite of introductory but nontrivial tasks aligned with material from a first graduate course on computational mechanics. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. Despite their simplicity, state-of-the-art LLMs do not reliably solve all of them. In a five attempt run, the best performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times. The best performing model at unit test writing, GPT-5, had an Average Joint Success Rate of 73.8%. Other popular models showed broad performance variation. FEM-Bench establishes a structured foundation for evaluating AI-generated scientific code, and future iterations will incorporate increasingly sophisticated tasks to track progress as models evolve.

93. SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention

Authors: Alexandros Christoforos , Chadbourne Davis
URL: https://arxiv.org/abs/2512.20724
Abstract:

Diffusion based approaches to long form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state of the art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long form applications such as scientific writing, large scale code generation, and multi turn long context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long text generation.

94. Mechanism-Based Intelligence (MBI): Differentiable Incentives for Rational Coordination and Guaranteed Alignment in Multi-Agent Systems

Authors: Stefano Grassi
URL: https://arxiv.org/abs/2512.20688
Abstract:

Autonomous multi-agent systems are fundamentally fragile: they struggle to solve the Hayekian Information problem (eliciting dispersed private knowledge) and the Hurwiczian Incentive problem (aligning local actions with global objectives), making coordination computationally intractable. I introduce Mechanism-Based Intelligence (MBI), a paradigm that reconceptualizes intelligence as emergent from the coordination of multiple “brains”, rather than a single one. At its core, the Differentiable Price Mechanism (DPM) computes the exact loss gradient $\mathbf{G}_i = - \frac{\partial \mathcal{L}}{\partial \mathbf{x}_i}$ as a dynamic, VCG-equivalent incentive signal, guaranteeing Dominant Strategy Incentive Compatibility (DSIC) and convergence to the global optimum. A Bayesian extension ensures incentive compatibility under asymmetric information (BIC). The framework scales linearly ($\mathcal{O}(N)$) with the number of agents, bypassing the combinatorial complexity of Dec-POMDPs and is empirically 50x faster than Model-Free Reinforcement Learning. By structurally aligning agent self-interest with collective objectives, it provides a provably efficient, auditable and generalizable approach to coordinated, trustworthy and scalable multi-agent intelligence grounded in economic principles.

95. PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation

Authors: Yuma Ichikawa , Naoya Takagi , Takumi Nakagawa , Yuzi Kanazawa , Akira Sakai
URL: https://arxiv.org/abs/2512.20687
Abstract:

Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes dominate inference throughput rather than arithmetic computation. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. Experimental results show that PHOTON is superior to competitive Transformer-based language models regarding the throughput-quality trade-off, offering significant advantages in long-context and multi-query tasks. This reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory.

96. Revisiting the Learning Objectives of Vision-Language Reward Models

Authors: Simon Roy , Samuel Barbeau , Giovanni Beltrame , Christian Desrosiers , Nicolas Thome
URL: https://arxiv.org/abs/2512.20675
Abstract:

Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with ground truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the improvements in recent approaches could be attributed to differences in data and architectures.

97. HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model

Authors: Yuanhao Xi , Xiaohuan Bing , Ramin Yahyapour
URL: https://arxiv.org/abs/2512.20674
Abstract:

Vision Language Models (VLMs) have undergone significant advancements, particularly with the emergence of mobile-oriented VLMs, which offer a wide range of application scenarios. However, the substantial computational requirements for training these models present a significant obstacle to their practical application. To address this issue, Low-Rank Adaptation (LoRA) has been proposed. Nevertheless, the standard LoRA with a fixed rank lacks sufficient capability for training mobile VLMs that process both text and image modalities. In this work, we introduce HyDRA, a parameter-efficient fine-tuning framework designed to implement hierarchical and dynamic rank scheduling for mobile VLMs. This framework incorporates two essential optimization strategies: (1) hierarchical optimization, which involves a coarse-grained approach that assigns different ranks to various layers, as well as a fine-grained method that adjusts ranks within individual layers, and (2) dynamic adjustment, which employs an end-to-end automatic optimization using a lightweight performance model to determine and adjust ranks during the fine-tuning process. Comprehensive experiments conducted on popular benchmarks demonstrate that HyDRA consistently outperforms the baseline, achieving a 4.7\% improvement across various model sizes without increasing the number of trainable parameters. In some tasks, it even surpasses full-parameter fine-tuning.

98. Disentangling Fact from Sentiment: A Dynamic Conflict-Consensus Framework for Multimodal Fake News Detection

Authors: Weilin Zhou , Zonghao Ying , Junjie Mu , Shengwei Tian , Quanchen Zou , Deyue Zhang , Dongdong Yang , Xiangzheng Zhang
URL: https://arxiv.org/abs/2512.20670
Abstract:

Prevalent multimodal fake news detection relies on consistency-based fusion, yet this paradigm fundamentally misinterprets critical cross-modal discrepancies as noise, leading to over-smoothing, which dilutes critical evidence of fabrication. Mainstream consistency-based fusion inherently minimizes feature discrepancies to align modalities, yet this approach fundamentally fails because it inadvertently smoothes out the subtle cross-modal contradictions that serve as the primary evidence of fabrication. To address this, we propose the Dynamic Conflict-Consensus Framework (DCCF), an inconsistency-seeking paradigm designed to amplify rather than suppress contradictions. First, DCCF decouples inputs into independent Fact and Sentiment spaces to distinguish objective mismatches from emotional dissonance. Second, we employ physics-inspired feature dynamics to iteratively polarize these representations, actively extracting maximally informative conflicts. Finally, a conflict-consensus mechanism standardizes these local discrepancies against the global context for robust deliberative this http URL experiments conducted on three real world datasets demonstrate that DCCF consistently outperforms state-of-the-art baselines, achieving an average accuracy improvement of 3.52\%.

99. Improving Cardiac Risk Prediction Using Data Generation Techniques

Authors: Alexandre Cabodevila , Pedro Gamallo-Fernandez , Juan C. Vidal , Manuel Lama
URL: https://arxiv.org/abs/2512.20669
Abstract:

Cardiac rehabilitation constitutes a structured clinical process involving multiple interdependent phases, individualized medical decisions, and the coordinated participation of diverse healthcare professionals. This sequential and adaptive nature enables the program to be modeled as a business process, thereby facilitating its analysis. Nevertheless, studies in this context face significant limitations inherent to real-world medical databases: data are often scarce due to both economic costs and the time required for collection; many existing records are not suitable for specific analytical purposes; and, finally, there is a high prevalence of missing values, as not all patients undergo the same diagnostic tests. To address these limitations, this work proposes an architecture based on a Conditional Variational Autoencoder (CVAE) for the synthesis of realistic clinical records that are coherent with real-world observations. The primary objective is to increase the size and diversity of the available datasets in order to enhance the performance of cardiac risk prediction models and to reduce the need for potentially hazardous diagnostic procedures, such as exercise stress testing. The results demonstrate that the proposed architecture is capable of generating coherent and realistic synthetic data, whose use improves the accuracy of the various classifiers employed for cardiac risk detection, outperforming state-of-the-art deep learning approaches for synthetic data generation.

100. Forward Only Learning for Orthogonal Neural Networks of any Depth

Authors: Paul Caillon , Alex Colagrande , Erwan Fagnou , Blaise Delattre , Alexandre Allauzen
URL: https://arxiv.org/abs/2512.20668
Abstract:

Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm also becomes a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they failed to scale up to a handful of hidden layers, yet limiting their use. In this paper, we first analyze theoretically the main limitations of these approaches. It allows us the design of a forward-only algorithm, which is equivalent to backpropagation under the linear and orthogonal assumptions. By relaxing the linear assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks) that bridges the gap with the backpropagation algorithm. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth, without the need for a backward pass. Moreover its performance on convolutional networks clearly opens up avenues for its application to more advanced architectures. The code is open-sourced at this https URL .

101. Dominating vs. Dominated: Generative Collapse in Diffusion Models

Authors: Hayeon Jeong , Jong-Seok Lee
URL: https://arxiv.org/abs/2512.20666
Abstract:

Text-to-image diffusion models have drawn significant attention for their ability to generate diverse and high-fidelity images. However, when generating from multi-concept prompts, one concept token often dominates the generation, suppressing the others-a phenomenon we term the Dominant-vs-Dominated (DvD) imbalance. To systematically analyze this imbalance, we introduce DominanceBench and examine its causes from both data and architectural perspectives. Through various experiments, we show that the limited instance diversity in training data exacerbates the inter-concept interference. Analysis of cross-attention dynamics further reveals that dominant tokens rapidly saturate attention, progressively suppressing others across diffusion timesteps. In addition, head ablation studies show that the DvD behavior arises from distributed attention mechanisms across multiple heads. Our findings provide key insights into generative collapse, advancing toward more reliable and controllable text-to-image generation.

102. Managing the Stochastic: Foundations of Learning in Neuro-Symbolic Systems for Software Engineering

Authors: Matthew Thompson
URL: https://arxiv.org/abs/2512.20660
Abstract:

Current approaches to AI coding agents appear to blur the lines between the Large Language Model (LLM) and the agent itself, asking the LLM to make decisions best left to deterministic processes. This leads to systems prone to stochastic failures such as gaming unit tests or hallucinating syntax. Drawing on established software engineering practices that provide deterministic frameworks for managing unpredictable processes, this paper proposes setting the control boundary such that the LLM is treated as a component of the environment environment – preserving its creative stochasticity – rather than the decision-making agent. A \textbf{Dual-State Architecture} is formalized, separating workflow state (deterministic control flow) from environment state (stochastic generation). \textbf{Atomic Action Pairs} couple generation with verification as indivisible transactions, where \textbf{Guard Functions} act as sensing actions that project probabilistic outputs onto observable workflow state. The framework is validated on three code generation tasks across 13 LLMs (1.3B–15B parameters). For qualified instruction-following models, task success rates improved by up to 66 percentage points at 1.2–2.1$\times$ baseline computational cost. The results suggest that architectural constraints can substitute for parameter scale in achieving reliable code generation.

103. MaskOpt: A Large-Scale Mask Optimization Dataset to Advance AI in Integrated Circuit Manufacturing

Authors: Yuting Hu , Lei Zhuang , Hua Xiang , Jinjun Xiong , Gi-Joon Nam
URL: https://arxiv.org/abs/2512.20655
Abstract:

As integrated circuit (IC) dimensions shrink below the lithographic wavelength, optical lithography faces growing challenges from diffraction and process variability. Model-based optical proximity correction (OPC) and inverse lithography technique (ILT) remain indispensable but computationally expensive, requiring repeated simulations that limit scalability. Although deep learning has been applied to mask optimization, existing datasets often rely on synthetic layouts, disregard standard-cell hierarchy, and neglect the surrounding contexts around the mask optimization targets, thereby constraining their applicability to practical mask optimization. To advance deep learning for cell- and context-aware mask optimization, we present MaskOpt, a large-scale benchmark dataset constructed from real IC designs at the 45$\mathrm{nm}$ node. MaskOpt includes 104,714 metal-layer tiles and 121,952 via-layer tiles. Each tile is clipped at a standard-cell placement to preserve cell information, exploiting repeated logic gate occurrences. Different context window sizes are supported in MaskOpt to capture the influence of neighboring shapes from optical proximity effects. We evaluate state-of-the-art deep learning models for IC mask optimization to build up benchmarks, and the evaluation results expose distinct trade-offs across baseline models. Further context size analysis and input ablation studies confirm the importance of both surrounding geometries and cell-aware inputs in achieving accurate mask generation.

104. Forecasting N-Body Dynamics: A Comparative Study of Neural Ordinary Differential Equations and Universal Differential Equations

Authors: Suriya R S , Prathamesh Dinesh Joshi , Rajat Dandekar , Raj Dandekar , Sreedath Panat
URL: https://arxiv.org/abs/2512.20643
Abstract:

The n body problem, fundamental to astrophysics, simulates the motion of n bodies acting under the effect of their own mutual gravitational interactions. Traditional machine learning models that are used for predicting and forecasting trajectories are often data intensive black box models, which ignore the physical laws, thereby lacking interpretability. Whereas Scientific Machine Learning ( Scientific ML ) directly embeds the known physical laws into the machine learning framework. Through robust modelling in the Julia programming language, our method uses the Scientific ML frameworks: Neural ordinary differential equations (NODEs) and Universal differential equations (UDEs) to predict and forecast the system dynamics. In addition, an essential component of our analysis involves determining the forecasting breakdown point, which is the smallest possible amount of training data our models need to predict future, unseen data accurately. We employ synthetically created noisy data to simulate real-world observational limitations. Our findings indicate that the UDE model is much more data efficient, needing only 20% of data for a correct forecast, whereas the Neural ODE requires 90%.

105. Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Authors: Matyas Bohacek , Nino Scherrer , Nicholas Dufour , Thomas Leung , Christoph Bregler , Stephanie C. Y. Chan
URL: https://arxiv.org/abs/2512.20638
Abstract:

The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak (“model gaps”) and (ii) imbalanced coverage in the benchmarks themselves (“benchmark gaps”). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model’s internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at this https URL .

106. Data-Free Pruning of Self-Attention Layers in LLMs

Authors: Dhananjay Saikumar , Blesson Varghese
URL: https://arxiv.org/abs/2512.20636
Abstract:

Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation. We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query–key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning $8$–$16$ attention sublayers yields up to $1.30\times$ higher inference throughput while keeping average zero-shot accuracy within $2\%$ of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA. Across these settings, Gate-Norm matches data-driven pruning methods in accuracy while being $\sim 1000\times$ faster to score layers, enabling practical, data-free compression of LLMs.

107. Real Time Detection and Quantitative Analysis of Spurious Forgetting in Continual Learning

Authors: Weiwei Wang
URL: https://arxiv.org/abs/2512.20634
Abstract:

Catastrophic forgetting remains a fundamental challenge in continual learning for large language models. Recent work revealed that performance degradation may stem from spurious forgetting caused by task alignment disruption rather than true knowledge loss. However, this work only qualitatively describes alignment, relies on post-hoc analysis, and lacks automatic distinction mechanisms. We introduce the shallow versus deep alignment framework, providing the first quantitative characterization of alignment depth. We identify that current task alignment approaches suffer from shallow alignment - maintained only over the first few output tokens (approximately 3-5) - making models vulnerable to forgetting. This explains why spurious forgetting occurs, why it is reversible, and why fine-tuning attacks are effective. We propose a comprehensive framework addressing all gaps: (1) quantitative metrics (0-1 scale) to measure alignment depth across token positions; (2) real-time detection methods for identifying shallow alignment during training; (3) specialized analysis tools for visualization and recovery prediction; and (4) adaptive mitigation strategies that automatically distinguish forgetting types and promote deep alignment. Extensive experiments on multiple datasets and model architectures (Qwen2.5-3B to Qwen2.5-32B) demonstrate 86.2-90.6% identification accuracy and show that promoting deep alignment improves robustness against forgetting by 3.3-7.1% over baselines.

108. Enhancing Lung Cancer Treatment Outcome Prediction through Semantic Feature Engineering Using Large Language Models

Authors: MunHwan Lee , Shaika Chowdhury , Xiaodi Li , Sivaraman Rajaganapathy , Eric W Klee , Ping Yang , Terence Sio , Liewei Wang , James Cerhan , Nansu NA Zong
URL: https://arxiv.org/abs/2512.20633
Abstract:

Accurate prediction of treatment outcomes in lung cancer remains challenging due to the sparsity, heterogeneity, and contextual overload of real-world electronic health data. Traditional models often fail to capture semantic information across multimodal streams, while large-scale fine-tuning approaches are impractical in clinical workflows. We introduce a framework that uses Large Language Models (LLMs) as Goal-oriented Knowledge Curators (GKC) to convert laboratory, genomic, and medication data into high-fidelity, task-aligned features. Unlike generic embeddings, GKC produces representations tailored to the prediction objective and operates as an offline preprocessing step that integrates naturally into hospital informatics pipelines. Using a lung cancer cohort (N=184), we benchmarked GKC against expert-engineered features, direct text embeddings, and an end-to-end transformer. Our approach achieved a mean AUROC of 0.803 (95% CI: 0.799-0.807) and outperformed all baselines. An ablation study further confirmed the complementary value of combining all three modalities. These results show that the quality of semantic representation is a key determinant of predictive accuracy in sparse clinical data settings. By reframing LLMs as knowledge curation engines rather than black-box predictors, this work demonstrates a scalable, interpretable, and workflow-compatible pathway for advancing AI-driven decision support in oncology.

Authors: Aayam Bansal , Ishaan Gangwani
URL: https://arxiv.org/abs/2512.20631
Abstract:

We present a comprehensive zero-training temporal drift analysis of transformer-based sentiment models validated on authentic social media data from major real-world events. Through systematic evaluation across three transformer architectures and rigorous statistical validation on 12,279 authentic social media posts, we demonstrate significant model instability with accuracy drops reaching 23.4% during event-driven periods. Our analysis reveals maximum confidence drops of 13.0% (Bootstrap 95% CI: [9.1%, 16.5%]) with strong correlation to actual performance degradation. We introduce four novel drift metrics that outperform embedding-based baselines while maintaining computational efficiency suitable for production deployment. Statistical validation across multiple events confirms robust detection capabilities with practical significance exceeding industry monitoring thresholds. This zero-training methodology enables immediate deployment for real-time sentiment monitoring systems and provides new insights into transformer model behavior during dynamic content periods.

110. Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning

Authors: Wenlong Tang
URL: https://arxiv.org/abs/2512.20629
Abstract:

This study proposes a multi-agent language framework that enables continual strategy evolution without fine-tuning the language model’s parameters. The core idea is to liberate the latent vectors of abstract concepts from traditional static semantic representations, allowing them to be continuously updated through environmental interaction and reinforcement feedback. We construct a dual-loop architecture: the behavior loop adjusts action preferences based on environmental rewards, while the language loop updates the external latent vectors by reflecting on the semantic embeddings of generated text. Together, these mechanisms allow agents to develop stable and disentangled strategic styles over long-horizon multi-round interactions. Experiments show that agents’ latent spaces exhibit clear convergence trajectories under reflection-driven updates, along with structured shifts at critical moments. Moreover, the system demonstrates an emergent ability to implicitly infer and continually adapt to emotional agents, even without shared rewards. These results indicate that, without modifying model parameters, an external latent space can provide language agents with a low-cost, scalable, and interpretable form of abstract strategic representation.

111. Efficient Asynchronous Federated Evaluation with Strategy Similarity Awareness for Intent-Based Networking in Industrial Internet of Things

Authors: Shaowen Qin , Jianfeng Zeng , Haodong Guo , Xiaohuan Li , Jiawen Kang , Qian Chen , Dusit Niyato
URL: https://arxiv.org/abs/2512.20627
Abstract:

Intent-Based Networking (IBN) offers a promising paradigm for intelligent and automated network control in Industrial Internet of Things (IIoT) environments by translating high-level user intents into executable network strategies. However, frequent strategy deployment and rollback are impractical in real-world IIoT systems due to tightly coupled workflows and high downtime costs, while the heterogeneity and privacy constraints of IIoT nodes further complicate centralized policy verification. To address these challenges, we propose FEIBN, a Federated Evaluation Enhanced Intent-Based Networking framework. FEIBN leverages large language models (LLMs) to align multimodal user intents into structured strategy tuples and employs federated learning to perform distributed policy verification across IIoT nodes without exposing raw data. To improve training efficiency and reduce communication overhead, we design SSAFL, a Strategy Similarity Aware Federated Learning mechanism that selects task-relevant nodes based on strategy similarity and resource status, and triggers asynchronous model uploads only when updates are significant. Experiments demonstrate that SSAFL can improve model accuracy, accelerate model convergence, and reduce the cost by 27.8% compared with SemiAsyn.

112. Parameter-Efficient Neural CDEs via Implicit Function Jacobians

Authors: Ilya Kuleshov , Alexey Zaytsev
URL: https://arxiv.org/abs/2512.20625
Abstract:

Neural Controlled Differential Equations (Neural CDEs, NCDEs) are a unique branch of methods, specifically tailored for analysing temporal sequences. However, they come with drawbacks, the main one being the number of parameters, required for the method’s operation. In this paper, we propose an alternative, parameter-efficient look at Neural CDEs. It requires much fewer parameters, while also presenting a very logical analogy as the “Continuous RNN”, which the Neural CDEs aspire to.

113. Cooperation Through Indirect Reciprocity in Child-Robot Interactions

Authors: Isabel Neto , Alexandre S. Pires , Filipa Correia , Fernando P. Santos
URL: https://arxiv.org/abs/2512.20621
Abstract:

Social interactions increasingly involve artificial agents, such as conversational or collaborative bots. Understanding trust and prosociality in these settings is fundamental to improve human-AI teamwork. Research in biology and social sciences has identified mechanisms to sustain cooperation among humans. Indirect reciprocity (IR) is one of them. With IR, helping someone can enhance an individual’s reputation, nudging others to reciprocate in the future. Transposing IR to human-AI interactions is however challenging, as differences in human demographics, moral judgements, and agents’ learning dynamics can affect how interactions are assessed. To study IR in human-AI groups, we combine laboratory experiments and theoretical modelling. We investigate whether 1) indirect reciprocity can be transposed to children-robot interactions; 2) artificial agents can learn to cooperate given children’s strategies; and 3) how differences in learning algorithms impact human-AI cooperation. We find that IR extends to children and robots solving coordination dilemmas. Furthermore, we observe that the strategies revealed by children provide a sufficient signal for multi-armed bandit algorithms to learn cooperative actions. Beyond the experimental scenarios, we observe that cooperating through multi-armed bandit algorithms is highly dependent on the strategies revealed by humans.

114. Inspection Planning Primitives with Implicit Models

Authors: Jingyang You , Hanna Kurniawati , Lashika Medagoda
URL: https://arxiv.org/abs/2510.07611
Abstract:

The aging and increasing complexity of infrastructures make efficient inspection planning more critical in ensuring safety. Thanks to sampling-based motion planning, many inspection planners are fast. However, they often require huge memory. This is particularly true when the structure under inspection is large and complex, consisting of many struts and pillars of various geometry and sizes. Such structures can be represented efficiently using implicit models, such as neural Signed Distance Functions (SDFs). However, most primitive computations used in sampling-based inspection planner have been designed to work efficiently with explicit environment models, which in turn requires the planner to use explicit environment models or performs frequent transformations between implicit and explicit environment models during planning. This paper proposes a set of primitive computations, called Inspection Planning Primitives with Implicit Models (IPIM), that enable sampling-based inspection planners to entirely use neural SDFs representation during planning. Evaluation on three scenarios, including inspection of a complex real-world structure with over 92M triangular mesh faces, indicates that even a rudimentary sampling-based planner with IPIM can generate inspection trajectories of similar quality to those generated by the state-of-the-art planner, while using up to 70x less memory than the state-of-the-art inspection planner.

전체 AI 논문 - 2025-12-26

1. RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

2. A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care

3. Beyond Context: Large Language Models Failure to Grasp Users Intent

4. LLM Personas as a Substitute for Field Experiments in Method Benchmarking

5. Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation

6. TrafficSimAgent: A Hierarchical Agent Framework for Autonomous Traffic Simulation with MCP Control

7. FinAgent: An Agentic AI Framework Integrating Personal Finance and Nutrition Planning

8. A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

9. The Silent Scholar Problem: A Probabilistic Framework for Breaking Epistemic Asymmetry in LLM Agents

10. MAR:Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

11. Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

12. Safety Alignment of LMs via Non-cooperative Games

13. A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

14. AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

15. From artificial to organic: Rethinking the roots of intelligence for digital health

16. From Pilots to Practices: A Scoping Review of GenAI-Enabled Personalization in Computer Science Education

17. Bridging the AI Trustworthiness Gap between Functions and Norms

18. Eidoku: A Neuro-Symbolic Verification Gate for LLM Reasoning via Structural Constraint Satisfaction

19. Quantifying Laziness, Decoding Suboptimality, and Context Degradation in Large Language Models

20. From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers

21. AI-Driven Decision-Making System for Hiring Process

22. Memory Bear AI A Breakthrough from Memory to Cognition Toward Artificial General Intelligence

23. Mixture of Attention Schemes (MoAS): Learning to Route Between MHA, GQA, and MQA

24. AIAuditTrack: A Framework for AI Security system

25. Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning

26. Erkang-Diagnosis-1.1 Technical Report

27. MicroProbe: Efficient Reliability Assessment for Foundation Models with Minimal Data

28. Proceedings of the 20th International Conference on Knowledge, Information and Creativity Support Systems (KICSS 2025)

29. MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation

30. Quantum-Inspired Multi Agent Reinforcement Learning for Exploration Exploitation Optimization in UAV-Assisted 6G Network Deployment

31. BitRL-Light: 1-bit LLM Agents with Deep Reinforcement Learning for Energy-Efficient Smart Home Lighting Optimization

32. Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

33. C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

34. Measuring all the noises of LLM Evals

35. Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks

36. Model Merging via Multi-Teacher Knowledge Distillation

37. SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

38. Learning Factors in AI-Augmented Education: A Comparative Study of Middle and High School Students

39. LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

40. Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks

41. Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

42. PhononBench:A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

43. Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

44. SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

45. Schrödinger’s Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

46. BALLAST: Bandit-Assisted Learning for Latency-Aware Stable Timeouts in Raft

47. MODE: Multi-Objective Adaptive Coreset Selection

48. TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation

49. AutoBaxBuilder: Bootstrapping Code Security Benchmarking

50. STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting

51. Semi-Supervised Learning for Large Language Models Safety and Content Moderation

52. Semantic Refinement with LLMs for Graph Representations

53. TexAvatars : Hybrid Texel-3D Representations for Stable Rigging of Photorealistic Gaussian Head Avatars

54. Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics

55. DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

56. Policy-Conditioned Policies for Multi-Agent Task Solving

57. Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

58. LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

59. Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

60. Automatic Replication of LLM Mistakes in Medical Conversations

61. GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

62. Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

63. Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality

64. Can Agentic AI Match the Performance of Human Data Scientists?

65. ReACT-Drug: Reaction-Template Guided Reinforcement Learning for de novo Drug Design

66. One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

67. Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

68. MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment

69. Neural Probe-Based Hallucination Detection for Large Language Models

70. A Multi-fidelity Double-Delta Wing Dataset and Empirical Scaling Laws for GNN-based Aerodynamic Field Surrogate

71. Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

72. Guardrailed Elasticity Pricing: A Churn-Aware Forecasting Playbook for Subscription Strategy

73. RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks

74. DiEC: Diffusion Embedded Clustering

75. Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization and Task Offloading with Mobility Prediction

76. DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction

77. Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images

78. Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs

79. NVIDIA Nemotron 3: Efficient and Open Intelligence