전체 AI 논문 - 2025-07-29
1. A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
- Authors: Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
- URL: https://arxiv.org/abs/2507.21046
- Abstract:
Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift – from scaling static models to developing self-evolving agents – has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions – what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding lights to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.
2. GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
- Authors: Haoyang Liu, Yijiang Li, Haohan Wang
- URL: https://arxiv.org/abs/2507.21035
- Abstract:
Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at this https URL.
3. Smart Expansion Techniques for ASP-based Interactive Configuration
- Authors: Lucia Balážová, Richard Comploi-Taupe, Susana Hahn, Nicolas Rühling, Gottfried Schenner
- URL: https://arxiv.org/abs/2507.21027
- Abstract:
Product configuration is a successful application of Answer Set Programming (ASP). However, challenges are still open for interactive systems to effectively guide users through the configuration process. The aim of our work is to provide an ASP-based solver for interactive configuration that can deal with large-scale industrial configuration problems and that supports intuitive user interfaces via an API. In this paper, we focus on improving the performance of automatically completing a partial configuration. Our main contribution enhances the classical incremental approach for multi-shot solving by four different smart expansion functions. The core idea is to determine and add specific objects or associations to the partial configuration by exploiting cautious and brave consequences before checking for the existence of a complete configuration with the current objects in each iteration. This approach limits the number of costly unsatisfiability checks and reduces the search space, thereby improving solving performance. In addition, we present a user interface that uses our API and is implemented in ASP.
4. MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them
- Authors: Weichen Zhang, Yiyou Sun, Pohao Huang, Jiayue Pu, Heyue Lin, Dawn Song
- URL: https://arxiv.org/abs/2507.21017
- Abstract:
Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench–Measuring Illusions in Risky AGEnt settings–the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision points in deterministic and reproducible manners. To evaluate hallucination behaviors, we adopt a fine-grained-level LLM-as-a-Judge paradigm with tailored risk-aware prompts, enabling scalable, high-fidelity assessment of agent actions without enumerating full action spaces. MIRAGE-Bench provides actionable insights on failure modes of LLM agents and lays the groundwork for principled progress in mitigating hallucinations in interactive environments.
5. Core Safety Values for Provably Corrigible Agents
- Authors: Aran Nayebi
- URL: https://arxiv.org/abs/2507.20964
- Abstract:
We introduce the first implementable framework for corrigibility, with provable guarantees in multi-step, partially observed environments. Our framework replaces a single opaque reward with five structurally separate utility heads – deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward – combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is \emph{learned} to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating \emph{any} safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits dominate even when incentives conflict. For open-ended settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon ``decidable island’’ where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs. Consequently, the remaining challenge is the ordinary ML task of data coverage and generalization: reward-hacking risk is pushed into evaluation quality rather than hidden incentive leak-through, giving clearer implementation guidance for today’s LLM assistants and future autonomous systems.
6. On the Limits of Hierarchically Embedded Logic in Classical Neural Networks
- Authors: Bill Cochran
- URL: https://arxiv.org/abs/2507.20960
- Abstract:
We propose a formal model of reasoning limitations in large neural net models for language, grounded in the depth of their neural architecture. By treating neural networks as linear operators over logic predicate space we show that each layer can encode at most one additional level of logical reasoning. We prove that a neural network of depth a particular depth cannot faithfully represent predicates in a one higher order logic, such as simple counting over complex predicates, implying a strict upper bound on logical expressiveness. This structure induces a nontrivial null space during tokenization and embedding, excluding higher-order predicates from representability. Our framework offers a natural explanation for phenomena such as hallucination, repetition, and limited planning, while also providing a foundation for understanding how approximations to higher-order logic may emerge. These results motivate architectural extensions and interpretability strategies in future development of language models.
7. Partially Observable Monte-Carlo Graph Search
- Authors: Yang You, Vincent Thomas, Alex Schutz, Robert Skilton, Nick Hawes, Olivier Buffet
- URL: https://arxiv.org/abs/2507.20951
- Abstract:
Currently, large partially observable Markov decision processes (POMDPs) are often solved by sampling-based online methods which interleave planning and execution phases. However, a pre-computed offline policy is more desirable in POMDP applications with time or energy constraints. But previous offline algorithms are not able to scale up to large POMDPs. In this article, we propose a new sampling-based algorithm, the partially observable Monte-Carlo graph search (POMCGS) to solve large POMDPs offline. Different from many online POMDP methods, which progressively develop a tree while performing (Monte-Carlo) simulations, POMCGS folds this search tree on the fly to construct a policy graph, so that computations can be drastically reduced, and users can analyze and validate the policy prior to embedding and executing it. Moreover, POMCGS, together with action progressive widening and observation clustering methods provided in this article, is able to address certain continuous POMDPs. Through experiments, we demonstrate that POMCGS can generate policies on the most challenging POMDPs, which cannot be computed by previous offline algorithms, and these policies’ values are competitive compared with the state-of-the-art online POMDP algorithms.
8. MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
- Authors: Xueyao Wan, Hang Yu
- URL: https://arxiv.org/abs/2507.20804
- Abstract:
Retrieval-Augmented Generation (RAG) enhances language model generation by retrieving relevant information from external knowledge bases. However, conventional RAG methods face the issue of missing multimodal information. Multimodal RAG methods address this by fusing images and text through mapping them into a shared embedding space, but they fail to capture the structure of knowledge and logical chains between modalities. Moreover, they also require large-scale training for specific tasks, resulting in limited generalizing ability. To address these limitations, we propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph (MMKG) in conjunction with text-based KG. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets, demonstrating strong domain adaptability and clear reasoning paths.
9. evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments
- Authors: Fatou Ndiaye Mbodji
- URL: https://arxiv.org/abs/2507.20774
- Abstract:
Smart contract comment generation has gained traction as a means to improve code comprehension and maintainability in blockchain systems. However, evaluating the quality of generated comments remains a challenge. Traditional metrics such as BLEU and ROUGE fail to capture domain-specific nuances, while human evaluation is costly and unscalable. In this paper, we present \texttt{evalSmarT}, a modular and extensible framework that leverages large language models (LLMs) as evaluators. The system supports over 400 evaluator configurations by combining approximately 40 LLMs with 10 prompting strategies. We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs. Our results show that prompt design significantly impacts alignment with human judgment, and that LLM-based evaluation offers a scalable and semantically rich alternative to existing methods.
10. How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation
- Authors: Hao Yang, Qinghua Zhao, Lei Li
- URL: https://arxiv.org/abs/2507.20758
- Abstract:
Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT’s operational principles by reversely tracing information flow across decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We released our code and data at this https URL.
11. Beyond Listenership: AI-Predicted Interventions Drive Improvements in Maternal Health Behaviours
- Authors: Arpan Dasgupta, Sarvesh Gharat, Neha Madhiwalla, Aparna Hegde, Milind Tambe, Aparna Taneja
- URL: https://arxiv.org/abs/2507.20755
- Abstract:
Automated voice calls with health information are a proven method for disseminating maternal and child health information among beneficiaries and are deployed in several programs around the world. However, these programs often suffer from beneficiary dropoffs and poor engagement. In previous work, through real-world trials, we showed that an AI model, specifically a restless bandit model, could identify beneficiaries who would benefit most from live service call interventions, preventing dropoffs and boosting engagement. However, one key question has remained open so far: does such improved listenership via AI-targeted interventions translate into beneficiaries’ improved knowledge and health behaviors? We present a first study that shows not only listenership improvements due to AI interventions, but also simultaneously links these improvements to health behavior changes. Specifically, we demonstrate that AI-scheduled interventions, which enhance listenership, lead to statistically significant improvements in beneficiaries’ health behaviors such as taking iron or calcium supplements in the postnatal period, as well as understanding of critical health topics during pregnancy and infancy. This underscores the potential of AI to drive meaningful improvements in maternal and child health.
12. Learning the Value Systems of Societies from Preferences
- Authors: Andrés Holgado-Sánchez, Holger Billhardt, Sascha Ossowski, Sara Degli-Esposti
- URL: https://arxiv.org/abs/2507.20728
- Abstract:
Aligning AI systems with human values and the value-based preferences of various stakeholders (their value systems) is key in ethical AI. In value-aware AI systems, decision-making draws upon explicit computational representations of individual values (groundings) and their aggregation into value systems. As these are notoriously difficult to elicit and calibrate manually, value learning approaches aim to automatically derive computational models of an agent’s values and value system from demonstrations of human behaviour. Nonetheless, social science and humanities literature suggest that it is more adequate to conceive the value system of a society as a set of value systems of different groups, rather than as the simple aggregation of individual value systems. Accordingly, here we formalize the problem of learning the value systems of societies and propose a method to address it based on heuristic deep clustering. The method learns socially shared value groundings and a set of diverse value systems representing a given society by observing qualitative value-based preferences from a sample of agents. We evaluate the proposal in a use case with real data about travelling decisions.
13. Algorithmic Fairness: A Runtime Perspective
- Authors: Filip Cano, Thomas A. Henzinger, Konstantin Kueffner
- URL: https://arxiv.org/abs/2507.20711
- Abstract:
Fairness in AI is traditionally studied as a static property evaluated once, over a fixed dataset. However, real-world AI systems operate sequentially, with outcomes and environments evolving over time. This paper proposes a framework for analysing fairness as a runtime property. Using a minimal yet expressive model based on sequences of coin tosses with possibly evolving biases, we study the problems of monitoring and enforcing fairness expressed in either toss outcomes or coin biases. Since there is no one-size-fits-all solution for either problem, we provide a summary of monitoring and enforcement strategies, parametrised by environment dynamics, prediction horizon, and confidence thresholds. For both problems, we present general results under simple or minimal assumptions. We survey existing solutions for the monitoring problem for Markovian and additive dynamics, and existing solutions for the enforcement problem in static settings with known dynamics.
14. A General Framework for Dynamic MAPF using Multi-Shot ASP and Tunnels
- Authors: Aysu Bogatarkan, Esra Erdem
- URL: https://arxiv.org/abs/2507.20703
- Abstract:
MAPF problem aims to find plans for multiple agents in an environment within a given time, such that the agents do not collide with each other or obstacles. Motivated by the execution and monitoring of these plans, we study Dynamic MAPF (D-MAPF) problem, which allows changes such as agents entering/leaving the environment or obstacles being removed/moved. Considering the requirements of real-world applications in warehouses with the presence of humans, we introduce 1) a general definition for D-MAPF (applicable to variations of D-MAPF), 2) a new framework to solve D-MAPF (utilizing multi-shot computation, and allowing different methods to solve D-MAPF), and 3) a new ASP-based method to solve D-MAPF (combining advantages of replanning and repairing methods, with a novel concept of tunnels to specify where agents can move). We have illustrated the strengths and weaknesses of this method by experimental evaluations, from the perspectives of computational performance and quality of solutions.
15. Adaptive Fuzzy Time Series Forecasting via Partially Asymmetric Convolution and Sub-Sliding Window Fusion
- Authors: Lijian Li
- URL: https://arxiv.org/abs/2507.20641
- Abstract:
At present, state-of-the-art forecasting models are short of the ability to capture spatio-temporal dependency and synthesize global information at the stage of learning. To address this issue, in this paper, through the adaptive fuzzified construction of temporal data, we propose a novel convolutional architecture with partially asymmetric design based on the scheme of sliding window to realize accurate time series forecasting. First, the construction strategy of traditional fuzzy time series is improved to further extract short and long term temporal interrelation, which enables every time node to automatically possess corresponding global information and inner relationships among them in a restricted sliding window and the process does not require human involvement. Second, a bilateral Atrous algorithm is devised to reduce calculation demand of the proposed model without sacrificing global characteristics of elements. And it also allows the model to avoid processing redundant information. Third, after the transformation of time series, a partially asymmetric convolutional architecture is designed to more flexibly mine data features by filters in different directions on feature maps, which gives the convolutional neural network (CNN) the ability to construct sub-windows within existing sliding windows to model at a more fine-grained level. And after obtaining the time series information at different levels, the multi-scale features from different sub-windows will be sent to the corresponding network layer for time series information fusion. Compared with other competitive modern models, the proposed method achieves state-of-the-art results on most of popular time series datasets, which is fully verified by the experimental results.
16. Complementarity-driven Representation Learning for Multi-modal Knowledge Graph Completion
- Authors: Lijian Li
- URL: https://arxiv.org/abs/2507.20620
- Abstract:
Multi-modal Knowledge Graph Completion (MMKGC) aims to uncover hidden world knowledge in multimodal knowledge graphs by leveraging both multimodal and structural entity information. However, the inherent imbalance in multimodal knowledge graphs, where modality distributions vary across entities, poses challenges in utilizing additional modality data for robust entity representation. Existing MMKGC methods typically rely on attention or gate-based fusion mechanisms but overlook complementarity contained in multi-modal data. In this paper, we propose a novel framework named Mixture of Complementary Modality Experts (MoCME), which consists of a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism. The CMKF module exploits both intra-modal and inter-modal complementarity to fuse multi-view and multi-modal embeddings, enhancing representations of entities. Additionally, we introduce an Entropy-guided Negative Sampling mechanism to dynamically prioritize informative and uncertain negative samples to enhance training effectiveness and model robustness. Extensive experiments on five benchmark datasets demonstrate that our MoCME achieves state-of-the-art performance, surpassing existing approaches.
17. Enhancing Large Multimodal Models with Adaptive Sparsity and KV Cache Compression
- Authors: Te Zhang, Yuheng Li, Junxiang Wang, Lujun Li
- URL: https://arxiv.org/abs/2507.20613
- Abstract:
Large multimodal models (LMMs) have advanced significantly by integrating visual encoders with extensive language models, enabling robust reasoning capabilities. However, compressing LMMs for deployment on edge devices remains a critical challenge. In this work, we propose an adaptive search algorithm that optimizes sparsity and KV cache compression to enhance LMM efficiency. Utilizing the Tree-structured Parzen Estimator, our method dynamically adjusts pruning ratios and KV cache quantization bandwidth across different LMM layers, using model performance as the optimization objective. This approach uniquely combines pruning with key-value cache quantization and incorporates a fast pruning technique that eliminates the need for additional fine-tuning or weight adjustments, achieving efficient compression without compromising accuracy. Comprehensive evaluations on benchmark datasets, including LLaVA-1.5 7B and 13B, demonstrate our method superiority over state-of-the-art techniques such as SparseGPT and Wanda across various compression levels. Notably, our framework automatic allocation of KV cache compression resources sets a new standard in LMM optimization, delivering memory efficiency without sacrificing much performance.
18. Unlearning of Knowledge Graph Embedding via Preference Optimization
- Authors: Jiajun Liu, Wenjun Ke, Peng Wang, Yao He, Ziyu Shang, Guozheng Li, Zijie Xu, Ke Ji
- URL: https://arxiv.org/abs/2507.20566
- Abstract:
Existing knowledge graphs (KGs) inevitably contain outdated or erroneous knowledge that needs to be removed from knowledge graph embedding (KGE) models. To address this challenge, knowledge unlearning can be applied to eliminate specific information while preserving the integrity of the remaining knowledge in KGs. Existing unlearning methods can generally be categorized into exact unlearning and approximate unlearning. However, exact unlearning requires high training costs while approximate unlearning faces two issues when applied to KGs due to the inherent connectivity of triples: (1) It fails to fully remove targeted information, as forgetting triples can still be inferred from remaining ones. (2) It focuses on local data for specific removal, which weakens the remaining knowledge in the forgetting boundary. To address these issues, we propose GraphDPO, a novel approximate unlearning framework based on direct preference optimization (DPO). Firstly, to effectively remove forgetting triples, we reframe unlearning as a preference optimization problem, where the model is trained by DPO to prefer reconstructed alternatives over the original forgetting triples. This formulation penalizes reliance on forgettable knowledge, mitigating incomplete forgetting caused by KG connectivity. Moreover, we introduce an out-boundary sampling strategy to construct preference pairs with minimal semantic overlap, weakening the connection between forgetting and retained knowledge. Secondly, to preserve boundary knowledge, we introduce a boundary recall mechanism that replays and distills relevant information both within and across time steps. We construct eight unlearning datasets across four popular KGs with varying unlearning rates. Experiments show that GraphDPO outperforms state-of-the-art baselines by up to 10.1% in MRR_Avg and 14.0% in MRR_F1.
19. MeLA: A Metacognitive LLM-Driven Architecture for Automatic Heuristic Design
- Authors: Zishang Qiu, Xinan Chen, Long Chen, Ruibin Bai
- URL: https://arxiv.org/abs/2507.20541
- Abstract:
This paper introduces MeLA, a Metacognitive LLM-Driven Architecture that presents a new paradigm for Automatic Heuristic Design (AHD). Traditional evolutionary methods operate directly on heuristic code; in contrast, MeLA evolves the instructional prompts used to guide a Large Language Model (LLM) in generating these heuristics. This process of “prompt evolution” is driven by a novel metacognitive framework where the system analyzes performance feedback to systematically refine its generative strategy. MeLA’s architecture integrates a problem analyzer to construct an initial strategic prompt, an error diagnosis system to repair faulty code, and a metacognitive search engine that iteratively optimizes the prompt based on heuristic effectiveness. In comprehensive experiments across both benchmark and real-world problems, MeLA consistently generates more effective and robust heuristics, significantly outperforming state-of-the-art methods. Ultimately, this research demonstrates the profound potential of using cognitive science as a blueprint for AI architecture, revealing that by enabling an LLM to metacognitively regulate its problem-solving process, we unlock a more robust and interpretable path to AHD.
20. Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition
- Authors: Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, Valent Nathanael, Ayla Croft, Xander Davies, Jai Patel, Robert Kirk, Nate Burnikell, Yarin Gal, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson
- URL: https://arxiv.org/abs/2507.20526
- Abstract:
Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark - a curated set of high-impact attacks - and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today’s AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.
21. STARN-GAT: A Multi-Modal Spatio-Temporal Graph Attention Network for Accident Severity Prediction
- Authors: Pritom Ray Nobin, Imran Ahammad Rifat
- URL: https://arxiv.org/abs/2507.20451
- Abstract:
Accurate prediction of traffic accident severity is critical for improving road safety, optimizing emergency response strategies, and informing the design of safer transportation infrastructure. However, existing approaches often struggle to effectively model the intricate interdependencies among spatial, temporal, and contextual variables that govern accident outcomes. In this study, we introduce STARN-GAT, a Multi-Modal Spatio-Temporal Graph Attention Network, which leverages adaptive graph construction and modality-aware attention mechanisms to capture these complex relationships. Unlike conventional methods, STARN-GAT integrates road network topology, temporal traffic patterns, and environmental context within a unified attention-based framework. The model is evaluated on the Fatality Analysis Reporting System (FARS) dataset, achieving a Macro F1-score of 85 percent, ROC-AUC of 0.91, and recall of 81 percent for severe incidents. To ensure generalizability within the South Asian context, STARN-GAT is further validated on the ARI-BUET traffic accident dataset, where it attains a Macro F1-score of 0.84, recall of 0.78, and ROC-AUC of 0.89. These results demonstrate the model’s effectiveness in identifying high-risk cases and its potential for deployment in real-time, safety-critical traffic management systems. Furthermore, the attention-based architecture enhances interpretability, offering insights into contributing factors and supporting trust in AI-assisted decision-making. Overall, STARN-GAT bridges the gap between advanced graph neural network techniques and practical applications in road safety analytics.
22. Enhancing QoS in Edge Computing through Federated Layering Techniques: A Pathway to Resilient AI Lifelong Learning Systems
- Authors: Chengzhuo Han
- URL: https://arxiv.org/abs/2507.20444
- Abstract:
In the context of the rapidly evolving information technology landscape, marked by the advent of 6G communication networks, we face an increased data volume and complexity in network environments. This paper addresses these challenges by focusing on Quality of Service (QoS) in edge computing frameworks. We propose a novel approach to enhance QoS through the development of General Artificial Intelligence Lifelong Learning Systems, with a special emphasis on Federated Layering Techniques (FLT). Our work introduces a federated layering-based small model collaborative mechanism aimed at improving AI models’ operational efficiency and response time in environments where resources are limited. This innovative method leverages the strengths of cloud and edge computing, incorporating a negotiation and debate mechanism among small AI models to enhance reasoning and decision-making processes. By integrating model layering techniques with privacy protection measures, our approach ensures the secure transmission of model parameters while maintaining high efficiency in learning and reasoning capabilities. The experimental results demonstrate that our strategy not only enhances learning efficiency and reasoning accuracy but also effectively protects the privacy of edge nodes. This presents a viable solution for achieving resilient large model lifelong learning systems, with a significant improvement in QoS for edge computing environments.
23. Memorization in Fine-Tuned Large Language Models
- Authors: Danil Savine, Muni Sreenivas Pydi, Jamal Atif, Olivier Cappé
- URL: https://arxiv.org/abs/2507.21009
- Abstract:
This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model’s propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.
24. Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability
- Authors: Fang Li
- URL: https://arxiv.org/abs/2507.21004
- Abstract:
Deep Neural Networks (DNNs) deliver impressive performance but their black-box nature limits deployment in high-stakes domains requiring transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches that are limited to simple additive structures, CFNs support diverse compositional patterns – sequential, parallel, and conditional – enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training through standard gradient descent. We demonstrate CFNs’ versatility across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows CFNs achieve competitive performance against black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models like Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications where both performance and accountability are paramount.
25. Modular Delta Merging with Orthogonal Constraints: A Scalable Framework for Continual and Reversible Model Composition
- Authors: Haris Khan, Shumaila Asif, Sadia Asif
- URL: https://arxiv.org/abs/2507.20997
- Abstract:
In real-world machine learning deployments, models must be continually updated, composed, and when required, selectively undone. However, existing approaches to model merging and continual learning often suffer from task interference, catastrophic forgetting, or lack of reversibility. We propose Modular Delta Merging with Orthogonal Constraints (MDM-OC), a novel framework that enables scalable, interference-free, and reversible composition of fine-tuned models. Each task-specific model is encoded as a delta from a shared base and projected into an orthogonal subspace to eliminate conflict. These projected deltas are then merged via gradient-based optimization to form a unified model that retains performance across tasks. Our approach supports continual integration of new models, structured unmerging for compliance such as GDPR requirements, and model stability via elastic weight consolidation and synthetic replay. Extensive experiments on vision and natural language processing benchmarks demonstrate that MDM-OC outperforms prior baselines in accuracy, backward transfer, and unmerge fidelity, while remaining memory-efficient and computationally tractable. This framework offers a principled solution for modular and compliant AI system design.
26. Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM
- Authors: Shen Li, Liuyi Yao, Wujia Niu, Lan Zhang, Yaliang Li
- URL: https://arxiv.org/abs/2507.20994
- Abstract:
Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model’s parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs’ ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module’s textual “safety layers” in visual inputs, thereby effectively extending text-based safety to the visual modality.
27. Personalized Treatment Effect Estimation from Unstructured Data
- Authors: Henri Arno, Thomas Demeester
- URL: https://arxiv.org/abs/2507.20993
- Abstract:
Existing methods for estimating personalized treatment effects typically rely on structured covariates, limiting their applicability to unstructured data. Yet, leveraging unstructured data for causal inference has considerable application potential, for instance in healthcare, where clinical notes or medical images are abundant. To this end, we first introduce an approximate ‘plug-in’ method trained directly on the neural representations of unstructured data. However, when these fail to capture all confounding information, the method may be subject to confounding bias. We therefore introduce two theoretically grounded estimators that leverage structured measurements of the confounders during training, but allow estimating personalized treatment effects purely from unstructured inputs, while avoiding confounding bias. When these structured measurements are only available for a non-representative subset of the data, these estimators may suffer from sampling bias. To address this, we further introduce a regression-based correction that accounts for the non-uniform sampling, assuming the sampling mechanism is known or can be well-estimated. Our experiments on two benchmark datasets show that the plug-in method, directly trainable on large unstructured datasets, achieves strong empirical performance across all settings, despite its simplicity.
28. JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
- Authors: Xinhan Di, Kristin Qi, Pengqian Yu
- URL: https://arxiv.org/abs/2507.20987
- Abstract:
Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive eval- uation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region- specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I(JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evalua- tion protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which incidates essential areas for future research. The dataset and evaluation tools are publicly available at this https URL.
29. SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
- Authors: Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen
- URL: https://arxiv.org/abs/2507.20984
- Abstract:
While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, We introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at this http URL and this http URL.
30. From Entanglement to Alignment: Representation Space Decomposition for Unsupervised Time Series Domain Adaptation
- Authors: Rongyao Cai, Ming Jin, Qingsong Wen, Kexin Zhang
- URL: https://arxiv.org/abs/2507.20968
- Abstract:
Domain shift poses a fundamental challenge in time series analysis, where models trained on source domain often fail dramatically when applied in target domain with different yet similar distributions. While current unsupervised domain adaptation (UDA) methods attempt to align cross-domain feature distributions, they typically treat features as indivisible entities, ignoring their intrinsic compositions that governs domain adaptation. We introduce DARSD, a novel UDA framework with theoretical explainability that explicitly realizes UDA tasks from the perspective of representation space decomposition. Our core insight is that effective domain adaptation requires not just alignment, but principled disentanglement of transferable knowledge from mixed representations. DARSD consists three synergistic components: (I) An adversarial learnable common invariant basis that projects original features into a domain-invariant subspace while preserving semantic content; (II) A prototypical pseudo-labeling mechanism that dynamically separates target features based on confidence, hindering error accumulation; (III) A hybrid contrastive optimization strategy that simultaneously enforces feature clustering and consistency while mitigating emerging distribution gaps. Comprehensive experiments conducted on four benchmark datasets (WISDM, HAR, HHAR, and MFD) demonstrate DARSD’s superiority against 12 UDA algorithms, achieving optimal performance in 35 out of 53 cross-domain scenarios.
31. Handoff Design in User-Centric Cell-Free Massive MIMO Networks Using DRL
- Authors: Hussein A. Ammar, Raviraj Adve, Shahram Shahbazpanahi, Gary Boudreau, Israfil Bahceci
- URL: https://arxiv.org/abs/2507.20966
- Abstract:
In the user-centric cell-free massive MIMO (UC-mMIMO) network scheme, user mobility necessitates updating the set of serving access points to maintain the user-centric clustering. Such updates are typically performed through handoff (HO) operations; however, frequent HOs lead to overheads associated with the allocation and release of resources. This paper presents a deep reinforcement learning (DRL)-based solution to predict and manage these connections for mobile users. Our solution employs the Soft Actor-Critic algorithm, with continuous action space representation, to train a deep neural network to serve as the HO policy. We present a novel proposition for a reward function that integrates a HO penalty in order to balance the attainable rate and the associated overhead related to HOs. We develop two variants of our system; the first one uses mobility direction-assisted (DA) observations that are based on the user movement pattern, while the second one uses history-assisted (HA) observations that are based on the history of the large-scale fading (LSF). Simulation results show that our DRL-based continuous action space approach is more scalable than discrete space counterpart, and that our derived HO policy automatically learns to gather HOs in specific time slots to minimize the overhead of initiating HOs. Our solution can also operate in real time with a response time less than 0.4 ms.
32. Your AI, Not Your View: The Bias of LLMs in Investment Analysis
- Authors: Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, Yongjae Lee
- URL: https://arxiv.org/abs/2507.20957
- Abstract:
In finance, Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. These conflicts become particularly problematic when LLMs are deployed in real-world investment services, where misalignment between a model’s embedded preferences and those of the financial institution can lead to unreliable recommendations. Yet little research has examined what investment views LLMs actually hold. We propose an experimental framework to investigate such conflicts, offering the first quantitative analysis of confirmation bias in LLM-based investment analysis. Using hypothetical scenarios with balanced and imbalanced arguments, we extract models’ latent preferences and measure their persistence. Focusing on sector, size, and momentum, our analysis reveals distinct, model-specific tendencies. In particular, we observe a consistent preference for large-cap stocks and contrarian strategies across most models. These preferences often harden into confirmation bias, with models clinging to initial judgments despite counter-evidence.
33. Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models
- Authors: Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous
- URL: https://arxiv.org/abs/2507.20956
- Abstract:
Instruction-tuning large language models (LLMs) reduces the diversity of their outputs, which has implications for many tasks, particularly for creative tasks. This paper investigates the ``diversity gap’’ for a writing prompt narrative generation task. This gap emerges as measured by current diversity metrics for various open-weight and open-source LLMs. The results show significant decreases in diversity due to instruction-tuning. We explore the diversity loss at each fine-tuning stage for the OLMo and OLMo 2 models to further understand how output diversity is affected. The results indicate that DPO has the most substantial impact on diversity. Motivated by these findings, we present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity. We show that conformative decoding typically increases diversity and even maintains or improves quality.
34. Multivariate Conformal Prediction via Conformalized Gaussian Scoring
- Authors: Sacha Braun, Eugène Berta, Michael I. Jordan, Francis Bach
- URL: https://arxiv.org/abs/2507.20941
- Abstract:
While achieving exact conditional coverage in conformal prediction is unattainable without making strong, untestable regularity assumptions, the promise of conformal prediction hinges on finding approximations to conditional guarantees that are realizable in practice. A promising direction for obtaining conditional dependence for conformal sets–in particular capturing heteroskedasticity–is through estimating the conditional density $\mathbb{P}_{Y X}$ and conformalizing its level sets. Previous work in this vein has focused on nonconformity scores based on the empirical cumulative distribution function (CDF). Such scores are, however, computationally costly, typically requiring expensive sampling methods. To avoid the need for sampling, we observe that the CDF-based score reduces to a Mahalanobis distance in the case of Gaussian scores, yielding a closed-form expression that can be directly conformalized. Moreover, the use of a Gaussian-based score opens the door to a number of extensions of the basic conformal method; in particular, we show how to construct conformal sets with missing output values, refine conformal sets as partial information about $Y$ becomes available, and construct conformal sets on transformations of the output space. Finally, empirical results indicate that our approach produces conformal sets that more closely approximate conditional coverage in multivariate settings compared to alternative methods.
35. Dissecting Persona-Driven Reasoning in Language Models via Activation Patching
- Authors: Ansh Poonia, Maeghal Jain
- URL: https://arxiv.org/abs/2507.20936
- Abstract:
Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model’s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model’s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.
36. FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
- Authors: Likun Tan, Kuan-Wei Huang, Kevin Wu
- URL: https://arxiv.org/abs/2507.20930
- Abstract:
Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at this https URL.
37. FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models
- Authors: Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer
- URL: https://arxiv.org/abs/2507.20924
- Abstract:
Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge is initiated at CLEF 2025. Among this year’s international benchmarks, we concentrate on solving the first task aiming to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 - Sexism Identification in Tweets, Subtask 1.2 - Source Intention in Tweets, and Subtask 1.3 - Sexism Categorization in Tweets. We implement three models to address each subtask which constitute three individual runs: Speech Concept Bottleneck Model (SCBM), Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to encode input texts into a human-interpretable representation of adjectives, then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators’ demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.
38. Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
- Authors: Minh Hieu Ha, Hung Phan, Tung Duy Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
- URL: https://arxiv.org/abs/2507.20923
- Abstract:
Multi-objective combinatorial optimization problems (MOCOP) frequently arise in practical applications that require the simultaneous optimization of conflicting objectives. Although traditional evolutionary algorithms can be effective, they typically depend on domain knowledge and repeated parameter tuning, limiting flexibility when applied to unseen MOCOP instances. Recently, integration of Large Language Models (LLMs) into evolutionary computation has opened new avenues for automatic heuristic generation, using their advanced language understanding and code synthesis capabilities. Nevertheless, most existing approaches predominantly focus on single-objective tasks, often neglecting key considerations such as runtime efficiency and heuristic diversity in multi-objective settings. To bridge this gap, we introduce Multi-heuristics for MOCOP via Pareto-Grid-guided Evolution of LLMs (MPaGE), a novel enhancement of the Simple Evolutionary Multiobjective Optimization (SEMO) framework that leverages LLMs and Pareto Front Grid (PFG) technique. By partitioning the objective space into grids and retaining top-performing candidates to guide heuristic generation, MPaGE utilizes LLMs to prioritize heuristics with semantically distinct logical structures during variation, thus promoting diversity and mitigating redundancy within the population. Through extensive evaluations, MPaGE demonstrates superior performance over existing LLM-based frameworks, and achieves competitive results to traditional Multi-objective evolutionary algorithms (MOEAs), with significantly faster runtime. Our code is available at: this https URL.
39. Modeling User Behavior from Adaptive Surveys with Supplemental Context
- Authors: Aman Shukla, Daniel Patrick Scantlebury, Rishabh Kumar
- URL: https://arxiv.org/abs/2507.20919
- Abstract:
Modeling user behavior is critical across many industries where understanding preferences, intent, or decisions informs personalization, targeting, and strategic outcomes. Surveys have long served as a classical mechanism for collecting such behavioral data due to their interpretability, structure, and ease of deployment. However, surveys alone are inherently limited by user fatigue, incomplete responses, and practical constraints on their length making them insufficient for capturing user behavior. In this work, we present LANTERN (Late-Attentive Network for Enriched Response Modeling), a modular architecture for modeling user behavior by fusing adaptive survey responses with supplemental contextual signals. We demonstrate the architectural value of maintaining survey primacy through selective gating, residual connections and late fusion via cross-attention, treating survey data as the primary signal while incorporating external modalities only when relevant. LANTERN outperforms strong survey-only baselines in multi-label prediction of survey responses. We further investigate threshold sensitivity and the benefits of selective modality reliance through ablation and rare/frequent attribute analysis. LANTERN’s modularity supports scalable integration of new encoders and evolving datasets. This work provides a practical and extensible blueprint for behavior modeling in survey-centric applications.
40. MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
- Authors: Adrien Bazoge
- URL: https://arxiv.org/abs/2507.20917
- Abstract:
This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models’ cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models’ performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.
41. HAMLET-FFD: Hierarchical Adaptive Multi-modal Learning Embeddings Transformation for Face Forgery Detection
- Authors: Jialei Cui, Jianwei Du, Yanzhe Li, Lei Gao, Hui Jiang, Chenfu Bao
- URL: https://arxiv.org/abs/2507.20913
- Abstract:
The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive prompts. This closed-loop process progressively aligns visual observations with semantic priors to enhance authenticity assessment. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin that preserves CLIP’s original capabilities. Extensive experiments demonstrate its superior generalization to unseen manipulations across multiple benchmarks, and visual analyses reveal a division of labor among embeddings, with distinct representations specializing in fine-grained artifact recognition.
42. SCORPION: Addressing Scanner-Induced Variability in Histopathology
- Authors: Jeongun Ryu, Heon Song, Seungeun Lee, Soo Ick Cho, Jiwon Shin, Kyunghyun Paeng, Sérgio Pereira
- URL: https://arxiv.org/abs/2507.20907
- Abstract:
Ensuring reliable model performance across diverse domains is a critical challenge in computational pathology. A particular source of variability in Whole-Slide Images is introduced by differences in digital scanners, thus calling for better scanner generalization. This is critical for the real-world adoption of computational pathology, where the scanning devices may differ per institution or hospital, and the model should not be dependent on scanner-induced details, which can ultimately affect the patient’s diagnosis and treatment planning. However, past efforts have primarily focused on standard domain generalization settings, evaluating on unseen scanners during training, without directly evaluating consistency across scanners for the same tissue. To overcome this limitation, we introduce SCORPION, a new dataset explicitly designed to evaluate model reliability under scanner variability. SCORPION includes 480 tissue samples, each scanned with 5 scanners, yielding 2,400 spatially aligned patches. This scanner-paired design allows for the isolation of scanner-induced variability, enabling a rigorous evaluation of model consistency while controlling for differences in tissue composition. Furthermore, we propose SimCons, a flexible framework that combines augmentation-based domain generalization techniques with a consistency loss to explicitly address scanner generalization. We empirically show that SimCons improves model consistency on varying scanners without compromising task-specific performance. By releasing the SCORPION dataset and proposing SimCons, we provide the research community with a crucial resource for evaluating and improving model consistency across diverse scanners, setting a new standard for reliability testing.
43. Music Arena: Live Evaluation for Text-to-Music
- Authors: Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue
- URL: https://arxiv.org/abs/2507.20900
- Abstract:
We present Music Arena, an open platform for scalable human preference evaluation of text-to-music (TTM) models. Soliciting human preferences via listening studies is the gold standard for evaluation in TTM, but these studies are expensive to conduct and difficult to compare, as study protocols may differ across systems. Moreover, human preferences might help researchers align their TTM systems or improve automatic evaluation metrics, but an open and renewable source of preferences does not currently exist. We aim to fill these gaps by offering live evaluation for TTM. In Music Arena, real-world users input text prompts of their choosing and compare outputs from two TTM systems, and their preferences are used to compile a leaderboard. While Music Arena follows recent evaluation trends in other AI domains, we also design it with key features tailored to music: an LLM-based routing system to navigate the heterogeneous type signatures of TTM systems, and the collection of detailed preferences including listening data and natural language feedback. We also propose a rolling data release policy with user privacy guarantees, providing a renewable source of preference data and increasing platform transparency. Through its standardized evaluation protocol, transparent data access policies, and music-specific features, Music Arena not only addresses key challenges in the TTM ecosystem but also demonstrates how live evaluation can be thoughtfully adapted to unique characteristics of specific AI domains. Music Arena is available at: this https URL
44. JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment
- Authors: Renhang Liu, Chia-Yu Hung, Navonil Majumder, Taylor Gautreaux, Amir Ali Bagherzadeh, Chuan Li, Dorien Herremans, Soujanya Poria
- URL: https://arxiv.org/abs/2507.20880
- Abstract:
Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high quality and faithful audio outputs capturing to speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as, DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need or manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.
45. Not Only Grey Matter: OmniBrain for Robust Multimodal Classification of Alzheimer’s Disease
- Authors: Ahmed Sharshar, Yasser Ashraf, Tameem Bakr, Salma Hassan, Hosam Elgendy, Mohammad Yaqub, Mohsen Guizani
- URL: https://arxiv.org/abs/2507.20872
- Abstract:
Alzheimer’s disease affects over 55 million people worldwide and is projected to more than double by 2050, necessitating rapid, accurate, and scalable diagnostics. However, existing approaches are limited because they cannot achieve clinically acceptable accuracy, generalization across datasets, robustness to missing modalities, and explainability all at the same time. This inability to satisfy all these requirements simultaneously undermines their reliability in clinical settings. We propose OmniBrain, a multimodal framework that integrates brain MRI, radiomics, gene expression, and clinical data using a unified model with cross-attention and modality dropout. OmniBrain achieves $92.2 \pm 2.4\%$accuracy on the ANMerge dataset and generalizes to the MRI-only ADNI dataset with $70.4 \pm 2.7\%$ accuracy, outperforming unimodal and prior multimodal approaches. Explainability analyses highlight neuropathologically relevant brain regions and genes, enhancing clinical trust. OmniBrain offers a robust, interpretable, and practical solution for real-world Alzheimer’s diagnosis.
46. Geometry of Neural Reinforcement Learning in Continuous State and Action Spaces
- Authors: Saket Tiwari, Omer Gottesman, George Konidaris
- URL: https://arxiv.org/abs/2507.20853
- Abstract:
Advances in reinforcement learning (RL) have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces by employing a geometric lens to understand the locally attained set of states. The set of all parametrised policies learnt through a semi-gradient based approach induces a set of attainable states in RL. We show that the training dynamics of a two-layer neural policy induce a low dimensional manifold of attainable states embedded in the high-dimensional nominal state space trained using an actor-critic algorithm. We prove that, under certain conditions, the dimensionality of this manifold is of the order of the dimensionality of the action space. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments and also demonstrate the results in a toy environment with varying dimensionality. We also show the applicability of this theoretical result by introducing a local manifold learning layer to the policy and value function networks to improve the performance in control environments with very high degrees of freedom by changing one layer of the neural network to learn sparse representations.
47. Free Energy-Inspired Cognitive Risk Integration for AV Navigation in Pedestrian-Rich Environments
- Authors: Meiting Dang, Yanping Wu, Yafei Wang, Dezong Zhao, David Flynn, Chongfeng Wei
- URL: https://arxiv.org/abs/2507.20850
- Abstract:
Recent advances in autonomous vehicle (AV) behavior planning have shown impressive social interaction capabilities when interacting with other road users. However, achieving human-like prediction and decision-making in interactions with vulnerable road users remains a key challenge in complex multi-agent interactive environments. Existing research focuses primarily on crowd navigation for small mobile robots, which cannot be directly applied to AVs due to inherent differences in their decision-making strategies and dynamic boundaries. Moreover, pedestrians in these multi-agent simulations follow fixed behavior patterns that cannot dynamically respond to AV actions. To overcome these limitations, this paper proposes a novel framework for modeling interactions between the AV and multiple pedestrians. In this framework, a cognitive process modeling approach inspired by the Free Energy Principle is integrated into both the AV and pedestrian models to simulate more realistic interaction dynamics. Specifically, the proposed pedestrian Cognitive-Risk Social Force Model adjusts goal-directed and repulsive forces using a fused measure of cognitive uncertainty and physical risk to produce human-like trajectories. Meanwhile, the AV leverages this fused risk to construct a dynamic, risk-aware adjacency matrix for a Graph Convolutional Network within a Soft Actor-Critic architecture, allowing it to make more reasonable and informed decisions. Simulation results indicate that our proposed framework effectively improves safety, efficiency, and smoothness of AV navigation compared to the state-of-the-art method.
48. First Hallucination Tokens Are Different from Conditional Ones
- Authors: Jakob Snel, Seong Joon Oh
- URL: https://arxiv.org/abs/2507.20836
- Abstract:
Hallucination, the generation of untruthful content, is one of the major concerns regarding foundational models. Detecting hallucinations at the token level is vital for real-time filtering and targeted correction, yet the variation of hallucination signals within token sequences is not fully understood. Leveraging the RAGTruth corpus with token-level annotations and reproduced logits, we analyse how these signals depend on a token’s position within hallucinated spans, contributing to an improved understanding of token-level hallucination. Our results show that the first hallucinated token carries a stronger signal and is more detectable than conditional tokens. We release our analysis framework, along with code for logit reproduction and metric computation at this https URL.
49. Why Flow Matching is Particle Swarm Optimization?
- Authors: Kaichen Ouyang
- URL: https://arxiv.org/abs/2507.20810
- Abstract:
This paper preliminarily investigates the duality between flow matching in generative models and particle swarm optimization (PSO) in evolutionary computation. Through theoretical analysis, we reveal the intrinsic connections between these two approaches in terms of their mathematical formulations and optimization mechanisms: the vector field learning in flow matching shares similar mathematical expressions with the velocity update rules in PSO; both methods follow the fundamental framework of progressive evolution from initial to target distributions; and both can be formulated as dynamical systems governed by ordinary differential equations. Our study demonstrates that flow matching can be viewed as a continuous generalization of PSO, while PSO provides a discrete implementation of swarm intelligence principles. This duality understanding establishes a theoretical foundation for developing novel hybrid algorithms and creates a unified framework for analyzing both methods. Although this paper only presents preliminary discussions, the revealed correspondences suggest several promising research directions, including improving swarm intelligence algorithms based on flow matching principles and enhancing generative models using swarm intelligence concepts.
50. LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations
- Authors: Vinil Polepalli
- URL: https://arxiv.org/abs/2507.20800
- Abstract:
The invasive spotted lanternfly (SLF) poses a significant threat to agriculture and ecosystems, causing widespread damage. Current control methods, such as egg scraping, pesticides, and quarantines, prove labor-intensive, environmentally hazardous, and inadequate for long-term SLF suppression. This research introduces LanternNet, a novel autonomous robotic Hub-and-Spoke system designed for scalable detection and suppression of SLF populations. A central, tree-mimicking hub utilizes a YOLOv8 computer vision model for precise SLF identification. Three specialized robotic spokes perform targeted tasks: pest neutralization, environmental monitoring, and navigation/mapping. Field deployment across multiple infested sites over 5 weeks demonstrated LanternNet’s efficacy. Quantitative analysis revealed significant reductions (p < 0.01, paired t-tests) in SLF populations and corresponding improvements in tree health indicators across the majority of test sites. Compared to conventional methods, LanternNet offers substantial cost advantages and improved scalability. Furthermore, the system’s adaptability for enhanced autonomy and targeting of other invasive species presents significant potential for broader ecological impact. LanternNet demonstrates the transformative potential of integrating robotics and AI for advanced invasive species management and improved environmental outcomes.
51. Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach
- Authors: Wei Lu, Daniel L. Chen, Christian B. Hansen
- URL: https://arxiv.org/abs/2507.20796
- Abstract:
Understanding how large language model (LLM) agents behave in strategic interactions is essential as these systems increasingly participate autonomously in economically and morally consequential decisions. We evaluate LLM preferences using canonical economic games, finding substantial deviations from human behavior. Models like GPT-4o show excessive cooperation and limited incentive sensitivity, while reasoning models, such as o3-mini, align more consistently with payoff-maximizing strategies. We propose a supervised fine-tuning pipeline that uses synthetic datasets derived from economic reasoning to align LLM agents with economic preferences, focusing on two stylized preference structures. In the first, utility depends only on individual payoffs (homo economicus), while utility also depends on a notion of Kantian universalizability in the second preference structure (homo moralis). We find that fine-tuning based on small datasets shifts LLM agent behavior toward the corresponding economic agent. We further assess the fine-tuned agents’ behavior in two applications: Moral dilemmas involving autonomous vehicles and algorithmic pricing in competitive markets. These examples illustrate how different normative objectives embedded via realizations from structured preference structures can influence market and moral outcomes. This work contributes a replicable, cost-efficient, and economically grounded pipeline to align AI preferences using moral-economic principles.
52. Investigation of Accuracy and Bias in Face Recognition Trained with Synthetic Data
- Authors: Pavel Korshunov, Ketan Kotwal, Christophe Ecabert, Vidit Vidit, Amir Mohammadi, Sebastien Marcel
- URL: https://arxiv.org/abs/2507.20782
- Abstract:
Synthetic data has emerged as a promising alternative for training face recognition (FR) models, offering advantages in scalability, privacy compliance, and potential for bias mitigation. However, critical questions remain on whether both high accuracy and fairness can be achieved with synthetic data. In this work, we evaluate the impact of synthetic data on bias and performance of FR systems. We generate balanced face dataset, FairFaceGen, using two state of the art text-to-image generators, Flux.1-dev and Stable Diffusion v3.5 (SD35), and combine them with several identity augmentation methods, including Arc2Face and four IP-Adapters. By maintaining equal identity count across synthetic and real datasets, we ensure fair comparisons when evaluating FR performance on standard (LFW, AgeDB-30, etc.) and challenging IJB-B/C benchmarks and FR bias on Racial Faces in-the-Wild (RFW) dataset. Our results demonstrate that although synthetic data still lags behind the real datasets in the generalization on IJB-B/C, demographically balanced synthetic datasets, especially those generated with SD35, show potential for bias mitigation. We also observe that the number and quality of intra-class augmentations significantly affect FR accuracy and fairness. These findings provide practical guidelines for constructing fairer FR systems using synthetic data.
53. Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
- Authors: Matan Kichler, Shai Bagon, Mark Sheinin
- URL: https://arxiv.org/abs/2507.20757
- Abstract:
Computer vision seeks to infer a wide range of information about objects and events. However, vision systems based on conventional imaging are limited to extracting information only from the visible surfaces of scene objects. For instance, a vision system can detect and identify a Coke can in the scene, but it cannot determine whether the can is full or empty. In this paper, we aim to expand the scope of computer vision to include the novel task of inferring the hidden liquid levels of opaque containers by sensing the tiny vibrations on their surfaces. Our method provides a first-of-a-kind way to inspect the fill level of multiple sealed containers remotely, at once, without needing physical manipulation and manual weighing. First, we propose a novel speckle-based vibration sensing system for simultaneously capturing scene vibrations on a 2D grid of points. We use our system to efficiently and remotely capture a dataset of vibration responses for a variety of everyday liquid containers. Then, we develop a transformer-based approach for analyzing the captured vibrations and classifying the container type and its hidden liquid level at the time of measurement. Our architecture is invariant to the vibration source, yielding correct liquid level estimates for controlled and ambient scene sound sources. Moreover, our model generalizes to unseen container instances within known classes (e.g., training on five Coke cans of a six-pack, testing on a sixth) and fluid levels. We demonstrate our method by recovering liquid levels from various everyday containers.
54. Industry Insights from Comparing Deep Learning and GBDT Models for E-Commerce Learning-to-Rank
- Authors: Yunus Lutz, Timo Wilm, Philipp Duwe
- URL: https://arxiv.org/abs/2507.20753
- Abstract:
In e-commerce recommender and search systems, tree-based models, such as LambdaMART, have set a strong baseline for Learning-to-Rank (LTR) tasks. Despite their effectiveness and widespread adoption in industry, the debate continues whether deep neural networks (DNNs) can outperform traditional tree-based models in this domain. To contribute to this discussion, we systematically benchmark DNNs against our production-grade LambdaMART model. We evaluate multiple DNN architectures and loss functions on a proprietary dataset from OTTO and validate our findings through an 8-week online A/B test. The results show that a simple DNN architecture outperforms a strong tree-based baseline in terms of total clicks and revenue, while achieving parity in total units sold.
55. AR-LIF: Adaptive reset leaky-integrate and fire neuron for spiking neural networks
- Authors: Zeyu Huang, Wei Meng, Quan Liu, Kun Chen, Li Ma
- URL: https://arxiv.org/abs/2507.20746
- Abstract:
Spiking neural networks possess the advantage of low energy consumption due to their event-driven nature. Compared with binary spike outputs, their inherent floating-point dynamics are more worthy of attention. The threshold level and re- set mode of neurons play a crucial role in determining the number and timing of spikes. The existing hard reset method causes information loss, while the improved soft reset method adopts a uniform treatment for neurons. In response to this, this paper designs an adaptive reset neuron, establishing the correlation between input, output and reset, and integrating a simple yet effective threshold adjustment strategy. It achieves excellent performance on various datasets while maintaining the advantage of low energy consumption.
56. Regularizing Subspace Redundancy of Low-Rank Adaptation
- Authors: Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
- URL: https://arxiv.org/abs/2507.20745
- Abstract:
Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: this https URL.
57. Multi-Masked Querying Network for Robust Emotion Recognition from Incomplete Multi-Modal Physiological Signals
- Authors: Geng-Xin Xu, Xiang Zuo, Ye Li
- URL: https://arxiv.org/abs/2507.20737
- Abstract:
Emotion recognition from physiological data is crucial for mental health assessment, yet it faces two significant challenges: incomplete multi-modal signals and interference from body movements and artifacts. This paper presents a novel Multi-Masked Querying Network (MMQ-Net) to address these issues by integrating multiple querying mechanisms into a unified framework. Specifically, it uses modality queries to reconstruct missing data from incomplete signals, category queries to focus on emotional state features, and interference queries to separate relevant information from noise. Extensive experiment results demonstrate the superior emotion recognition performance of MMQ-Net compared to existing approaches, particularly under high levels of data incompleteness.
58. Prostate Cancer Classification Using Multimodal Feature Fusion and Explainable AI
- Authors: Asma Sadia Khan, Fariba Tasnia Khan, Tanjim Mahmud, Salman Karim Khan, Rishita Chakma, Nahed Sharmen, Mohammad Shahadat Hossain, Karl Andersson
- URL: https://arxiv.org/abs/2507.20714
- Abstract:
Prostate cancer, the second most prevalent male malignancy, requires advanced diagnostic tools. We propose an explainable AI system combining BERT (for textual clinical notes) and Random Forest (for numerical lab data) through a novel multimodal fusion strategy, achieving superior classification performance on PLCO-NIH dataset (98% accuracy, 99% AUC). While multimodal fusion is established, our work demonstrates that a simple yet interpretable BERT+RF pipeline delivers clinically significant improvements - particularly for intermediate cancer stages (Class 2/3 recall: 0.900 combined vs 0.824 numerical/0.725 textual). SHAP analysis provides transparent feature importance rankings, while ablation studies prove textual features’ complementary value. This accessible approach offers hospitals a balance of high performance (F1=89%), computational efficiency, and clinical interpretability - addressing critical needs in prostate cancer diagnostics.
59. Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models
- Authors: Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas
- URL: https://arxiv.org/abs/2507.20704
- Abstract:
The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets heavily lean towards text-only prompts, leaving visual vulnerabilities under evaluated. To address this gap, we propose \textbf{Text2VLM}, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models’ alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, ensuring the alignment of extracted salient concepts; text summarization and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.
60. A Multimodal Architecture for Endpoint Position Prediction in Team-based Multiplayer Games
- Authors: Jonas Peche, Aliaksei Tsishurou, Alexander Zap, Guenter Wallner
- URL: https://arxiv.org/abs/2507.20670
- Abstract:
Understanding and predicting player movement in multiplayer games is crucial for achieving use cases such as player-mimicking bot navigation, preemptive bot control, strategy recommendation, and real-time player behavior analytics. However, the complex environments allow for a high degree of navigational freedom, and the interactions and team-play between players require models that make effective use of the available heterogeneous input data. This paper presents a multimodal architecture for predicting future player locations on a dynamic time horizon, using a U-Net-based approach for calculating endpoint location probability heatmaps, conditioned using a multimodal feature encoder. The application of a multi-head attention mechanism for different groups of features allows for communication between agents. In doing so, the architecture makes efficient use of the multimodal game state including image inputs, numerical and categorical features, as well as dynamic game data. Consequently, the presented technique lays the foundation for various downstream tasks that rely on future player positions such as the creation of player-predictive bot behavior or player anomaly detection.
61. MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection
- Authors: Harsh Purohit, Tomoya Nishida, Kota Dohi, Takashi Endo, Yohei Kawaguchi
- URL: https://arxiv.org/abs/2507.20666
- Abstract:
This paper proposes a method for generating machine-type-specific anomalies to evaluate the relative performance of unsupervised anomalous sound detection (UASD) systems across different machine types, even in the absence of real anomaly sound data. Conventional keyword-based data augmentation methods often produce unrealistic sounds due to their reliance on manually defined labels, limiting scalability as machine types and anomaly patterns diversify. Advanced audio generative models, such as MIMII-Gen, show promise but typically depend on anomalous training data, making them less effective when diverse anomalous examples are unavailable. To address these limitations, we propose a novel synthesis approach leveraging large language models (LLMs) to interpret textual descriptions of faults and automatically select audio transformation functions, converting normal machine sounds into diverse and plausible anomalous sounds. We validate this approach by evaluating a UASD system trained only on normal sounds from five machine types, using both real and synthetic anomaly data. Experimental results reveal consistent trends in relative detection difficulty across machine types between synthetic and real anomalies. This finding supports our hypothesis and highlights the effectiveness of the proposed LLM-based synthesis approach for relative evaluation of UASD systems.
62. Hot-Swap MarkBoard: An Efficient Black-box Watermarking Approach for Large-scale Model Distribution
- Authors: Zhicheng Zhang, Peizhuo Lv, Mengke Wan, Jiang Fang, Diandian Guo, Yezeng Chen, Yinlong Liu, Wei Ma, Jiyan Sun, Liru Geng
- URL: https://arxiv.org/abs/2507.20650
- Abstract:
Recently, Deep Learning (DL) models have been increasingly deployed on end-user devices as On-Device AI, offering improved efficiency and privacy. However, this deployment trend poses more serious Intellectual Property (IP) risks, as models are distributed on numerous local devices, making them vulnerable to theft and redistribution. Most existing ownership protection solutions (e.g., backdoor-based watermarking) are designed for cloud-based AI-as-a-Service (AIaaS) and are not directly applicable to large-scale distribution scenarios, where each user-specific model instance must carry a unique watermark. These methods typically embed a fixed watermark, and modifying the embedded watermark requires retraining the model. To address these challenges, we propose Hot-Swap MarkBoard, an efficient watermarking method. It encodes user-specific $n$-bit binary signatures by independently embedding multiple watermarks into a multi-branch Low-Rank Adaptation (LoRA) module, enabling efficient watermark customization without retraining through branch swapping. A parameter obfuscation mechanism further entangles the watermark weights with those of the base model, preventing removal without degrading model performance. The method supports black-box verification and is compatible with various model architectures and DL tasks, including classification, image generation, and text generation. Extensive experiments across three types of tasks and six backbone models demonstrate our method’s superior efficiency and adaptability compared to existing approaches, achieving 100\% verification accuracy.
63. Ontology-Enhanced Knowledge Graph Completion using Large Language Models
- Authors: Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen
- URL: https://arxiv.org/abs/2507.20643
- Abstract:
Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology enhanced KGC method using LLMs – OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that needs to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks – FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.
64. TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
- Authors: Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
- URL: https://arxiv.org/abs/2507.20630
- Abstract:
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV)-which measures changes in both the magnitude and direction of token representations-and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at this https URL.
65. Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
- Authors: Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun
- URL: https://arxiv.org/abs/2507.20627
- Abstract:
Music enhances video narratives and emotions, driving demand for automatic video-to-music (V2M) generation. However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. Our method uses a two-stage training strategy that enables learning of V2M fundamentals and audiovisual temporal synchronization while meeting users’ needs for multi-condition control. In the first stage, we introduce a fine-grained feature selection module and a progressive temporal alignment attention mechanism to ensure flexible feature alignment. For the second stage, we develop a dynamic conditional fusion module and a control-guided decoder module to integrate multiple conditions and accurately guide the music composition process. Extensive experiments demonstrate that our method outperforms existing V2M pipelines in both subjective and objective evaluations, significantly enhancing control and alignment with user expectations.
66. Lightweight Remote Sensing Scene Classification on Edge Devices via Knowledge Distillation and Early-exit
- Authors: Yang Zhao, Shusheng Li, Xueshang Feng
- URL: https://arxiv.org/abs/2507.20623
- Abstract:
As the development of lightweight deep learning algorithms, various deep neural network (DNN) models have been proposed for the remote sensing scene classification (RSSC) application. However, it is still challenging for these RSSC models to achieve optimal performance among model accuracy, inference latency, and energy consumption on resource-constrained edge devices. In this paper, we propose a lightweight RSSC framework, which includes a distilled global filter network (GFNet) model and an early-exit mechanism designed for edge devices to achieve state-of-the-art performance. Specifically, we first apply frequency domain distillation on the GFNet model to reduce model size. Then we design a dynamic early-exit model tailored for DNN models on edge devices to further improve model inference efficiency. We evaluate our E3C model on three edge devices across four datasets. Extensive experimental results show that it achieves an average of 1.3x speedup on model inference and over 40% improvement on energy efficiency, while maintaining high classification accuracy.
67. Beyond Interactions: Node-Level Graph Generation for Knowledge-Free Augmentation in Recommender Systems
- Authors: Zhaoyan Wang, Hyunjun Ahn, In-Young Ko
- URL: https://arxiv.org/abs/2507.20578
- Abstract:
Recent advances in recommender systems rely on external resources such as knowledge graphs or large language models to enhance recommendations, which limit applicability in real-world settings due to data dependency and computational overhead. Although knowledge-free models are able to bolster recommendations by direct edge operations as well, the absence of augmentation primitives drives them to fall short in bridging semantic and structural gaps as high-quality paradigm substitutes. Unlike existing diffusion-based works that remodel user-item interactions, this work proposes NodeDiffRec, a pioneering knowledge-free augmentation framework that enables fine-grained node-level graph generation for recommendations and expands the scope of restricted augmentation primitives via diffusion. By synthesizing pseudo-items and corresponding interactions that align with the underlying distribution for injection, and further refining user preferences through a denoising preference modeling process, NodeDiffRec dramatically enhances both semantic diversity and structural connectivity without external knowledge. Extensive experiments across diverse datasets and recommendation algorithms demonstrate the superiority of NodeDiffRec, achieving State-of-the-Art (SOTA) performance, with maximum average performance improvement 98.6% in Recall@5 and 84.0% in NDCG@5 over selected baselines.
68. Implicit Spatiotemporal Bandwidth Enhancement Filter by Sine-activated Deep Learning Model for Fast 3D Photoacoustic Tomography
- Authors: I Gede Eka Sulistyawan, Takuro Ishii, Riku Suzuki, Yoshifumi Saijo
- URL: https://arxiv.org/abs/2507.20575
- Abstract:
3D photoacoustic tomography (3D-PAT) using high-frequency hemispherical transducers offers near-omnidirectional reception and enhanced sensitivity to the finer structural details encoded in the high-frequency components of the broadband photoacoustic (PA) signal. However, practical constraints such as limited number of channels with bandlimited sampling rate often result in sparse and bandlimited sensors that degrade image quality. To address this, we revisit the 2D deep learning (DL) approach applied directly to sensor-wise PA radio-frequency (PARF) data. Specifically, we introduce sine activation into the DL model to restore the broadband nature of PARF signals given the observed band-limited and high-frequency PARF data. Given the scarcity of 3D training data, we employ simplified training strategies by simulating random spherical absorbers. This combination of sine-activated model and randomized training is designed to emphasize bandwidth learning over dataset memorization. Our model was evaluated on a leaf skeleton phantom, a micro-CT-verified 3D spiral phantom and in-vivo human palm vasculature. The results showed that the proposed training mechanism on sine-activated model was well-generalized across the different tests by effectively increasing the sensor density and recovering the spatiotemporal bandwidth. Qualitatively, the sine-activated model uniquely enhanced high-frequency content that produces clearer vascular structure with fewer artefacts. Quantitatively, the sine-activated model exhibits full bandwidth at -12 dB spectrum and significantly higher contrast-to-noise ratio with minimal loss of structural similarity index. Lastly, we optimized our approach to enable fast enhanced 3D-PAT at 2 volumes-per-second for better practical imaging of a free-moving targets.
69. DAG-AFL:Directed Acyclic Graph-based Asynchronous Federated Learning
- Authors: Shuaipeng Zhang, Lanju Kong, Yixin Zhang, Wei He, Yongqing Zheng, Han Yu, Lizhen Cui
- URL: https://arxiv.org/abs/2507.20571
- Abstract:
Due to the distributed nature of federated learning (FL), the vulnerability of the global model and the need for coordination among many client devices pose significant challenges. As a promising decentralized, scalable and secure solution, blockchain-based FL methods have attracted widespread attention in recent years. However, traditional consensus mechanisms designed for Proof of Work (PoW) similar to blockchain incur substantial resource consumption and compromise the efficiency of FL, particularly when participating devices are wireless and resource-limited. To address asynchronous client participation and data heterogeneity in FL, while limiting the additional resource overhead introduced by blockchain, we propose the Directed Acyclic Graph-based Asynchronous Federated Learning (DAG-AFL) framework. We develop a tip selection algorithm that considers temporal freshness, node reachability and model accuracy, with a DAG-based trusted verification strategy. Extensive experiments on 3 benchmarking datasets against eight state-of-the-art approaches demonstrate that DAG-AFL significantly improves training efficiency and model accuracy by 22.7% and 6.5% on average, respectively.
70. Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation
- Authors: Hyung Kyu Kim, Hak Gu Kim
- URL: https://arxiv.org/abs/2507.20568
- Abstract:
Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. It highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: this https URL
71. MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
- Authors: Hyung Kyu Kim, Sangmin Lee, Hak Gu Kim
- URL: https://arxiv.org/abs/2507.20562
- Abstract:
Speech-driven 3D facial animation aims to synthesize realistic facial motion sequences from given audio, matching the speaker’s speaking style. However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker which enables realistic and accurate 3D facial motion synthesis by reflecting speaking style only with audio input to maximize usability in applications. Our framework consists of two training stages: 1-stage is storing and retrieving general motion (i.e., Memorizing), and 2-stage is to perform the personalized facial motion synthesis (i.e., Animating) with the motion memory stylized by the audio-driven speaking style feature. In this second stage, our model learns about which facial motion types should be emphasized for a particular piece of audio. As a result, our MemoryTalker can generate a reliable personalized facial animation without additional prior information. With quantitative and qualitative evaluations, as well as user study, we show the effectiveness of our model and its performance enhancement for personalized facial animation over state-of-the-art methods.
72. Enhancing Hallucination Detection via Future Context
- Authors: Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo
- URL: https://arxiv.org/abs/2507.20546
- Abstract:
Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.
73. T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
- Authors: Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
- URL: https://arxiv.org/abs/2507.20536
- Abstract:
Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to repeatedly refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: this https URL.
74. Kimi K2: Open Agentic Intelligence
- Authors: Kimi Team: Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Hao Hu, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Chao Hong, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T.Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Lijun Lu, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Xinjie Sun, Flood Sung, Heyi Tang, Jiawen Tao, Qifeng Teng, Chensi Wang, Dinglu Wang, Feng Wang, Haiming Wang
- URL: https://arxiv.org/abs/2507.20534
- Abstract:
We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual – surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
75. Enhancing Spatial Reasoning through Visual and Textual Thinking
- Authors: Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye
- URL: https://arxiv.org/abs/2507.20529
- Abstract:
The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they are still struggling with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of essential targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, Our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning problems. To effectively support the model’s training, we perform manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization ability, and developing thinking processes with logical reasoning details. Without introducing additional information (such as masks or depth), our model’s overall average level in several spatial understanding tasks has significantly improved compared with other models.
76. The Xeno Sutra: Can Meaning and Value be Ascribed to an AI-Generated “Sacred” Text?
- Authors: Murray Shanahan, Tara Das, Robert Thurman
- URL: https://arxiv.org/abs/2507.20525
- Abstract:
This paper presents a case study in the use of a large language model to generate a fictional Buddhist “sutr”’, and offers a detailed analysis of the resulting text from a philosophical and literary point of view. The conceptual subtlety, rich imagery, and density of allusion found in the text make it hard to causally dismiss on account of its mechanistic origin. This raises questions about how we, as a society, should come to terms with the potentially unsettling possibility of a technology that encroaches on human meaning-making. We suggest that Buddhist philosophy, by its very nature, is well placed to adapt.
77. AQUA: A Large Language Model for Aquaculture & Fisheries
- Authors: Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy
- URL: https://arxiv.org/abs/2507.20520
- Abstract:
Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an Agentic Framework for generating and refining high-quality synthetic data using a combination of expert knowledge, largescale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.
78. LLMs-guided adaptive compensator: Bringing Adaptivity to Automatic Control Systems with Large Language Models
- Authors: Zhongchao Zhou, Yuxi Lu, Yaonan Zhu, Yifan Zhao, Bin He, Liang He, Wenwen Yu, Yusuke Iwasawa
- URL: https://arxiv.org/abs/2507.20509
- Abstract:
With rapid advances in code generation, reasoning, and problem-solving, Large Language Models (LLMs) are increasingly applied in robotics. Most existing work focuses on high-level tasks such as task decomposition. A few studies have explored the use of LLMs in feedback controller design; however, these efforts are restricted to overly simplified systems, fixed-structure gain tuning, and lack real-world validation. To further investigate LLMs in automatic control, this work targets a key subfield: adaptive control. Inspired by the framework of model reference adaptive control (MRAC), we propose an LLM-guided adaptive compensator framework that avoids designing controllers from scratch. Instead, the LLMs are prompted using the discrepancies between an unknown system and a reference system to design a compensator that aligns the response of the unknown system with that of the reference, thereby achieving adaptivity. Experiments evaluate five methods: LLM-guided adaptive compensator, LLM-guided adaptive controller, indirect adaptive control, learning-based adaptive control, and MRAC, on soft and humanoid robots in both simulated and real-world environments. Results show that the LLM-guided adaptive compensator outperforms traditional adaptive controllers and significantly reduces reasoning complexity compared to the LLM-guided adaptive controller. The Lyapunov-based analysis and reasoning-path inspection demonstrate that the LLM-guided adaptive compensator enables a more structured design process by transforming mathematical derivation into a reasoning task, while exhibiting strong generalizability, adaptability, and robustness. This study opens a new direction for applying LLMs in the field of automatic control, offering greater deployability and practicality compared to vision-language models.
79. DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning
- Authors: Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta
- URL: https://arxiv.org/abs/2507.20499
- Abstract:
Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.
80. Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems
- Authors: Tuan Bui, Trong Le, Phat Thai, Sang Nguyen, Minh Hua, Ngan Pham, Thang Bui, Tho Quan
- URL: https://arxiv.org/abs/2507.20491
- Abstract:
Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.
81. Shapley-Value-Based Graph Sparsification for GNN Inference
- Authors: Selahattin Akkas, Ariful Azad
- URL: https://arxiv.org/abs/2507.20460
- Abstract:
Graph sparsification is a key technique for improving inference efficiency in Graph Neural Networks by removing edges with minimal impact on predictions. GNN explainability methods generate local importance scores, which can be aggregated into global scores for graph sparsification. However, many explainability methods produce only non-negative scores, limiting their applicability for sparsification. In contrast, Shapley value based methods assign both positive and negative contributions to node predictions, offering a theoretically robust and fair allocation of importance by evaluating many subsets of graphs. Unlike gradient-based or perturbation-based explainers, Shapley values enable better pruning strategies that preserve influential edges while removing misleading or adversarial connections. Our approach shows that Shapley value-based graph sparsification maintains predictive performance while significantly reducing graph complexity, enhancing both interpretability and efficiency in GNN inference.