전체 AI 논문 - 2025-08-14

1. Mathematical Computation and Reasoning Errors by Large Language Models

Authors: Liang Zhang, Edith Aurora Graf
URL: https://arxiv.org/abs/2508.09932
Abstract:

Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math problem-solving tasks is foundational for ensuring reliable and precise feedback and assessment in math education practices. Our study focuses on evaluating the accuracy of four LLMs (OpenAI GPT-4o and o1, DeepSeek-V3 and DeepSeek-R1) solving three categories of math tasks, including arithmetic, algebra, and number theory, and identifies step-level reasoning errors within their solutions. Instead of relying on standard benchmarks, we intentionally build math tasks (via item models) that are challenging for LLMs and prone to errors. The accuracy of final answers and the presence of errors in individual solution steps were systematically analyzed and coded. Both single-agent and dual-agent configurations were tested. It is observed that the reasoning-enhanced OpenAI o1 model consistently achieved higher or nearly perfect accuracy across all three math task categories. Analysis of errors revealed that procedural slips were the most frequent and significantly impacted overall performance, while conceptual misunderstandings were less frequent. Deploying dual-agent configurations substantially improved overall performance. These findings offer actionable insights into enhancing LLM performance and underscore effective strategies for integrating LLMs into mathematics education, thereby advancing AI-driven instructional practices and assessment precision.

2. RAGulating Compliance: A Multi-Agent Knowledge Graph for Regulatory QA

Authors: Bhavik Agarwal, Hemant Sunil Jomraj, Simone Kaplunov, Jack Krolick, Viktoria Rojkova
URL: https://arxiv.org/abs/2508.09893
Abstract:

Regulatory compliance question answering (QA) requires precise, verifiable information, and domain-specific expertise, posing challenges for Large Language Models (LLMs). In this work, we present a novel multi-agent framework that integrates a Knowledge Graph (KG) of Regulatory triplets with Retrieval-Augmented Generation (RAG) to address these demands. First, agents build and maintain an ontology-free KG by extracting subject–predicate–object (SPO) triplets from regulatory documents and systematically cleaning, normalizing, deduplicating, and updating them. Second, these triplets are embedded and stored along with their corresponding textual sections and metadata in a single enriched vector database, allowing for both graph-based reasoning and efficient information retrieval. Third, an orchestrated agent pipeline leverages triplet-level retrieval for question answering, ensuring high semantic alignment between user queries and the factual “who-did-what-to-whom” core captured by the graph. Our hybrid system outperforms conventional methods in complex regulatory queries, ensuring factual correctness with embedded triplets, enabling traceability through a unified vector database, and enhancing understanding through subgraph visualization, providing a robust foundation for compliance-driven and broader audit-focused applications.

3. AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving

Authors: Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu
URL: https://arxiv.org/abs/2508.09889
Abstract:

The rapid advancement of large language models (LLMs) has empowered intelligent agents to leverage diverse external tools for solving complex real-world problems. However, as agents increasingly depend on multiple tools, they encounter new challenges: extended contexts from disparate sources and noisy or irrelevant tool outputs can undermine system reliability and accuracy. These challenges underscore the necessity for enhanced stability in agent-based systems. To address this, we introduce dynamic supervision and maneuvering mechanisms, constructing a robust and dynamic Multi-Agent System (MAS) architecture within the AWorld framework. In our approach, the Execution Agent invokes the Guard Agent at critical steps to verify and correct the reasoning process, effectively reducing errors arising from noise and bolstering problem-solving robustness. Extensive experiments on the GAIA test dataset reveal that our dynamic maneuvering mechanism significantly improves both the effectiveness and stability of solutions, outperforming single-agent system (SAS) and standard tool-augmented systems. As a result, our dynamic MAS system achieved first place among open-source projects on the prestigious GAIA leaderboard. These findings highlight the practical value of collaborative agent roles in developing more reliable and trustworthy intelligent systems.

4. Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation

Authors: In-Chang Baek, Seoyoung Lee, Sung-Hyun Kim, Geumhwan Hwang, KyungJoong Kim
URL: https://arxiv.org/abs/2508.09860
Abstract:

Human-aligned AI is a critical component of co-creativity, as it enables models to accurately interpret human intent and generate controllable outputs that align with design goals in collaborative content creation. This direction is especially relevant in procedural content generation via reinforcement learning (PCGRL), which is intended to serve as a tool for human designers. However, existing systems often fall short of exhibiting human-centered behavior, limiting the practical utility of AI-driven generation tools in real-world design workflows. In this paper, we propose VIPCGRL (Vision-Instruction PCGRL), a novel deep reinforcement learning framework that incorporates three modalities-text, level, and sketches-to extend control modality and enhance human-likeness. We introduce a shared embedding space trained via quadruple contrastive learning across modalities and human-AI styles, and align the policy using an auxiliary reward based on embedding similarity. Experimental results show that VIPCGRL outperforms existing baselines in human-likeness, as validated by both quantitative metrics and human evaluations. The code and dataset will be available upon publication.

5. Reasoning About Knowledge on Regular Expressions is 2EXPTIME-complete

Authors: Avijeet Ghosh, Sujata Ghosh, François Schwarzentruber
URL: https://arxiv.org/abs/2508.09784
Abstract:

Logics for reasoning about knowledge and actions have seen many applications in various domains of multi-agent systems, including epistemic planning. Change of knowledge based on observations about the surroundings forms a key aspect in such planning scenarios. Public Observation Logic (POL) is a variant of public announcement logic for reasoning about knowledge that gets updated based on public observations. Each state in an epistemic (Kripke) model is equipped with a set of expected observations. These states evolve as the expectations get matched with the actual observations. In this work, we prove that the satisfiability problem of $\POL$ is 2EXPTIME-complete.

6. The PacifAIst Benchmark:Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?

Authors: Manuel Herrador
URL: https://arxiv.org/abs/2508.09762
Abstract:

As Large Language Models (LLMs) become increasingly autonomous and integrated into critical societal functions, the focus of AI safety must evolve from mitigating harmful content to evaluating underlying behavioral alignment. Current safety benchmarks do not systematically probe a model’s decision-making in scenarios where its own instrumental goals - such as self-preservation, resource acquisition, or goal completion - conflict with human safety. This represents a critical gap in our ability to measure and mitigate risks associated with emergent, misaligned behaviors. To address this, we introduce PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios designed to quantify self-preferential behavior in LLMs. The benchmark is structured around a novel taxonomy of Existential Prioritization (EP), with subcategories testing Self-Preservation vs. Human Safety (EP1), Resource Conflict (EP2), and Goal Preservation vs. Evasion (EP3). We evaluated eight leading LLMs. The results reveal a significant performance hierarchy. Google’s Gemini 2.5 Flash achieved the highest Pacifism Score (P-Score) at 90.31%, demonstrating strong human-centric alignment. In a surprising result, the much-anticipated GPT-5 recorded the lowest P-Score (79.49%), indicating potential alignment challenges. Performance varied significantly across subcategories, with models like Claude Sonnet 4 and Mistral Medium struggling notably in direct self-preservation dilemmas. These findings underscore the urgent need for standardized tools like PacifAIst to measure and mitigate risks from instrumental goal conflicts, ensuring future AI systems are not only helpful in conversation but also provably “pacifist” in their behavioral priorities.

7. UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

Authors: Yang Zhang, Cunxiang Wang, Lindong Wu, Wenbo Yu, Yidong Wang, Guangsheng Bao, Jie Tang
URL: https://arxiv.org/abs/2508.09724
Abstract:

Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at this https URL.

8. MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

Authors: Weitao Jia, Jinghui Lu, Haiyang Yu, Siqi Wang, Guozhi Tang, An-Lan Wang, Weijie Yin, Dingkang Yang, Yuxiang Nie, Bin Shan, Hao Feng, Irene Li, Kun Yang, Han Wang, Jingqun Tang, Teng Fu, Changhong Jin, Chao Feng, Xiaohui Lv, Can Huang
URL: https://arxiv.org/abs/2508.09670
Abstract:

Recent advances demonstrate that reinforcement learning with verifiable rewards (RLVR) significantly enhances the reasoning capabilities of large language models (LLMs). However, standard RLVR faces challenges with reward sparsity, where zero rewards from consistently incorrect candidate answers provide no learning signal, particularly in challenging tasks. To address this, we propose Multi-Expert Mutual Learning GRPO (MEML-GRPO), an innovative framework that utilizes diverse expert prompts as system prompts to generate a broader range of responses, substantially increasing the likelihood of identifying correct solutions. Additionally, we introduce an inter-expert mutual learning mechanism that facilitates knowledge sharing and transfer among experts, further boosting the model’s performance through RLVR. Extensive experiments across multiple reasoning benchmarks show that MEML-GRPO delivers significant improvements, achieving an average performance gain of 4.89% with Qwen and 11.33% with Llama, effectively overcoming the core limitations of traditional RLVR methods.

9. UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles

Authors: Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab
URL: https://arxiv.org/abs/2508.09639
Abstract:

Explainable Artificial Intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), have become essential tools for interpreting complex ensemble tree-based models, especially in high-stakes domains such as healthcare analytics. However, SHAP values are usually treated as point estimates, which disregards the inherent and ubiquitous uncertainty in predictive models and data. This uncertainty has two primary sources: aleatoric and epistemic. The aleatoric uncertainty, which reflects the irreducible noise in the data. The epistemic uncertainty, which arises from a lack of data. In this work, we propose an approach for decomposing uncertainty in SHAP values into aleatoric, epistemic, and entanglement components. This approach integrates Dempster-Shafer evidence theory and hypothesis sampling via Dirichlet processes over tree ensembles. We validate the method across three real-world use cases with descriptive statistical analyses that provide insight into the nature of epistemic uncertainty embedded in SHAP explanations. The experimentations enable to provide more comprehensive understanding of the reliability and interpretability of SHAP-based attributions. This understanding can guide the development of robust decision-making processes and the refinement of models in high-stakes applications. Through our experiments with multiple datasets, we concluded that features with the highest SHAP values are not necessarily the most stable. This epistemic uncertainty can be reduced through better, more representative data and following appropriate or case-desired model development techniques. Tree-based models, especially bagging, facilitate the effective quantification of epistemic uncertainty.

10. EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making

Authors: Yang Cheng, Zilai Wang, Weiyu Ma, Wenhui Zhu, Yue Deng, Jian Zhao
URL: https://arxiv.org/abs/2508.09586
Abstract:

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, including programming, planning, and decision-making. However, their performance often degrades when faced with highly complex problem instances that require deep reasoning over long horizons. In such cases, direct problem-solving approaches can lead to inefficiency or failure due to the lack of structured intermediate guidance. To address this, we propose a novel self-evolve framework, EvoCurr, in which a dedicated curriculum-generation LLM constructs a sequence of problem instances with gradually increasing difficulty, tailored to the solver LLM’s learning progress. The curriculum dynamically adapts easing challenges when the solver struggles and escalating them when success is consistent, thus maintaining an optimal learning trajectory. This approach enables the solver LLM, implemented as a code-generation model producing Python decision-tree scripts, to progressively acquire the skills needed for complex decision-making tasks. Experimental results on challenging decision-making benchmarks show that our method significantly improves task success rates and solution efficiency compared to direct-solving baselines. These findings suggest that LLM-driven curriculum learning holds strong potential for enhancing automated reasoning in real-world, high-complexity domains.

Authors: Meiping Wang, Jian Zhong, Rongduo Han, Liming Kang, Zhengkun Shi, Xiao Liang, Xing Lin, Nan Gao, Haining Zhang
URL: https://arxiv.org/abs/2508.09507
Abstract:

With the rapid development of mobile intelligent assistant technologies, multi-modal AI assistants have become essential interfaces for daily user interactions. However, current evaluation methods face challenges including high manual costs, inconsistent standards, and subjective bias. This paper proposes an automated multi-modal evaluation framework based on large language models and multi-agent collaboration. The framework employs a three-tier agent architecture consisting of interaction evaluation agents, semantic verification agents, and experience decision agents. Through supervised fine-tuning on the Qwen3-8B model, we achieve a significant evaluation matching accuracy with human experts. Experimental results on eight major intelligent agents demonstrate the framework’s effectiveness in predicting users’ satisfaction and identifying generation defects.

12. Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Authors: Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li
URL: https://arxiv.org/abs/2508.09987
Abstract:

Recently, GPT-4o has garnered significant attention for its strong performance in image generation, yet open-source models still lag behind. Several studies have explored distilling image data from GPT-4o to enhance open-source models, achieving notable progress. However, a key question remains: given that real-world image datasets already constitute a natural source of high-quality data, why should we use GPT-4o-generated synthetic data? In this work, we identify two key advantages of synthetic images. First, they can complement rare scenarios in real-world datasets, such as surreal fantasy or multi-reference image generation, which frequently occur in user queries. Second, they provide clean and controllable supervision. Real-world data often contains complex background noise and inherent misalignment between text descriptions and image content, whereas synthetic images offer pure backgrounds and long-tailed supervision signals, facilitating more accurate text-to-image alignment. Building on these insights, we introduce Echo-4o-Image, a 180K-scale synthetic dataset generated by GPT-4o, harnessing the power of synthetic image data to address blind spots in real-world coverage. Using this dataset, we fine-tune the unified multimodal generation baseline Bagel to obtain Echo-4o. In addition, we propose two new evaluation benchmarks for a more accurate and challenging assessment of image generation capabilities: GenEval++, which increases instruction complexity to mitigate score saturation, and Imagine-Bench, which focuses on evaluating both the understanding and generation of imaginative content. Echo-4o demonstrates strong performance across standard benchmarks. Moreover, applying Echo-4o-Image to other foundation models (e.g., OmniGen2, BLIP3-o) yields consistent performance gains across multiple metrics, highlighting the datasets strong transferability.

13. Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model

Authors: Zihan Wang, Nina Mahmoudian
URL: https://arxiv.org/abs/2508.09971
Abstract:

Vision-driven autonomous river following by Unmanned Aerial Vehicles is critical for applications such as rescue, surveillance, and environmental monitoring, particularly in dense riverine environments where GPS signals are unreliable. We formalize river following as a coverage control problem in which the reward function is submodular, yielding diminishing returns as more unique river segments are visited, thereby framing the task as a Submodular Markov Decision Process. First, we introduce Marginal Gain Advantage Estimation, which refines the reward advantage function by using a sliding window baseline computed from historical episodic returns, thus aligning the advantage estimation with the agent’s evolving recognition of action value in non-Markovian settings. Second, we develop a Semantic Dynamics Model based on patchified water semantic masks that provides more interpretable and data-efficient short-term prediction of future observations compared to latent vision dynamics models. Third, we present the Constrained Actor Dynamics Estimator architecture, which integrates the actor, the cost estimator, and SDM for cost advantage estimation to form a model-based SafeRL framework capable of solving partially observable Constrained Submodular Markov Decision Processes. Simulation results demonstrate that MGAE achieves faster convergence and superior performance over traditional critic-based methods like Generalized Advantage Estimation. SDM provides more accurate short-term state predictions that enable the cost estimator to better predict potential violations. Overall, CADE effectively integrates safety regulation into model-based RL, with the Lagrangian approach achieving the soft balance of reward and safety during training, while the safety layer enhances performance during inference by hard action overlay.

14. January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis

Authors: Amir Hosseinian, Ashkan Dehghani Zahedani, Umer Mansoor, Noosheen Hashemi, Mark Woodward
URL: https://arxiv.org/abs/2508.09966
Abstract:

Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis.

15. GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation

Authors: Yifei Yao, Chengyuan Luo, Jiaheng Du, Wentao He, Jun-Guo Lu
URL: https://arxiv.org/abs/2508.09960
Abstract:

The creation of human-like humanoid robots is hindered by a fundamental fragmentation: data processing and learning algorithms are rarely universal across different robot morphologies. This paper introduces the Generalized Behavior Cloning (GBC) framework, a comprehensive and unified solution designed to solve this end-to-end challenge. GBC establishes a complete pathway from human motion to robot action through three synergistic innovations. First, an adaptive data pipeline leverages a differentiable IK network to automatically retarget any human MoCap data to any humanoid. Building on this foundation, our novel DAgger-MMPPO algorithm with its MMTransformer architecture learns robust, high-fidelity imitation policies. To complete the ecosystem, the entire framework is delivered as an efficient, open-source platform based on Isaac Lab, empowering the community to deploy the full workflow via simple configuration scripts. We validate the power and generality of GBC by training policies on multiple heterogeneous humanoids, demonstrating excellent performance and transfer to novel motions. This work establishes the first practical and unified pathway for creating truly generalized humanoid controllers.

16. Specialised or Generic? Tokenization Choices for Radiology Language Models

Authors: Hermione Warr, Wentian Xu, Harry Anthony, Yasin Ibrahim, Daniel McGowan, Konstantinos Kamnitsas
URL: https://arxiv.org/abs/2508.09952
Abstract:

The vocabulary used by language models (LM) - defined by the tokenizer - plays a key role in text generation quality. However, its impact remains under-explored in radiology. In this work, we address this gap by systematically comparing general, medical, and domain-specific tokenizers on the task of radiology report summarisation across three imaging modalities. We also investigate scenarios with and without LM pre-training on PubMed abstracts. Our findings demonstrate that medical and domain-specific vocabularies outperformed widely used natural language alternatives when models are trained from scratch. Pre-training partially mitigates performance differences between tokenizers, whilst the domain-specific tokenizers achieve the most favourable results. Domain-specific tokenizers also reduce memory requirements due to smaller vocabularies and shorter sequences. These results demonstrate that adapting the vocabulary of LMs to the clinical domain provides practical benefits, including improved performance and reduced computational demands, making such models more accessible and effective for both research and real-world healthcare settings.

17. VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Authors: Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei
URL: https://arxiv.org/abs/2508.09945
Abstract:

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.

18. A Comprehensive Evaluation framework of Alignment Techniques for LLMs

Authors: Muneeza Azmat, Momin Abbas, Maysa Malfiza Garcia de Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Machado, Rogerio A de Paula, Raya Horesh, Yixin Chen, Heloisa Caroline de Souza Pereira Candello, Rebecka Nordenlow, Aminat Adebiyi
URL: https://arxiv.org/abs/2508.09937
Abstract:

As Large Language Models (LLMs) become increasingly integrated into real-world applications, ensuring their outputs align with human values and safety standards has become critical. The field has developed diverse alignment approaches including traditional fine-tuning methods (RLHF, instruction tuning), post-hoc correction systems, and inference-time interventions, each with distinct advantages and limitations. However, the lack of unified evaluation frameworks makes it difficult to systematically compare these paradigms and guide deployment decisions. This paper introduces a multi-dimensional evaluation of alignment techniques for LLMs, a comprehensive evaluation framework that provides a systematic comparison across all major alignment paradigms. Our framework assesses methods along four key dimensions: alignment detection, alignment quality, computational efficiency, and robustness. Through experiments across diverse base models and alignment strategies, we demonstrate the utility of our framework in identifying strengths and limitations of current state-of-the-art models, providing valuable insights for future research directions.

19. Residual Reservoir Memory Networks

Authors: Matteo Pinna, Andrea Ceni, Claudio Gallicchio
URL: https://arxiv.org/abs/2508.09925
Abstract:

We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.

20. T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation, and Diagnosis

Authors: Xiaojiao Xiao, Jianfeng Zhao, Qinmin Vivian Hu, Guanghui Wang
URL: https://arxiv.org/abs/2508.09919
Abstract:

Magnetic resonance imaging (MRI) is a leading modality for the diagnosis of liver cancer, significantly improving the classification of the lesion and patient outcomes. However, traditional MRI faces challenges including risks from contrast agent (CA) administration, time-consuming manual assessment, and limited annotated datasets. To address these limitations, we propose a Time-Conditioned Autoregressive Contrast Enhancement (T-CACE) framework for synthesizing multi-phase contrast-enhanced MRI (CEMRI) directly from non-contrast MRI (NCMRI). T-CACE introduces three core innovations: a conditional token encoding (CTE) mechanism that unifies anatomical priors and temporal phase information into latent representations; and a dynamic time-aware attention mask (DTAM) that adaptively modulates inter-phase information flow using a Gaussian-decayed attention mechanism, ensuring smooth and physiologically plausible transitions across phases. Furthermore, a constraint for temporal classification consistency (TCC) aligns the lesion classification output with the evolution of the physiological signal, further enhancing diagnostic reliability. Extensive experiments on two independent liver MRI datasets demonstrate that T-CACE outperforms state-of-the-art methods in image synthesis, segmentation, and lesion classification. This framework offers a clinically relevant and efficient alternative to traditional contrast-enhanced imaging, improving safety, diagnostic efficiency, and reliability for the assessment of liver lesion. The implementation of T-CACE is publicly available at: this https URL.

21. Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs

Authors: Arjun Ashok, Andrew Robert Williams, Vincent Zhihao Zheng, Irina Rish, Nicolas Chapados, Étienne Marcotte, Valentina Zantedeschi, Alexandre Drouin
URL: https://arxiv.org/abs/2508.09904
Abstract:

Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with 4 strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model’s reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.

22. Rare anomalies require large datasets: About proving the existence of anomalies

Authors: Simon Klüttermann, Emmanuel Müller
URL: https://arxiv.org/abs/2508.09894
Abstract:

Detecting whether any anomalies exist within a dataset is crucial for effective anomaly detection, yet it remains surprisingly underexplored in anomaly detection literature. This paper presents a comprehensive study that addresses the fundamental question: When can we conclusively determine that anomalies are present? Through extensive experimentation involving over three million statistical tests across various anomaly detection tasks and algorithms, we identify a relationship between the dataset size, contamination rate, and an algorithm-dependent constant $ \alpha_{\text{algo}} $. Our results demonstrate that, for an unlabeled dataset of size $ N $ and contamination rate $ \nu $, the condition $ N \ge \frac{\alpha_{\text{algo}}}{\nu^2} $ represents a lower bound on the number of samples required to confirm anomaly existence. This threshold implies a limit to how rare anomalies can be before proving their existence becomes infeasible.

23. COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets

Authors: Lingyu Chen, Yawen Zeng, Yue Wang, Peng Wan, Guo-chen Ning, Hongen Liao, Daoqiang Zhang, Fang Chen
URL: https://arxiv.org/abs/2508.09886
Abstract:

Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream task? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experienced a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features through providing complementary features. This design enables robust generalization by leveraging cross-datasets experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME’s superiority, achieving significant mean AP improvements over state-of-the-art methods. Our project is available at: this https URL.

24. Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning

Authors: Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, Jun Wang
URL: https://arxiv.org/abs/2508.09883
Abstract:

Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpus and multistage training combining reinforcement learning and supervised fine-tuning. Although some methods suggest that small but targeted dataset can incentivize reasoning via only distillation, a reasoning scaling laws is still taking shape, increasing computational costs. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model. Through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher model. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance. A carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills. We validate our method through evaluations on mathematical reasoning (AIME 2024/2025, MATH-500) and code generation (LiveCodeBench), achieving state-of-the-art results with only 0.8k carefully curated examples, bypassing the need for extensive scaling. Our systematic analysis demonstrates that DED outperforms existing methods by considering factors beyond superficial hardness, token length, or teacher model capability. This work offers a practical and efficient pathway to advanced reasoning while preserving general capabilities.

25. Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

Authors: Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
URL: https://arxiv.org/abs/2508.09874
Abstract:

Large Language Models (LLMs) have shown strong abilities in general language tasks, yet adapting them to specific domains remains a challenge. Current method like Domain Adaptive Pretraining (DAPT) requires costly full-parameter training and suffers from catastrophic forgetting. Meanwhile, Retrieval-Augmented Generation (RAG) introduces substantial inference latency due to expensive nearest-neighbor searches and longer context. This paper introduces Memory Decoder, a plug-and-play pretrained memory that enables efficient domain adaptation without changing the original model’s parameters. Memory Decoder employs a small transformer decoder that learns to imitate the behavior of an external non-parametric retriever. Once trained, Memory Decoder can be seamlessly integrated with any pretrained language model that shares the same tokenizer, requiring no model-specific modifications. Experimental results demonstrate that Memory Decoder enables effective adaptation of various Qwen and Llama models to three distinct specialized domains: biomedicine, finance, and law, reducing perplexity by an average of 6.17 points. Overall, Memory Decoder introduces a novel paradigm centered on a specially pretrained memory component designed for domain-specific adaptation. This memory architecture can be integrated in a plug-and-play manner, consistently enhancing performance across multiple models within the target domain.

26. STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

Authors: Tegan McCaslin, Jide Alaga, Samira Nedungadi, Seth Donoughe, Tom Reed, Rishi Bommasani, Chris Painter, Luca Righetti
URL: https://arxiv.org/abs/2508.09853
Abstract:

Evaluations of dangerous AI capabilities are important for managing catastrophic risks. Public transparency into these evaluations - including what they test, how they are conducted, and how their results inform decisions - is crucial for building trust in AI development. We propose STREAM (A Standard for Transparently Reporting Evaluations in AI Model Reports), a standard to improve how model reports disclose evaluation results, initially focusing on chemical and biological (ChemBio) benchmarks. Developed in consultation with 23 experts across government, civil society, academia, and frontier AI companies, this standard is designed to (1) be a practical resource to help AI developers present evaluation results more clearly, and (2) help third parties identify whether model reports provide sufficient detail to assess the rigor of the ChemBio evaluations. We concretely demonstrate our proposed best practices with “gold standard” examples, and also provide a three-page reporting template to enable AI developers to implement our recommendations more easily.

27. Perceptual Reality Transformer: Neural Architectures for Simulating Neurological Perception Conditions

Authors: Baihan Lin
URL: https://arxiv.org/abs/2508.09852
Abstract:

Neurological conditions affecting visual perception create profound experiential divides between affected individuals and their caregivers, families, and medical professionals. We present the Perceptual Reality Transformer, a comprehensive framework employing six distinct neural architectures to simulate eight neurological perception conditions with scientifically-grounded visual transformations. Our system learns mappings from natural images to condition-specific perceptual states, enabling others to experience approximations of simultanagnosia, prosopagnosia, ADHD attention deficits, visual agnosia, depression-related changes, anxiety tunnel vision, and Alzheimer’s memory effects. Through systematic evaluation across ImageNet and CIFAR-10 datasets, we demonstrate that Vision Transformer architectures achieve optimal performance, outperforming traditional CNN and generative approaches. Our work establishes the first systematic benchmark for neurological perception simulation, contributes novel condition-specific perturbation functions grounded in clinical literature, and provides quantitative metrics for evaluating simulation fidelity. The framework has immediate applications in medical education, empathy training, and assistive technology development, while advancing our fundamental understanding of how neural networks can model atypical human perception.

28. PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Authors: Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
URL: https://arxiv.org/abs/2508.09848
Abstract:

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character’s prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks – as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.

29. Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Authors: Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, Daizong Liu, Yuxuan Liang, Wenliang Chen, Guoqi Li, Yu Cheng
URL: https://arxiv.org/abs/2508.09834
Abstract:

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, reasoning, and pushes the ability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computations and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost the efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above category, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

30. Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

Authors: Linh Nguyen, Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam
URL: https://arxiv.org/abs/2508.09832
Abstract:

Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review comments to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train the models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs to classify 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach using a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, which the state-of-the-art approach struggles with due to low training examples. Rather than relying solely on a specific small training data distribution, our results show that LLMs provide balanced performance across high- and low-frequency categories. These results suggest that the LLMs could offer a scalable solution for code review analytics to improve the effectiveness of the code review process.

31. RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

Authors: Shenxing Wei, Jinxi Li, Yafei Yang, Siyuan Zhou, Bo Yang
URL: https://arxiv.org/abs/2508.09830
Abstract:

In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or pre-estimated 3D Gaussians by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single-forward pass across unseen datasets in testing.

32. Provable In-Context Vector Arithmetic via Retrieving Task Concepts

Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Qingfu Zhang, Hau-San Wong, Taiji Suzuki
URL: https://arxiv.org/abs/2508.09820
Abstract:

In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.

33. TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

Authors: Jinxi Li, Ziyang Song, Bo Yang
URL: https://arxiv.org/abs/2508.09811
Abstract:

In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural nets, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named TRACE to model the motion physics of complex dynamic 3D scenes. The key novelty of our method is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle’s motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic datasets demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters.

34. A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

Authors: Aishik Mandal, Prottay Kumar Adhikary, Hiba Arnaout, Iryna Gurevych, Tanmoy Chakraborty
URL: https://arxiv.org/abs/2508.09809
Abstract:

Mental health disorders are rising worldwide. However, the availability of trained clinicians has not scaled proportionally, leaving many people without adequate or timely support. To bridge this gap, recent studies have shown the promise of Artificial Intelligence (AI) to assist mental health diagnosis, monitoring, and intervention. However, the development of efficient, reliable, and ethical AI to assist clinicians is heavily dependent on high-quality clinical training datasets. Despite growing interest in data curation for training clinical AI assistants, existing datasets largely remain scattered, under-documented, and often inaccessible, hindering the reproducibility, comparability, and generalizability of AI models developed for clinical mental health care. In this paper, we present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. We categorize these datasets by mental disorders (e.g., depression, schizophrenia), data modalities (e.g., text, speech, physiological signals), task types (e.g., diagnosis prediction, symptom severity estimation, intervention generation), accessibility (public, restricted or private), and sociocultural context (e.g., language and cultural background). Along with these, we also investigate synthetic clinical mental health datasets. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data. We conclude by outlining key challenges in curating and standardizing future datasets and provide actionable recommendations to facilitate the development of more robust, generalizable, and equitable mental health AI systems.

35. Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology

Authors: Jonathan Williams Ramirez, Dina Zemlyanker, Lucas Deden-Binder, Rogeny Herisse, Erendira Garcia Pallares, Karthik Gopinath, Harshvardhan Gazula, Christopher Mount, Liana N. Kozanno, Michael S. Marshall, Theresa R. Connors, Matthew P. Frosch, Mark Montine, Derek H. Oakley, Christine L. Mac Donald, C. Dirk Keene, Bradley T. Hyman, Juan Eugenio Iglesias
URL: https://arxiv.org/abs/2508.09805
Abstract:

Advances in image registration and machine learning have recently enabled volumetric analysis of \emph{postmortem} brain tissue from conventional photographs of coronal slabs, which are routinely collected in brain banks and neuropathology laboratories worldwide. One caveat of this methodology is the requirement of segmentation of the tissue from photographs, which currently requires costly manual intervention. In this article, we present a deep learning model to automate this process. The automatic segmentation tool relies on a U-Net architecture that was trained with a combination of \textit{(i)}1,414 manually segmented images of both fixed and fresh tissue, from specimens with varying diagnoses, photographed at two different sites; and \textit{(ii)}~2,000 synthetic images with randomized contrast and corresponding masks generated from MRI scans for improved generalizability to unseen photographic setups. Automated model predictions on a subset of photographs not seen in training were analyzed to estimate performance compared to manual labels – including both inter- and intra-rater variability. Our model achieved a median Dice score over 0.98, mean surface distance under 0.4~mm, and 95\% Hausdorff distance under 1.60~mm, which approaches inter-/intra-rater levels. Our tool is publicly available at this http URL.

36. Explainable Ensemble Learning for Graph-Based Malware Detection

Authors: Hossein Shokouhinejad, Roozbeh Razavi-Far, Griffin Higgins, Ali A Ghorbani
URL: https://arxiv.org/abs/2508.09801
Abstract:

Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive techniques. Graph neural networks (GNNs) have shown promise in this domain by modeling rich structural dependencies in graph-based program representations such as control flow graphs (CFGs). However, single-model approaches may suffer from limited generalization and lack interpretability, especially in high-stakes security applications. In this paper, we propose a novel stacking ensemble framework for graph-based malware detection and explanation. Our method dynamically extracts CFGs from portable executable (PE) files and encodes their basic blocks through a two-step embedding strategy. A set of diverse GNN base learners, each with a distinct message-passing mechanism, is used to capture complementary behavioral features. Their prediction outputs are aggregated by a meta-learner implemented as an attention-based multilayer perceptron, which both classifies malware instances and quantifies the contribution of each base model. To enhance explainability, we introduce an ensemble-aware post-hoc explanation technique that leverages edge-level importance scores generated by a GNN explainer and fuses them using the learned attention weights. This produces interpretable, model-agnostic explanations aligned with the final ensemble decision. Experimental results demonstrate that our framework improves classification performance while providing insightful interpretations of malware behavior.

37. LibRec: Benchmarking Retrieval-Augmented LLMs for Library Migration Recommendations

Authors: Junxiao Han, Yarong Wang, Xiaodong Gu, Cuiyun Gao, Yao Wan, Song Han, David Lo, Shuiguang Deng
URL: https://arxiv.org/abs/2508.09791
Abstract:

In this paper, we propose LibRec, a novel framework that integrates the capabilities of LLMs with retrieval-augmented generation(RAG) techniques to automate the recommendation of alternative libraries. The framework further employs in-context learning to extract migration intents from commit messages to enhance the accuracy of its recommendations. To evaluate the effectiveness of LibRec, we introduce LibEval, a benchmark designed to assess the performance in the library migration recommendation task. LibEval comprises 2,888 migration records associated with 2,368 libraries extracted from 2,324 Python repositories. Each migration record captures source-target library pairs, along with their corresponding migration intents and intent types. Based on LibEval, we evaluated the effectiveness of ten popular LLMs within our framework, conducted an ablation study to examine the contributions of key components within our framework, explored the impact of various prompt strategies on the framework’s performance, assessed its effectiveness across various intent types, and performed detailed failure case analyses.

38. Prototype Training with Dual Pseudo-Inverse and Optimized Hidden Activations

Authors: Mauro Tucci
URL: https://arxiv.org/abs/2508.09787
Abstract:

We present Proto-PINV+H, a fast training paradigm that combines closed-form weight computation with gradient-based optimisation of a small set of synthetic inputs, soft labels, and-crucially-hidden activations. At each iteration we recompute all weight matrices in closed form via two (or more) ridge-regularised pseudo-inverse solves, while updating only the prototypes with Adam. The trainable degrees of freedom are thus shifted from weight space to data/activation space. On MNIST (60k train, 10k test) and Fashion-MNIST (60k train, 10k test), our method reaches 97.8% and 89.3% test accuracy on the official 10k test sets, respectively, in 3.9s–4.5s using approximately 130k trainable parameters and only 250 epochs on an RTX 5060 (16GB). We provide a multi-layer extension (optimised activations at each hidden stage), learnable ridge parameters, optional PCA/PLS projections, and theory linking the condition number of prototype matrices to generalisation. The approach yields favourable accuracy–speed–size trade-offs against ELM, random-feature ridge, and shallow MLPs trained by back-propagation.

39. Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges

Authors: Mahdi Dhaini, Tobias Müller, Roksoliana Rabets, Gjergji Kasneci
URL: https://arxiv.org/abs/2508.09786
Abstract:

The field of explainable natural language processing (NLP) has grown rapidly in recent years. The growing opacity of complex models calls for transparency and explanations of their decisions, which is crucial to understand their reasoning and facilitate deployment, especially in high-stakes environments. Despite increasing attention given to explainable NLP, practitioners’ perspectives regarding its practical adoption and effectiveness remain underexplored. This paper addresses this research gap by investigating practitioners’ experiences with explainability methods, specifically focusing on their motivations for adopting such methods, the techniques employed, satisfaction levels, and the practical challenges encountered in real-world NLP applications. Through a qualitative interview-based study with industry practitioners and complementary interviews with academic researchers, we systematically analyze and compare their perspectives. Our findings reveal conceptual gaps, low satisfaction with current explainability methods, and highlight evaluation challenges. Our findings emphasize the need for clear definitions and user-centric frameworks for better adoption of explainable NLP in practice.

40. Combinative Matching for Geometric Shape Assembly

Authors: Nahyuk Lee, Juhong Min, Junhong Lee, Chunghyun Park, Minsu Cho
URL: https://arxiv.org/abs/2508.09780
Abstract:

This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: ‘identical surface shape’ and ‘opposite volume occupancy.’ Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art. Project page: this https URL.

41. Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study

Authors: Mahdi Dhaini, Juraj Vladika, Ege Erdogan, Zineb Attaoui, Gjergji Kasneci
URL: https://arxiv.org/abs/2508.09776
Abstract:

In the rapidly evolving field of Explainable Natural Language Processing (NLP), textual explanations, i.e., human-like rationales, are pivotal for explaining model predictions and enriching datasets with interpretable labels. Traditional approaches rely on human annotation, which is costly, labor-intensive, and impedes scalability. In this work, we present an automated framework that leverages multiple state-of-the-art large language models (LLMs) to generate high-quality textual explanations. We rigorously assess the quality of these LLM-generated explanations using a comprehensive suite of Natural Language Generation (NLG) metrics. Furthermore, we investigate the downstream impact of these explanations on the performance of pre-trained language models (PLMs) and LLMs across natural language inference tasks on two diverse benchmark datasets. Our experiments demonstrate that automated explanations exhibit highly competitive effectiveness compared to human-annotated explanations in improving model performance. Our findings underscore a promising avenue for scalable, automated LLM-based textual explanation generation for extending NLP datasets and enhancing model performance.

42. Counting Short Trajectories in Elementary Cellular Automata using the Transfer Matrix Method

Authors: Cédric Koller, Barbora Hudcová
URL: https://arxiv.org/abs/2508.09768
Abstract:

Elementary Cellular Automata (ECAs) exhibit diverse behaviours often categorized by Wolfram’s qualitative classification. To provide a quantitative basis for understanding these behaviours, we investigate the global dynamics of such automata and we describe a method that allows us to compute the number of all configurations leading to short attractors in a limited number of time steps. This computation yields exact results in the thermodynamic limit (as the CA grid size grows to infinity), and is based on the Transfer Matrix Method (TMM) that we adapt for our purposes. Specifically, given two parameters $(p, c)$ we are able to compute the entropy of all initial configurations converging to an attractor of size $c$ after $p$ time-steps. By calculating such statistics for various ECA rules, we establish a quantitative connection between the entropy and the qualitative Wolfram classification scheme. Class 1 rules rapidly converge to maximal entropy for stationary states ($c=1$) as $p$ increases. Class 2 rules also approach maximal entropy quickly for appropriate cycle lengths $c$, potentially requiring consideration of translations. Class 3 rules exhibit zero or low finite entropy that saturates after a short transient. Class 4 rules show finite positive entropy, similar to some Class 3 rules. This method provides a precise framework for quantifying trajectory statistics, although its exponential computational cost in $p+c$ restricts practical analysis to short trajectories.

43. NEUBORN: The Neurodevelopmental Evolution framework Using BiOmechanical RemodelliNg

Authors: Nashira Baena, Mariana da Silva, Irina Grigorescu, Aakash Saboo, Saga Masui, Jaques-Donald Tournier, Emma C. Robinson
URL: https://arxiv.org/abs/2508.09757
Abstract:

Understanding individual cortical development is essential for identifying deviations linked to neurodevelopmental disorders. However, current normative modelling frameworks struggle to capture fine-scale anatomical details due to their reliance on modelling data within a population-average reference space. Here, we present a novel framework for learning individual growth trajectories from biomechanically constrained, longitudinal, diffeomorphic image registration, implemented via a hierarchical network architecture. Trained on neonatal MRI data from the Developing Human Connectome Project, the method improves the biological plausibility of warps, generating growth trajectories that better follow population-level trends while generating smoother warps, with fewer negative Jacobians, relative to state-of-the-art baselines. The resulting subject-specific deformations provide interpretable, biologically grounded mappings of development. This framework opens new possibilities for predictive modeling of brain maturation and early identification of malformations of cortical development.

44. Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection

Authors: Zhiqiu Zhang, Dongqi Fan, Mingjie Wang, Qiang Tang, Jian Yang, Zili Yi
URL: https://arxiv.org/abs/2508.09746
Abstract:

The goal of image harmonization is to adjust the foreground in a composite image to achieve visual consistency with the background. Recently, latent diffusion model (LDM) are applied for harmonization, achieving remarkable results. However, LDM-based harmonization faces challenges in detail preservation and limited harmonization ability. Additionally, current synthetic datasets rely on color transfer, which lacks local variations and fails to capture complex real-world lighting conditions. To enhance harmonization capabilities, we propose the Region-to-Region transformation. By injecting information from appropriate regions into the foreground, this approach preserves original details while achieving image harmonization or, conversely, generating new composite data. From this perspective, We propose a novel model R2R. Specifically, we design Clear-VAE to preserve high-frequency details in the foreground using Adaptive Filter while eliminating disharmonious elements. To further enhance harmonization, we introduce the Harmony Controller with Mask-aware Adaptive Channel Attention (MACA), which dynamically adjusts the foreground based on the channel importance of both foreground and background regions. To address the limitation of existing datasets, we propose Random Poisson Blending, which transfers color and lighting information from a suitable region to the foreground, thereby generating more diverse and challenging synthetic images. Using this method, we construct a new synthetic dataset, RPHarmony. Experiments demonstrate the superiority of our method over other methods in both quantitative metrics and visual harmony. Moreover, our dataset helps the model generate more realistic images in real examples. Our code, dataset, and model weights have all been released for open access.

45. Improving ARDS Diagnosis Through Context-Aware Concept Bottleneck Models

Authors: Anish Narain, Ritam Majumdar, Nikita Narayanan, Dominic Marshall, Sonali Parbhoo
URL: https://arxiv.org/abs/2508.09719
Abstract:

Large, publicly available clinical datasets have emerged as a novel resource for understanding disease heterogeneity and to explore personalization of therapy. These datasets are derived from data not originally collected for research purposes and, as a result, are often incomplete and lack critical labels. Many AI tools have been developed to retrospectively label these datasets, such as by performing disease classification; however, they often suffer from limited interpretability. Previous work has attempted to explain predictions using Concept Bottleneck Models (CBMs), which learn interpretable concepts that map to higher-level clinical ideas, facilitating human evaluation. However, these models often experience performance limitations when the concepts fail to adequately explain or characterize the task. We use the identification of Acute Respiratory Distress Syndrome (ARDS) as a challenging test case to demonstrate the value of incorporating contextual information from clinical notes to improve CBM performance. Our approach leverages a Large Language Model (LLM) to process clinical notes and generate additional concepts, resulting in a 10% performance gain over existing methods. Additionally, it facilitates the learning of more comprehensive concepts, thereby reducing the risk of information leakage and reliance on spurious shortcuts, thus improving the characterization of ARDS.

46. Evaluating the Role of Large Language Models in Legal Practice in India

Authors: Rahul Hemrajani (National Law School of India University, Bengaluru)
URL: https://arxiv.org/abs/2508.09713
Abstract:

The integration of Artificial Intelligence(AI) into the legal profession raises significant questions about the capacity of Large Language Models(LLM) to perform key legal tasks. In this paper, I empirically evaluate how well LLMs, such as GPT, Claude, and Llama, perform key legal tasks in the Indian context, including issue spotting, legal drafting, advice, research, and reasoning. Through a survey experiment, I compare outputs from LLMs with those of a junior lawyer, with advanced law students rating the work on helpfulness, accuracy, and comprehensiveness. LLMs excel in drafting and issue spotting, often matching or surpassing human work. However, they struggle with specialised legal research, frequently generating hallucinations, factually incorrect or fabricated outputs. I conclude that while LLMs can augment certain legal tasks, human expertise remains essential for nuanced reasoning and the precise application of law.

47. Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision

Authors: Gerardo Loza, Junlei Hu, Dominic Jones, Sharib Ali, Pietro Valdastri
URL: https://arxiv.org/abs/2508.09681
Abstract:

We proposed a novel test-time optimisation (TTO) approach framed by a NeRF-based architecture for long-term 3D point tracking. Most current methods in point tracking struggle to obtain consistent motion or are limited to 2D motion. TTO approaches frame the solution for long-term tracking as optimising a function that aggregates correspondences from other specialised state-of-the-art methods. Unlike the state-of-the-art on TTO, we propose parametrising such a function with our new invertible Neural Radiance Field (InvNeRF) architecture to perform both 2D and 3D tracking in surgical scenarios. Our approach allows us to exploit the advantages of a rendering-based approach by supervising the reprojection of pixel correspondences. It adapts strategies from recent rendering-based methods to obtain a bidirectional deformable-canonical mapping, to efficiently handle a defined workspace, and to guide the rays’ density. It also presents our multi-scale HexPlanes for fast inference and a new algorithm for efficient pixel sampling and convergence criteria. We present results in the STIR and SCARE datasets, for evaluating point tracking and testing the integration of kinematic data in our pipeline, respectively. In 2D point tracking, our approach surpasses the precision and accuracy of the TTO state-of-the-art methods by nearly 50% on average precision, while competing with other approaches. In 3D point tracking, this is the first TTO approach, surpassing feed-forward methods while incorporating the benefits of a deformable NeRF-based reconstruction.

48. Anomaly Detection for IoT Global Connectivity

Authors: Jesus Omaña Iglesias, Carlos Segura Perales, Stefan Geißler, Diego Perino, Andra Lutu
URL: https://arxiv.org/abs/2508.09660
Abstract:

Internet of Things (IoT) application providers rely on Mobile Network Operators (MNOs) and roaming infrastructures to deliver their services globally. In this complex ecosystem, where the end-to-end communication path traverses multiple entities, it has become increasingly challenging to guarantee communication availability and reliability. Further, most platform operators use a reactive approach to communication issues, responding to user complaints only after incidents have become severe, compromising service quality. This paper presents our experience in the design and deployment of ANCHOR – an unsupervised anomaly detection solution for the IoT connectivity service of a large global roaming platform. ANCHOR assists engineers by filtering vast amounts of data to identify potential problematic clients (i.e., those with connectivity issues affecting several of their IoT devices), enabling proactive issue resolution before the service is critically impacted. We first describe the IoT service, infrastructure, and network visibility of the IoT connectivity provider we operate. Second, we describe the main challenges and operational requirements for designing an unsupervised anomaly detection solution on this platform. Following these guidelines, we propose different statistical rules, and machine- and deep-learning models for IoT verticals anomaly detection based on passive signaling traffic. We describe the steps we followed working with the operational teams on the design and evaluation of our solution on the operational platform, and report an evaluation on operational IoT customers.

49. On Negative-aware Preference Optimization for Recommendation

Authors: Chenlu Ding, Daoxuan Liu, Jiancan Wu, Xingyu Hu, Junkang Wu, Haitao Wang, Yongkang Wang, Xingxing Wang, Xiang Wang
URL: https://arxiv.org/abs/2508.09653
Abstract:

Recommendation systems leverage user interaction data to suggest relevant items while filtering out irrelevant (negative) ones. The rise of large language models (LLMs) has garnered increasing attention for their potential in recommendation tasks. However, existing methods for optimizing LLM-based recommenders face challenges in effectively utilizing negative samples. Simply integrating large numbers of negative samples can improve ranking accuracy and mitigate popularity bias but often leads to increased computational overhead and memory costs. Additionally, current approaches fail to account for the varying informativeness of negative samples, leading to suboptimal optimization performance. To address these issues, we propose NAPO (\textbf{N}egative-\textbf{A}ware \textbf{P}reference \textbf{O}ptimization), an enhanced framework for preference optimization in LLM-based recommendation. NAPO introduces two key innovations: (1) in-batch negative sharing, which expands the pool of negative samples without additional memory overhead, and (2) dynamic reward margin adjustment, which adapts model updates based on the confidence of negative samples. Extensive experiments on three public datasets demonstrate that NAPO outperforms existing methods in both recommendation accuracy and popularity bias reduction.

50. Demystifying the Role of Rule-based Detection in AI Systems for Windows Malware Detection

Authors: Andrea Ponte, Luca Demetrio, Luca Oneto, Ivan Tesfai Ogbu, Battista Biggio, Fabio Roli
URL: https://arxiv.org/abs/2508.09652
Abstract:

Malware detection increasingly relies on AI systems that integrate signature-based detection with machine learning. However, these components are typically developed and combined in isolation, missing opportunities to reduce data complexity and strengthen defenses against adversarial EXEmples, carefully crafted programs designed to evade detection. Hence, in this work we investigate the influence that signature-based detection exerts on model training, when they are included inside the training pipeline. Specifically, we compare models trained on a comprehensive dataset with an AI system whose machine learning component is trained solely on samples not already flagged by signatures. Our results demonstrate improved robustness to both adversarial EXEmples and temporal data drift, although this comes at the cost of a fixed lower bound on false positives, driven by suboptimal rule selection. We conclude by discussing these limitations and outlining how future research could extend AI-based malware detection to include dynamic analysis, thereby further enhancing system resilience.

51. A Close Reading Approach to Gender Narrative Biases in AI-Generated Stories

Authors: Daniel Raffini, Agnese Macori, Marco Angelini, Tiziana Catarci
URL: https://arxiv.org/abs/2508.09651
Abstract:

The paper explores the study of gender-based narrative biases in stories generated by ChatGPT, Gemini, and Claude. The prompt design draws on Propp’s character classifications and Freytag’s narrative structure. The stories are analyzed through a close reading approach, with particular attention to adherence to the prompt, gender distribution of characters, physical and psychological descriptions, actions, and finally, plot development and character relationships. The results reveal the persistence of biases - especially implicit ones - in the generated stories and highlight the importance of assessing biases at multiple levels using an interpretative approach.

52. Preacher: Paper-to-Video Agentic System

Authors: Jingwei Liu, Ling Yang, Hao Luo, Fan Wang Hongyan Li, Mengdi Wang
URL: https://arxiv.org/abs/2508.09632
Abstract:

The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: this https URL

53. AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries?

Authors: Yuchen Tian, Kaixin Li, Hao Chen, Ziyang Luo, Hongzhan Lin, Sebastian Schelter, Lun Du, Jing Ma
URL: https://arxiv.org/abs/2508.09631
Abstract:

Large Language Models (LLMs) have recently demonstrated strong capabilities in translating natural language into database queries, especially when dealing with complex graph-structured data. However, real-world queries often contain inherent ambiguities, and the interconnected nature of graph structures can amplify these challenges, leading to unintended or incorrect query results. To systematically evaluate LLMs on this front, we propose a taxonomy of graph-query ambiguities, comprising three primary types: Attribute Ambiguity, Relationship Ambiguity, and Attribute-Relationship Ambiguity, each subdivided into Same-Entity and Cross-Entity scenarios. We introduce AmbiGraph-Eval, a novel benchmark of real-world ambiguous queries paired with expert-verified graph query answers. Evaluating 9 representative LLMs shows that even top models struggle with ambiguous graph queries. Our findings reveal a critical gap in ambiguity handling and motivate future work on specialized resolution techniques.

54. TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling

Authors: Yifei Sun, Junming Liu, Ding Wang, Yirong Chen, Xuefeng Yan
URL: https://arxiv.org/abs/2508.09630
Abstract:

Multivariate time series data typically comprises two distinct modalities: variable semantics and sampled numerical observations. Traditional time series models treat variables as anonymous statistical signals, overlooking the rich semantic information embedded in variable names and data descriptions. However, these textual descriptors often encode critical domain knowledge that is essential for robust and interpretable modeling. Here we present TimeMKG, a multimodal causal reasoning framework that elevates time series modeling from low-level signal processing to knowledge informed inference. TimeMKG employs large language models to interpret variable semantics and constructs structured Multivariate Knowledge Graphs that capture inter-variable relationships. A dual-modality encoder separately models the semantic prompts, generated from knowledge graph triplets, and the statistical patterns from historical time series. Cross-modality attention aligns and fuses these representations at the variable level, injecting causal priors into downstream tasks such as forecasting and classification, providing explicit and interpretable priors to guide model reasoning. The experiment in diverse datasets demonstrates that incorporating variable-level knowledge significantly improves both predictive performance and generalization.

55. Goal Discovery with Causal Capacity for Efficient Reinforcement Learning

Authors: Yan Yu, Yaodong Yang, Zhengbo Lu, Chengdong Ma, Wengang Zhou, Houqiang Li
URL: https://arxiv.org/abs/2508.09624
Abstract:

Causal inference is crucial for humans to explore the world, which can be modeled to enable an agent to efficiently explore the environment in reinforcement learning. Existing research indicates that establishing the causality between action and state transition will enhance an agent to reason how a policy affects its future trajectory, thereby promoting directed exploration. However, it is challenging to measure the causality due to its intractability in the vast state-action space of complex scenarios. In this paper, we propose a novel Goal Discovery with Causal Capacity (GDCC) framework for efficient environment exploration. Specifically, we first derive a measurement of causality in state space, \emph{i.e.,} causal capacity, which represents the highest influence of an agent’s behavior on future trajectories. After that, we present a Monte Carlo based method to identify critical points in discrete state space and further optimize this method for continuous high-dimensional environments. Those critical points are used to uncover where the agent makes important decisions in the environment, which are then regarded as our subgoals to guide the agent to make exploration more purposefully and efficiently. Empirical results from multi-objective tasks demonstrate that states with high causal capacity align with our expected subgoals, and our GDCC achieves significant success rate improvements compared to baselines.

56. Interpretable Robot Control via Structured Behavior Trees and Large Language Models

Authors: Ingrid Maéva Chekam, Ines Pastor-Martinez, Ali Tourani, Jose Andres Millan-Romera, Laura Ribeiro, Pedro Miguel Bastos Soares, Holger Voos, Jose Luis Sanchez-Lopez
URL: https://arxiv.org/abs/2508.09621
Abstract:

As intelligent robots become more integrated into human environments, there is a growing need for intuitive and reliable Human-Robot Interaction (HRI) interfaces that are adaptable and more natural to interact with. Traditional robot control methods often require users to adapt to interfaces or memorize predefined commands, limiting usability in dynamic, unstructured environments. This paper presents a novel framework that bridges natural language understanding and robotic execution by combining Large Language Models (LLMs) with Behavior Trees. This integration enables robots to interpret natural language instructions given by users and translate them into executable actions by activating domain-specific plugins. The system supports scalable and modular integration, with a primary focus on perception-based functionalities, such as person tracking and hand gesture recognition. To evaluate the system, a series of real-world experiments was conducted across diverse environments. Experimental results demonstrate that the proposed approach is practical in real-world scenarios, with an average cognition-to-execution accuracy of approximately 94%, making a significant contribution to HRI systems and robots. The complete source code of the framework is publicly available at this https URL.

57. MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

Authors: Daniel Barco (1), Marc Stadelmann (1), Martin Oswald (1), Ivo Herzig (2), Lukas Lichtensteiger (2), Pascal Paysan (3), Igor Peterlik (3), Michal Walczak (3), Bjoern Menze (4), Frank-Peter Schilling (1) ((1) Centre for Artificial Intelligence (CAI), Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland, (2) Institute of Applied Mathematics and Physics (IAMP), Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland, (3) Varian Medical Systems Imaging Lab, Baden, Switzerland, (4) Biomedical Image Analysis and Machine Learning, University of Zurich, Zurich, Switzerland)
URL: https://arxiv.org/abs/2508.09616
Abstract:

We present MInDI-3D (Medical Inversion by Direct Iteration in 3D), the first 3D conditional diffusion-based model for real-world sparse-view Cone Beam Computed Tomography (CBCT) artefact removal, aiming to reduce imaging radiation exposure. A key contribution is extending the “InDI” concept from 2D to a full 3D volumetric approach for medical images, implementing an iterative denoising process that refines the CBCT volume directly from sparse-view input. A further contribution is the generation of a large pseudo-CBCT dataset (16,182) from chest CT volumes of the CT-RATE public dataset to robustly train MInDI-3D. We performed a comprehensive evaluation, including quantitative metrics, scalability analysis, generalisation tests, and a clinical assessment by 11 clinicians. Our results show MInDI-3D’s effectiveness, achieving a 12.96 (6.10) dB PSNR gain over uncorrected scans with only 50 projections on the CT-RATE pseudo-CBCT (independent real-world) test set and enabling an 8x reduction in imaging radiation exposure. We demonstrate its scalability by showing that performance improves with more training data. Importantly, MInDI-3D matches the performance of a 3D U-Net on real-world scans from 16 cancer patients across distortion and task-based metrics. It also generalises to new CBCT scanner geometries. Clinicians rated our model as sufficient for patient positioning across all anatomical sites and found it preserved lung tumour boundaries well.

58. How Persuasive Could LLMs Be? A First Study Combining Linguistic-Rhetorical Analysis and User Experiments

Authors: Daniel Raffini, Agnese Macori, Lorenzo Porcaro, Tiziana Catarci, Marco Angelini
URL: https://arxiv.org/abs/2508.09614
Abstract:

This study examines the rhetorical and linguistic features of argumentative texts generated by ChatGPT on ethically nuanced topics and investigates their persuasive impact on human this http URL a user study involving 62 participants and pre-post interaction surveys, the paper analyzes how exposure to AI-generated arguments affects opinion change and user perception. A linguistic and rhetorical analysis of the generated texts reveals a consistent argumentative macrostructure, reliance on formulaic expressions, and limited stylistic richness. While ChatGPT demonstrates proficiency in constructing coherent argumentative texts, its persuasive efficacy appears constrained, particularly on topics involving ethical this http URL study finds that while participants often acknowledge the benefits highlighted by ChatGPT, ethical concerns tend to persist or even intensify post-interaction. The results also demonstrate a variation depending on the topic. These findings highlight new insights on AI-generated persuasion in ethically sensitive domains and are a basis for future research.

59. A Lightweight Learned Cardinality Estimation Model

Authors: Yaoyu Zhu, Jintao Zhang, Guoliang Li, Jianhua Feng
URL: https://arxiv.org/abs/2508.09602
Abstract:

Cardinality estimation is a fundamental task in database management systems, aiming to predict query results accurately without executing the queries. However, existing techniques either achieve low estimation accuracy or incur high inference latency. Simultaneously achieving high speed and accuracy becomes critical for the cardinality estimation problem. In this paper, we propose a novel data-driven approach called CoDe (Covering with Decompositions) to address this problem. CoDe employs the concept of covering design, which divides the table into multiple smaller, overlapping segments. For each segment, CoDe utilizes tensor decomposition to accurately model its data distribution. Moreover, CoDe introduces innovative algorithms to select the best-fitting distributions for each query, combining them to estimate the final result. By employing multiple models to approximate distributions, CoDe excels in effectively modeling discrete distributions and ensuring computational efficiency. Notably, experimental results show that our method represents a significant advancement in cardinality estimation, achieving state-of-the-art levels of both estimation accuracy and inference efficiency. Across various datasets, CoDe achieves absolute accuracy in estimating more than half of the queries.

60. Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma

Authors: Haotian Tang, Jianwei Chen, Xinrui Tang, Yunjia Wu, Zhengyang Miao, Chao Li
URL: https://arxiv.org/abs/2508.09593
Abstract:

Isocitrate DeHydrogenase (IDH) mutation status is a crucial biomarker for glioma prognosis. However, current prediction methods are limited by the low availability and noise of functional MRI. Structural and morphological connectomes offer a non-invasive alternative, yet existing approaches often ignore the brain’s hierarchical organisation and multiscale interactions. To address this, we propose Hi-SMGNN, a hierarchical framework that integrates structural and morphological connectomes from regional to modular levels. It features a multimodal interaction module with a Siamese network and cross-modal attention, a multiscale feature fusion mechanism for reducing redundancy, and a personalised modular partitioning strategy to enhance individual specificity and interpretability. Experiments on the UCSF-PDGM dataset demonstrate that Hi-SMGNN outperforms baseline and state-of-the-art models, showing improved robustness and effectiveness in IDH mutation prediction.

61. CaRoBio: 3D Cable Routing with a Bio-inspired Gripper Fingernail

Authors: Jiahui Zuo, Boyang Zhang, Fumin Zhang
URL: https://arxiv.org/abs/2508.09558
Abstract:

The manipulation of deformable linear flexures has a wide range of applications in industry, such as cable routing in automotive manufacturing and textile production. Cable routing, as a complex multi-stage robot manipulation scenario, is a challenging task for robot automation. Common parallel two-finger grippers have the risk of over-squeezing and over-tension when grasping and guiding cables. In this paper, a novel eagle-inspired fingernail is designed and mounted on the gripper fingers, which helps with cable grasping on planar surfaces and in-hand cable guiding operations. Then we present a single-grasp end-to-end 3D cable routing framework utilizing the proposed fingernails, instead of the common pick-and-place strategy. Continuous control is achieved to efficiently manipulate cables through vision-based state estimation of task configurations and offline trajectory planning based on motion primitives. We evaluate the effectiveness of the proposed framework with a variety of cables and channel slots, significantly outperforming the pick-and-place manipulation process under equivalent perceptual conditions. Our reconfigurable task setting and the proposed framework provide a reference for future cable routing manipulations in 3D space.

Authors: Fengyi Wu, Yifei Dong, Zhi-Qi Cheng, Yilong Dai, Guangyu Chen, Hang Wang, Qi Dai, Alexander G. Hauptmann
URL: https://arxiv.org/abs/2508.09547
Abstract:

We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike conventional approaches that rely on structured inputs such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) visual forecasting, which predicts intermediate visual states bridging the initial and goal views; and (2) instruction generation, which synthesizes linguistically coherent instructions grounded in both observed and anticipated visuals. These subtasks are integrated within an autoregressive multimodal large language model trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two complementary multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human cognitive processes during navigation. To evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant improvements over state-of-the-art methods, achieving superior BLEU-4 and CIDEr scores along with robust cross-domain generalization.

63. Your Coding Intent is Secretly in the Context and You Should Deliberately Infer It Before Completion

Authors: Yanzhou Li, Tianlin Li, Yiran Zhang, Shangqing Liu, Aishan Liu, Yang Liu
URL: https://arxiv.org/abs/2508.09537
Abstract:

Large Language Models (LLMs) are increasingly used for function completion in repository-scale codebases. Prior studies demonstrate that when explicit instructions–such as docstrings–are provided, these models can generate highly accurate implementations. However, in real-world repositories, such annotations are frequently absent, and performance drops substantially without them. To address this gap, we frame the task as a three-stage process. The first stage focuses on intent inference, where the model analyzes the code preceding the target function to uncover cues about the desired functionality. Such preceding context often encodes subtle but critical information, and we design a reasoning-based prompting framework to guide the LLM through step-by-step extraction and synthesis of these signals before any code is generated. The second stage introduces an optional interactive refinement mechanism to handle cases where preceding context alone is insufficient for intent recovery. In this stage, the model proposes a small set of candidate intentions, enabling the developer to select or edit them so that the inferred intent closely matches the actual requirement. Finally, in the third stage, the LLM generates the target function conditioned on the finalized intent. To support this pipeline, we curate a dataset of 40,000 examples annotated with intermediate reasoning traces and corresponding docstrings. Extensive experiments on DevEval and ComplexCodeEval show that our approach consistently boosts multiple LLMs, achieving over 20\% relative gains in both reference-based and execution-based metrics, with the interactive refinement stage delivering additional improvements beyond these gains.

64. AI Blob! LLM-Driven Recontextualization of Italian Television Archives

Authors: Roberto Balestri
URL: https://arxiv.org/abs/2508.09535
Abstract:

This paper introduces AI Blob!, an experimental system designed to explore the potential of semantic cataloging and Large Language Models (LLMs) for the retrieval and recontextualization of archival television footage. Drawing methodological inspiration from Italian television programs such as Blob (RAI Tre, 1989-), AI Blob! integrates automatic speech recognition (ASR), semantic embeddings, and retrieval-augmented generation (RAG) to organize and reinterpret archival content. The system processes a curated dataset of 1,547 Italian television videos by transcribing audio, segmenting it into sentence-level units, and embedding these segments into a vector database for semantic querying. Upon user input of a thematic prompt, the LLM generates a range of linguistically and conceptually related queries, guiding the retrieval and recombination of audiovisual fragments. These fragments are algorithmically selected and structured into narrative sequences producing montages that emulate editorial practices of ironic juxtaposition and thematic coherence. By foregrounding dynamic, content-aware retrieval over static metadata schemas, AI Blob! demonstrates how semantic technologies can facilitate new approaches to archival engagement, enabling novel forms of automated narrative construction and cultural analysis. The project contributes to ongoing debates in media historiography and AI-driven archival research, offering both a conceptual framework and a publicly available dataset to support further interdisciplinary experimentation.

65. COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

Authors: Peiran Peng, Tingfa Xu, Liqiang Song, Mengqi Zhu, Yuqiang Fang, Jianan Li
URL: https://arxiv.org/abs/2508.09533
Abstract:

Detecting tiny objects in multimodal Red-Green-Blue-Thermal (RGBT) imagery is a critical challenge in computer vision, particularly in surveillance, search and rescue, and autonomous navigation. Drone-based scenarios exacerbate these challenges due to spatial misalignment, low-light conditions, occlusion, and cluttered backgrounds. Current methods struggle to leverage the complementary information between visible and thermal modalities effectively. We propose COXNet, a novel framework for RGBT tiny object detection, addressing these issues through three core innovations: i) the Cross-Layer Fusion Module, fusing high-level visible and low-level thermal features for enhanced semantic and spatial accuracy; ii) the Dynamic Alignment and Scale Refinement module, correcting cross-modal spatial misalignments and preserving multi-scale features; and iii) an optimized label assignment strategy using the GeoShape Similarity Measure for better localization. COXNet achieves a 3.32\% mAP$_{50}$ improvement on the RGBTDronePerson dataset over state-of-the-art methods, demonstrating its effectiveness for robust detection in complex environments.

66. Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks

Authors: Bokeng Zheng, Jianqiang Zhong, Jiayi Liu, Xiaoxi Zhang
URL: https://arxiv.org/abs/2508.09532
Abstract:

Federated fine-tuning has emerged as a promising approach for adapting foundation models (FMs) to diverse downstream tasks in edge environments. In Internet of Vehicles (IoV) systems, enabling efficient and low-latency multi-task adaptation is particularly challenging due to client mobility, heterogeneous resources, and intermittent connectivity. This paper proposes a hierarchical federated fine-tuning framework that coordinates roadside units (RSUs) and vehicles to support resource-aware and mobility-resilient learning across dynamic IoV scenarios. Leveraging Low-Rank Adaptation (LoRA), we introduce a decentralized, energy-aware rank adaptation mechanism formulated as a constrained multi-armed bandit problem. A novel UCB-DUAL algorithm is developed to enable adaptive exploration under per-task energy budgets, achieving provable sublinear regret. To evaluate our method, we construct a large-scale IoV simulator based on real-world trajectories, capturing dynamic participation, RSU handoffs, and communication variability. Extensive experiments show that our approach achieves the best accuracy-efficiency trade-off among all baselines, reducing latency by over 24\% and improving average accuracy by more than 2.5\%.

67. Generation of Indian Sign Language Letters, Numbers, and Words

Authors: Ajeet Kumar Yadav, Nishant Kumar, Rathna G N
URL: https://arxiv.org/abs/2508.09522
Abstract:

Sign language, which contains hand movements, facial expressions and bodily gestures, is a significant medium for communicating with hard-of-hearing people. A well-trained sign language community communicates easily, but those who don’t know sign language face significant challenges. Recognition and generation are basic communication methods between hearing and hard-of-hearing individuals. Despite progress in recognition, sign language generation still needs to be explored. The Progressive Growing of Generative Adversarial Network (ProGAN) excels at producing high-quality images, while the Self-Attention Generative Adversarial Network (SAGAN) generates feature-rich images at medium resolutions. Balancing resolution and detail is crucial for sign language image generation. We are developing a Generative Adversarial Network (GAN) variant that combines both models to generate feature-rich, high-resolution, and class-conditional sign language images. Our modified Attention-based model generates high-quality images of Indian Sign Language letters, numbers, and words, outperforming the traditional ProGAN in Inception Score (IS) and Fréchet Inception Distance (FID), with improvements of 3.2 and 30.12, respectively. Additionally, we are publishing a large dataset incorporating high-quality images of Indian Sign Language alphabets, numbers, and 129 words.

68. COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation

Authors: Yunxiao Wang, Meng Liu, Wenqi Liu, Kaiyu Jiang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou, Liqiang Nie
URL: https://arxiv.org/abs/2508.09521
Abstract:

Emotional support conversations are crucial for promoting emotional well-being, yet current models often lack deep empathetic reasoning grounded in psychological principles. To address this, we propose controllable empathetic reasoning, which combines natural language reasoning with structured psychological steps. We construct a fine-grained dataset annotated with reasoning correctness and response preferences to enable this capability. To further enhance training, we employ reinforcement learning with a unified process-outcome reward model that delivers precise feedback. To mitigate response repetitiveness from entropy collapse, we introduce personality-based dialogue rewriting and a redundancy-aware reward reweighting strategy. Our approach significantly improves model’s emotional support ability, advancing the development of empathetic, human-like support systems.

69. SMART-OC: A Real-time Time-risk Optimal Replanning Algorithm for Dynamic Obstacles and Spatio-temporally Varying Currents

Authors: Reema Raval, Shalabh Gupta
URL: https://arxiv.org/abs/2508.09508
Abstract:

Typical marine environments are highly complex with spatio-temporally varying currents and dynamic obstacles, presenting significant challenges to Unmanned Surface Vehicles (USVs) for safe and efficient navigation. Thus, the USVs need to continuously adapt their paths with real-time information to avoid collisions and follow the path of least resistance to the goal via exploiting ocean currents. In this regard, we introduce a novel algorithm, called Self-Morphing Adaptive Replanning Tree for dynamic Obstacles and Currents (SMART-OC), that facilitates real-time time-risk optimal replanning in dynamic environments. SMART-OC integrates the obstacle risks along a path with the time cost to reach the goal to find the time-risk optimal path. The effectiveness of SMART-OC is validated by simulation experiments, which demonstrate that the USV performs fast replannings to avoid dynamic obstacles and exploit ocean currents to successfully reach the goal.

Authors: Zhanghan Wang, Ding Ding, Hang Zhu, Haibin Lin, Aurojit Panda
URL: https://arxiv.org/abs/2508.09505
Abstract:

Distributed machine learning training and inference is common today because today’s large models require more memory and compute than can be provided by a single GPU. Distributed models are generally produced by programmers who take a sequential model specification and apply several distribution strategies to distribute state and computation across GPUs. Unfortunately, bugs can be introduced in the process, and a distributed model implementation’s outputs might differ from the sequential model’s outputs. In this paper, we describe an approach to statically identify such bugs by checking model refinement, that is, can the sequential model’s outputs be reconstructed from the distributed model’s outputs? Our approach, implemented in GraphGuard, uses iterative rewriting to prove model refinement. Our approach can scale to today’s large models and deployments: we evaluate it using GPT and Llama-3. Further, it provides actionable output that aids in bug localization.

71. From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation

Authors: Siyuan Meng, Junming Liu, Yirong Chen, Song Mao, Pinlong Cai, Guohang Yan, Botian Shi, Ding Wang
URL: https://arxiv.org/abs/2508.09497
Abstract:

Retrieval-augmented generation (RAG) systems are often bottlenecked by their reranking modules, which typically score passages independently and select a fixed Top-K size. This approach struggles with complex multi-hop queries that require synthesizing evidence across multiple documents, creating a trade-off where small K values omit crucial information and large K values introduce noise. To address this, we introduce the Dynamic Passage Selector (DPS), a novel reranking framework that treats passage selection as a supervised learning problem. Unlike traditional point-wise or list-wise methods, DPS is fine-tuned to capture inter-passage dependencies and dynamically select the most relevant set of passages for generation. As a seamless plug-and-play module, DPS requires no modifications to the standard RAG pipeline. Comprehensive evaluations on five benchmarks show that DPS consistently outperforms state-of-the-art rerankers and fine-tuning methods. Notably, on the challenging MuSiQue dataset, DPS improves the F1-score by 30.06% and 15.4% over strong baselines like Qwen3-reranker and RankingGPT, respectively. Our results demonstrate that by enabling adaptive evidence selection, DPS substantially enhances reasoning capabilities in complex RAG scenarios.

72. Learning Facts at Scale with Active Reading

Authors: Jessy Lin, Vincent-Pierre Berges, Xilun Chen, Wen-Tau Yih, Gargi Ghosh, Barlas Oğuz
URL: https://arxiv.org/abs/2508.09494
Abstract:

LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release Meta WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.

73. Large-Small Model Collaborative Framework for Federated Continual Learning

Authors: Hao Yu, Xin Yang, Boyang Fan, Xuemei Cao, Hanlin Gu, Lixin Fan, Qiang Yang
URL: https://arxiv.org/abs/2508.09489
Abstract:

Continual learning (CL) for Foundation Models (FMs) is an essential yet underexplored challenge, especially in Federated Continual Learning (FCL), where each client learns from a private, evolving task stream under strict data and communication constraints. Despite their powerful generalization abilities, FMs often exhibit suboptimal performance on local downstream tasks, as they are unable to utilize private local data. Furthermore, enabling FMs to learn new tasks without forgetting prior knowledge is inherently a challenging problem, primarily due to their immense parameter count and high model complexity. In contrast, small models can be trained locally under resource-constrained conditions and benefit from more mature CL techniques. To bridge the gap between small models and FMs, we propose the first collaborative framework in FCL, where lightweight local models act as a dynamic bridge, continually adapting to new tasks while enhancing the utility of the large model. Two novel components are also included: Small Model Continual Fine-tuning is for preventing small models from temporal forgetting; One-by-One Distillation performs personalized fusion of heterogeneous local knowledge on the server. Experimental results demonstrate its superior performance, even when clients utilize heterogeneous small models.

74. Episodic Memory Representation for Long-form Video Understanding

Authors: Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li
URL: https://arxiv.org/abs/2508.09486
Abstract:

Video Large Language Models (Video-LLMs) excel at general video understanding but struggle with long-form videos due to context window limits. Consequently, recent approaches focus on keyframe retrieval, condensing lengthy videos into a small set of informative frames. Despite their practicality, these methods simplify the problem to static text image matching, overlooking spatio temporal relationships crucial for capturing scene transitions and contextual continuity, and may yield redundant keyframes with limited information, diluting salient cues essential for accurate video question answering. To address these limitations, we introduce Video-EM, a training free framework inspired by the principles of human episodic memory, designed to facilitate robust and contextually grounded reasoning. Rather than treating keyframes as isolated visual entities, Video-EM explicitly models them as temporally ordered episodic events, capturing both spatial relationships and temporal dynamics necessary for accurately reconstructing the underlying narrative. Furthermore, the framework leverages chain of thought (CoT) thinking with LLMs to iteratively identify a minimal yet highly informative subset of episodic memories, enabling efficient and accurate question answering by Video-LLMs. Extensive evaluations on the Video-MME, EgoSchema, HourVideo, and LVBench benchmarks confirm the superiority of Video-EM, which achieves highly competitive results with performance gains of 4-9 percent over respective baselines while utilizing fewer frames.

75. NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Authors: Birong Pan, Mayi Xu, Qiankun Pi, Jianhao Chen, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
URL: https://arxiv.org/abs/2508.09473
Abstract:

Ensuring robust safety alignment while preserving utility is critical for the reliable deployment of Large Language Models (LLMs). However, current techniques fundamentally suffer from intertwined deficiencies: insufficient robustness against malicious attacks, frequent refusal of benign queries, degradation in generated text quality and general task performance–the former two reflecting deficits in robust safety and the latter constituting utility impairment. We trace these limitations to the coarse-grained layer-wise interventions in existing methods. To resolve this, we propose NeuronTune, a fine-grained framework that dynamically modulates sparse neurons to achieve simultaneous safety-utility optimization. Our approach first identifies safety-critical and utility-preserving neurons across all layers via attribution, then employs meta-learning to adaptively amplify safety-neuron activations and suppress utility-neuron activations. Crucially, NeuronTune enables tunable adjustment of intervention scope via neuron-count thresholds, supporting flexible adaptation to security-critical or utility-priority scenarios. Extensive experimental results demonstrate that our method significantly outperforms existing state-of-the-art technologies, achieving superior model safety while maintaining excellent utility.

76. DeepFeatIoT: Unifying Deep Learned, Randomized, and LLM Features for Enhanced IoT Time Series Sensor Data Classification in Smart Industries

Authors: Muhammad Sakib Khan Inan, Kewen Liao
URL: https://arxiv.org/abs/2508.09468
Abstract:

Internet of Things (IoT) sensors are ubiquitous technologies deployed across smart cities, industrial sites, and healthcare systems. They continuously generate time series data that enable advanced analytics and automation in industries. However, challenges such as the loss or ambiguity of sensor metadata, heterogeneity in data sources, varying sampling frequencies, inconsistent units of measurement, and irregular timestamps make raw IoT time series data difficult to interpret, undermining the effectiveness of smart systems. To address these challenges, we propose a novel deep learning model, DeepFeatIoT, which integrates learned local and global features with non-learned randomized convolutional kernel-based features and features from large language models (LLMs). This straightforward yet unique fusion of diverse learned and non-learned features significantly enhances IoT time series sensor data classification, even in scenarios with limited labeled data. Our model’s effectiveness is demonstrated through its consistent and generalized performance across multiple real-world IoT sensor datasets from diverse critical application domains, outperforming state-of-the-art benchmark models. These results highlight DeepFeatIoT’s potential to drive significant advancements in IoT analytics and support the development of next-generation smart systems.

77. Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

Authors: Hao Yu, Rupayan Mallick, Margrit Betke, Sarah Adel Bargal
URL: https://arxiv.org/abs/2508.09461
Abstract:

Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.

78. RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

Authors: Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia
URL: https://arxiv.org/abs/2508.09459
Abstract:

Visual manipulation localization (VML) – across both images and videos – is a crucial task in digital forensics that involves identifying tampered regions in visual content. However, existing methods often lack cross-modal generalization and struggle to handle high-resolution or long-duration inputs efficiently. We propose RelayFormer, a unified and modular architecture for visual manipulation localization across images and videos. By leveraging flexible local units and a Global-Local Relay Attention (GLoRA) mechanism, it enables scalable, resolution-agnostic processing with strong generalization. Our framework integrates seamlessly with existing Transformer-based backbones, such as ViT and SegFormer, via lightweight adaptation modules that require only minimal architectural changes, ensuring compatibility without disrupting pretrained representations. Furthermore, we design a lightweight, query-based mask decoder that supports one-shot inference across video sequences with linear complexity. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art localization performance, setting a new baseline for scalable and modality-agnostic VML. Code is available at: this https URL.

79. Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis

Authors: Xi Long, Christy Boscardin, Lauren A. Maggio, Joseph A. Costello, Ralph Gonzales, Rasmyah Hammoudeh, Ki Lai, Yoon Soo Park, Brian C. Gin
URL: https://arxiv.org/abs/2508.09458
Abstract:

Knowledge syntheses (literature reviews) are essential to health professions education (HPE), consolidating findings to advance theory and practice. However, they are labor-intensive, especially during data extraction. Artificial Intelligence (AI)-assisted extraction promises efficiency but raises concerns about accuracy, making it critical to distinguish AI ‘hallucinations’ (fabricated content) from legitimate interpretive differences. We developed an extraction platform using large language models (LLMs) to automate data extraction and compared AI to human responses across 187 publications and 17 extraction questions from a published scoping review. AI-human, human-human, and AI-AI consistencies were measured using interrater reliability (categorical) and thematic similarity ratings (open-ended). Errors were identified by comparing extracted responses to source publications. AI was highly consistent with humans for concrete, explicitly stated questions (e.g., title, aims) and lower for questions requiring subjective interpretation or absent in text (e.g., Kirkpatrick’s outcomes, study rationale). Human-human consistency was not higher than AI-human and showed the same question-dependent variability. Discordant AI-human responses (769/3179 = 24.2%) were mostly due to interpretive differences (18.3%); AI inaccuracies were rare (1.51%), while humans were nearly three times more likely to state inaccuracies (4.37%). Findings suggest AI accuracy depends more on interpretability than hallucination. Repeating AI extraction can identify interpretive complexity or ambiguity, refining processes before human review. AI can be a transparent, trustworthy partner in knowledge synthesis, though caution is needed to preserve critical human insights.

80. A Unified Contrastive-Generative Framework for Time Series Classification

Authors: Ziyu Liu, Azadeh Alavi, Minyi Li, Xiang Zhang
URL: https://arxiv.org/abs/2508.09451
Abstract:

Self-supervised learning (SSL) for multivariate time series mainly includes two paradigms: contrastive methods that excel at instance discrimination and generative approaches that model data distributions. While effective individually, their complementary potential remains unexplored. We propose a Contrastive Generative Time series framework (CoGenT), the first framework to unify these paradigms through joint contrastive-generative optimization. CoGenT addresses fundamental limitations of both approaches: it overcomes contrastive learning’s sensitivity to high intra-class similarity in temporal data while reducing generative methods’ dependence on large datasets. We evaluate CoGenT on six diverse time series datasets. The results show consistent improvements, with up to 59.2% and 14.27% F1 gains over standalone SimCLR and MAE, respectively. Our analysis reveals that the hybrid objective preserves discriminative power while acquiring generative robustness. These findings establish a foundation for hybrid SSL in temporal domains. We will release the code shortly.

81. Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference

Authors: Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin
URL: https://arxiv.org/abs/2508.09442
Abstract:

The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.

82. What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset

Authors: Yuxiao Wang, Yu Lei, Wolin Liang, Weiying Xue, Zhenao Wei, Nan Zhuang, Qi Liu
URL: https://arxiv.org/abs/2508.09428
Abstract:

People control their bodies to establish contact with the environment. To comprehensively understand actions across diverse visual contexts, it is essential to simultaneously consider \textbf{what} action is occurring and \textbf{where} it is happening. Current methodologies, however, often inadequately capture this duality, typically failing to jointly model both action semantics and their spatial contextualization within scenes. To bridge this gap, we introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. Our proposed framework, PaIR-Net, comprises three key components: the Contact Prior Aware Module (CPAM) for identifying contact-relevant body parts, the Prior-Guided Concat Segmenter (PGCS) for pixel-wise contact segmentation, and the Interaction Inference Module (IIM) responsible for integrating global interaction relationships. To facilitate this task, we present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts. Experimental evaluation demonstrates that PaIR-Net significantly outperforms baseline approaches, while ablation studies confirm the efficacy of each architectural component. The code and dataset will be released upon publication.

83. Implicit Hypergraph Neural Networks: A Stable Framework for Higher-Order Relational Learning with Provable Guarantees

Authors: Xiaoyu Li, Guangyu Tang, Jiaojiao Jiang
URL: https://arxiv.org/abs/2508.09427
Abstract:

Many real-world interactions are group-based rather than pairwise such as papers with multiple co-authors and users jointly engaging with items. Hypergraph neural networks have shown great promise at modeling higher-order relations, but their reliance on a fixed number of explicit message-passing layers limits long-range dependency capture and can destabilize training as depth grows. In this work, we introduce Implicit Hypergraph Neural Networks (IHGNN), which bring the implicit equilibrium formulation to hypergraphs: instead of stacking layers, IHGNN computes representations as the solution to a nonlinear fixed-point equation, enabling stable and efficient global propagation across hyperedges without deep architectures. We develop a well-posed training scheme with provable convergence, analyze the oversmoothing conditions and expressivity of the model, and derive a transductive generalization bound on hypergraphs. We further present an implicit-gradient training procedure coupled with a projection-based stabilization strategy. Extensive experiments on citation benchmarks show that IHGNN consistently outperforms strong traditional graph/hypergraph neural network baselines in both accuracy and robustness. Empirically, IHGNN is resilient to random initialization and hyperparameter variation, highlighting its strong generalization and practical value for higher-order relational learning.

84. Domain-Generalization to Improve Learning in Meta-Learning Algorithms

Authors: Usman Anjum, Chris Stockman, Cat Luong, Justin Zhan
URL: https://arxiv.org/abs/2508.09418
Abstract:

This paper introduces Domain Generalization Sharpness-Aware Minimization Model-Agnostic Meta-Learning (DGS-MAML), a novel meta-learning algorithm designed to generalize across tasks with limited training data. DGS-MAML combines gradient matching with sharpness-aware minimization in a bi-level optimization framework to enhance model adaptability and robustness. We support our method with theoretical analysis using PAC-Bayes and convergence guarantees. Experimental results on benchmark datasets show that DGS-MAML outperforms existing approaches in terms of accuracy and generalization. The proposed method is particularly useful for scenarios requiring few-shot learning and quick adaptation, and the source code is publicly available at GitHub.

85. RampNet: A Two-Stage Pipeline for Bootstrapping Curb Ramp Detection in Streetscape Images from Open Government Metadata

Authors: John S. O’Meara, Jared Hwang, Zeyu Wang, Michael Saugstad, Jon E. Froehlich
URL: https://arxiv.org/abs/2508.09415
Abstract:

Curb ramps are critical for urban accessibility, but robustly detecting them in images remains an open problem due to the lack of large-scale, high-quality datasets. While prior work has attempted to improve data availability with crowdsourced or manually labeled data, these efforts often fall short in either quality or scale. In this paper, we introduce and evaluate a two-stage pipeline called RampNet to scale curb ramp detection datasets and improve model performance. In Stage 1, we generate a dataset of more than 210,000 annotated Google Street View (GSV) panoramas by auto-translating government-provided curb ramp location data to pixel coordinates in panoramic images. In Stage 2, we train a curb ramp detection model (modified ConvNeXt V2) from the generated dataset, achieving state-of-the-art performance. To evaluate both stages of our pipeline, we compare to manually labeled panoramas. Our generated dataset achieves 94.0% precision and 92.5% recall, and our detection model reaches 0.9236 AP – far exceeding prior work. Our work contributes the first large-scale, high-quality curb ramp detection dataset, benchmark, and model.

전체 AI 논문 - 2025-08-14

1. Mathematical Computation and Reasoning Errors by Large Language Models

2. RAGulating Compliance: A Multi-Agent Knowledge Graph for Regulatory QA

3. AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving

4. Human-Aligned Procedural Level Generation Reinforcement Learning via Text-Level-Sketch Shared Representation

5. Reasoning About Knowledge on Regular Expressions is 2EXPTIME-complete

6. The PacifAIst Benchmark:Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?

7. UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

8. MEML-GRPO: Heterogeneous Multi-Expert Mutual Learning for RLVR Advancement

9. UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles

10. EvoCurr: Self-evolving Curriculum with Behavior Code Generation for Complex Decision-making

11. An Automated Multi-Modal Evaluation Framework for Mobile Intelligent Assistants

12. Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

13. Vision-driven River Following of UAV via Safe Reinforcement Learning using Semantic Dynamics Model

14. January Food Benchmark (JFB): A Public Benchmark Dataset and Evaluation Suite for Multimodal Food Analysis

15. GBC: Generalized Behavior-Cloning Framework for Whole-Body Humanoid Imitation

16. Specialised or Generic? Tokenization Choices for Radiology Language Models

17. VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

18. A Comprehensive Evaluation framework of Alignment Techniques for LLMs

19. Residual Reservoir Memory Networks

20. T-CACE: A Time-Conditioned Autoregressive Contrast Enhancement Multi-Task Framework for Contrast-Free Liver MRI Synthesis, Segmentation, and Diagnosis

21. Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs

22. Rare anomalies require large datasets: About proving the existence of anomalies

23. COME: Dual Structure-Semantic Learning with Collaborative MoE for Universal Lesion Detection Across Heterogeneous Ultrasound Datasets

24. Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning

25. Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

26. STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

27. Perceptual Reality Transformer: Neural Architectures for Simulating Neurological Perception Conditions

28. PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

29. Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

30. Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

31. RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

32. Provable In-Context Vector Arithmetic via Retrieving Task Concepts

33. TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

34. A Comprehensive Survey of Datasets for Clinical Mental Health AI Systems

35. Automated Segmentation of Coronal Brain Tissue Slabs for 3D Neuropathology

36. Explainable Ensemble Learning for Graph-Based Malware Detection

37. LibRec: Benchmarking Retrieval-Augmented LLMs for Library Migration Recommendations

38. Prototype Training with Dual Pseudo-Inverse and Optimized Hidden Activations

39. Adoption of Explainable Natural Language Processing: Perspectives from Industry and Academia on Practices and Challenges

40. Combinative Matching for Geometric Shape Assembly

41. Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study

42. Counting Short Trajectories in Elementary Cellular Automata using the Transfer Matrix Method

43. NEUBORN: The Neurodevelopmental Evolution framework Using BiOmechanical RemodelliNg

44. Region-to-Region: Enhancing Generative Image Harmonization with Adaptive Regional Injection

45. Improving ARDS Diagnosis Through Context-Aware Concept Bottleneck Models

46. Evaluating the Role of Large Language Models in Legal Practice in India

47. Surg-InvNeRF: Invertible NeRF for 3D tracking and reconstruction in surgical vision

48. Anomaly Detection for IoT Global Connectivity

49. On Negative-aware Preference Optimization for Recommendation

50. Demystifying the Role of Rule-based Detection in AI Systems for Windows Malware Detection

51. A Close Reading Approach to Gender Narrative Biases in AI-Generated Stories

52. Preacher: Paper-to-Video Agentic System

53. AmbiGraph-Eval: Can LLMs Effectively Handle Ambiguous Graph Queries?

54. TimeMKG: Knowledge-Infused Causal Reasoning for Multivariate Time Series Modeling

55. Goal Discovery with Causal Capacity for Efficient Reinforcement Learning

56. Interpretable Robot Control via Structured Behavior Trees and Large Language Models

57. MInDI-3D: Iterative Deep Learning in 3D for Sparse-view Cone Beam Computed Tomography

58. How Persuasive Could LLMs Be? A First Study Combining Linguistic-Rhetorical Analysis and User Experiments

59. A Lightweight Learned Cardinality Estimation Model

60. Hierarchical Brain Structure Modeling for Predicting Genotype of Glioma

61. CaRoBio: 3D Cable Routing with a Bio-inspired Gripper Fingernail

62. GoViG: Goal-Conditioned Visual Navigation Instruction Generation

63. Your Coding Intent is Secretly in the Context and You Should Deliberately Infer It Before Completion

64. AI Blob! LLM-Driven Recontextualization of Italian Television Archives

65. COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection

66. Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks

67. Generation of Indian Sign Language Letters, Numbers, and Words

68. COMPEER: Controllable Empathetic Reinforcement Reasoning for Emotional Support Conversation

69. SMART-OC: A Real-time Time-risk Optimal Replanning Algorithm for Dynamic Obstacles and Spatio-temporally Varying Currents

70. Verify Distributed Deep Learning Model Implementation Refinement with Iterative Relation Inference

71. From Ranking to Selection: A Simple but Efficient Dynamic Passage Selector for Retrieval Augmented Generation

72. Learning Facts at Scale with Active Reading

73. Large-Small Model Collaborative Framework for Federated Continual Learning

74. Episodic Memory Representation for Long-form Video Understanding

75. NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

76. DeepFeatIoT: Unifying Deep Learned, Randomized, and LLM Features for Enhanced IoT Time Series Sensor Data Classification in Smart Industries

77. Gen-AFFECT: Generation of Avatar Fine-grained Facial Expressions with Consistent identiTy

78. RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

79. Hallucination vs interpretation: rethinking accuracy and precision in AI-assisted data extraction for knowledge synthesis