[arXiv Digest] 2025-07-08

1. When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Authors: Scott Emmons, Erik Jenner, David K. Elson, Rif A. Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, Rohin Shah
URL: https://arxiv.org/abs/2507.05246
Summary: Chain-of-thought (CoT) monitoring is an appealing AI safety defense, but recent work on “unfaithfulness” has cast doubt on its reliability.

2. MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

Authors: Kaleem Ullah Qasim, Jiashu Zhang
URL: https://arxiv.org/abs/2507.04893
Summary: Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance. Existing methods often rely on monolithic models or black-box prompting, which struggle to scale in noisy, real-world settings.

3. DoPI: Doctor-like Proactive Interrogation LLM for Traditional Chinese Medicine

Authors: Zewen Sun, Ruoxiang Huang, Jiahe Feng, Rundong Kong, Yuqian Wang, Hengyu Liu, Ziqi Gong, Yuyuan Qin, Yingxue Wang, Yu Wang
URL: https://arxiv.org/abs/2507.04877
Summary: Current large language models exhibit notable limitations in medical applications, particularly in conducting effective multi-turn dialogues and proactive questioning. These shortcomings hinder practical application and effectiveness in simulating real-world diagnostic scenarios.

4. Application and Evaluation of Large Language Models for Forecasting the Impact of Traffic Incidents

Authors: George Jagadeesh, Srikrishna Iyer, Michal Polanowski, Kai Xin Thia
URL: https://arxiv.org/abs/2507.04803
Summary: This study examines the feasibility of applying large language models (LLMs) to forecasting the impact of traffic incidents on traffic flow. LLMs have several advantages over existing machine learning-based solutions, such as not requiring a large training dataset and being able to utilize incident logs.

5. FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System

Authors: Toan Nguyen, Tri Le, Quang Nguyen, Anh Nguyen
URL: https://arxiv.org/abs/2507.04770
Summary: We propose a multi-agent system for automatic furniture decoration. Given a human prompt and a household furniture item such as a working desk or a TV stand, our system automates the decoration process.

6. LLM-based Question-Answer Framework for Sensor-driven HVAC System Interaction

Authors: Sungmin Lee, Minju Kang, Joonhee Lee, Seungyong Lee, Dongju Kim, Jingi Hong, Jun Shin, Pei Zhang, JeongGil Ko
URL: https://arxiv.org/abs/2507.04748
Summary: QA interfaces powered by large language models (LLMs) present a promising direction for improving interactivity with HVAC systems. Enabling accurate, real-time, and context-aware interactions introduces unique challenges, including the integration of frequently updated sensor data.

7. Activation Steering for Chain-of-Thought Compression

Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Massoud Pedram
URL: https://arxiv.org/abs/2507.04742
Summary: Large language models excel at complex reasoning when they include intermediate steps. Verbose, English-heavy chains of thought (CoTs) and concise, math-centric CoTs occupy distinct regions in the model’s residual-stream activation space.

8. ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning

Authors: Zhirong Chen, Kaiyan Chang, Zhuolin Li, Xinyang He, Chujie Chen, Cangyuan Li, Mengdi Wang, Haobo Xu, Yinhe Han, Ying Wang
URL: https://arxiv.org/abs/2507.04736
Summary: Large language models (LLMs) show significant potential for automating RTL code generation, but current approaches face a critical challenge: they cannot simultaneously optimize for functional correctness and hardware quality. Post-processing techniques that attempt to improve PP can improve performance.

9. Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Authors: Wei Duan, Li Qian
URL: https://arxiv.org/abs/2507.04673
Summary: The rise of conversational interfaces has greatly enhanced usability. This reliance introduces an unexplored attack surface, in which a malicious payload is injected into a model-attributed message.

10. Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

Authors: Yun Qu, Qi Cheems Wang, Yixiu Mao, Vincent Tao Hu, Xiangyang Ji
URL: https://arxiv.org/abs/2507.04632
Summary: Recent advances have demonstrated the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance.

11. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Authors: Yuanzhe Hu, Yu Wang, Julian McAuley
URL: https://arxiv.org/abs/2507.05257
Summary: Benchmarks for LLM agents focus on evaluating reasoning, planning, and execution capabilities. Another critical component, memory, which encompasses how agents memorize, update, and retrieve long-term information, is under-evaluated.

12. All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Authors: Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer
URL: https://arxiv.org/abs/2507.05211
Summary: Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by their sparse structure, limited annotations, and the challenge of distinguishing fine-grained objects in complex environments. To address these challenges, we propose VDG-Uni3DSeg, a novel framework.

13. Train-before-Test Harmonizes Language Model Rankings

Authors: Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt
URL: https://arxiv.org/abs/2507.05195
Summary: Conflicting rankings hamper model selection, cloud comparisons, and add confusion to a growing ecosystem of competing models. A candidate solution to the problem is to train on the test task.

14. CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

Authors: Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen
URL: https://arxiv.org/abs/2507.05178
Summary: Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities. Existing environments typically focus on small-scale, fully observable, or low-complexity domains.

15. OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model

Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
URL: https://arxiv.org/abs/2507.05177
Summary: The most powerful empathetic large speech language models (LSLMs) are closed, leaving crucial details about their architecture, data, and development opaque to researchers. OpenS2S is a fully open-source, transparent, end-to-end LSLM.

16. AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

Authors: Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, Edward Bowen
URL: https://arxiv.org/abs/2507.05157
Summary: Large language models (LLMs) adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, they have been misused to create highly realistic phishing emails, spread fake news, and generate code to automate cyber crime.

17. Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Authors: Jaewook Lee, Alexander Scarlatos, Andrew Lan
URL: https://arxiv.org/abs/2507.05137
Summary: Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Keyword mnemonics are a common strategy to aid memorization.