Key LLM Papers - 2026-02-20

1. AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing


2. AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games


3. ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment


4. KLong: Training LLM Agent for Extremely Long-horizon Tasks


5. Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability


6. Enhancing Large Language Models (LLMs) for Telecom using Dynamic Knowledge Graphs and Explainable Retrieval-Augmented Generation


7. A Privacy by Design Framework for Large Language Model-Based Applications for Children


8. MedClarify: An information-seeking AI agent for medical diagnosis with case-specific follow-up questions


9. ArXiv-to-Model: A Practical Study of Scientific LM Training


10. Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web


11. All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting


12. Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom’s Taxonomy


13. Decoding the Human Factor: High Fidelity Behavioral Prediction for Strategic Foresight


14. Instructor-Aligned Knowledge Graphs for Personalized Learning


15. Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction


16. Agentic Wireless Communication for 6G: Intent-Aware and Continuously Evolving Physical-Layer Intelligence


17. How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses


18. Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization


19. Dynamic System Instructions and Tool Exposure for Efficient Agentic LLMs


20. Phase-Aware Mixture of Experts for Agentic Reinforcement Learning


21. Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation


22. Automating Agent Hijacking via Structural Template Injection


23. LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation


24. Mind the GAP: Text Safety Does Not Transfer to Tool-Call Safety in LLM Agents


25. SourceBench: Can AI Answers Reference Quality Web Sources?


26. DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs


27. Narrow fine-tuning erodes safety alignment in vision-language agents


28. LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs


29. AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks


30. IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages


31. NeuDiff Agent: A Governed AI Workflow for Single-Crystal Neutron Crystallography


32. Simple Baselines are Competitive with Code Evolution


33. When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation


34. Mobility-Aware Cache Framework for Scalable LLM-Based Human Mobility Simulation


35. Retrieval Augmented (Knowledge Graph), and Large Language Model-Driven Design Structure Matrix (DSM) Generation of Cyber-Physical Systems


36. Sink-Aware Pruning for Diffusion Language Models


37. Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting


38. FAMOSE: A ReAct Approach to Automated Feature Discovery


39. When to Trust the Cheap Check: Weak and Strong Verification for Reasoning


40. Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs


41. Towards Anytime-Valid Statistical Watermarking


42. The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?


43. MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning


44. The Anxiety of Influence: Bloom Filters in Transformer Attention Heads


45. What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data


46. Jolt Atlas: Verifiable Inference via Lookup Arguments in Zero Knowledge


47. Beyond Pipelines: A Fundamental Study on the Rise of Generative-Retrieval Architectures in Web Research


48. Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study


49. Improving LLM-based Recommendation with Self-Hard Negatives from Intermediate Layers


50. Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks


51. What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else?


52. Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation


53. Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective


54. The Bots of Persuasion: Examining How Conversational Agents’ Linguistic Expressions of Personality Affect User Perceptions and Decisions


55. Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering


56. Universal Fine-Grained Symmetry Inference and Enforcement for Rigorous Crystal Structure Prediction


57. FLoRG: Federated Fine-tuning with Low-rank Gram Matrices and Procrustes Alignment


58. Wink: Recovering from Misbehaviors in Coding Agents


59. ReIn: Conversational Error Recovery with Reasoning Inception


60. Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History


61. Exploring LLMs for User Story Extraction from Mockups


62. RankEvolve: Automating the Discovery of Retrieval Algorithms via LLM-Driven Evolution


63. Discovering Multiagent Learning Algorithms with Large Language Models


64. Xray-Visual Models: Scaling Vision models on Industry Scale Data


65. MALLVI: a multi agent framework for integrated generalized robotics manipulation


66. AdaptOrch: Task-Adaptive Multi-Agent Orchestration in the Era of LLM Performance Convergence


67. VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training – A Chess Case Study


68. One-step Language Modeling via Continuous Denoising


69. Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark


70. References Improve LLM Alignment in Non-Verifiable Domains


71. Large-scale online deanonymization with LLMs


72. LiveClin: A Live Clinical Benchmark without Leakage


73. Can Adversarial Code Comments Fool AI Security Reviewers – Large-Scale Empirical Study of Comment-Based Attacks and Defenses Against LLM Code Analysis


74. Quantifying LLM Attention-Head Stability: Implications for Circuit Universality


75. APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL