QUICK REVIEW

[Paper Review] KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Zhiyuan Hu|arXiv (Cornell University)|Feb 19, 2026
Multimodal Machine Learning Applications | Citations: 0
One-line summary

tldr: KLong is an open-source LLM agent trained to tackle extremely long-horizon tasks by combining trajectory-splitting supervised fine-tuning with progressive reinforcement learning, plus the Research-Factory data pipeline for scalable training data and rubrics.

ABSTRACT

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

Motivation and Objectives

  • Motivate the need for agents that handle tasks exceeding standard context windows and long-running experiments (e.g., reproducing research papers).
  • Introduce KLong, an open-source LLM agent trained specifically for extremely long-horizon tasks.
  • Propose data-generation and evaluation pipelines (Research-Factory) to scale long-horizon training data and rubrics.
  • Develop trajectory-splitting SFT to preserve early context while fitting within context windows.
  • Propose progressive RL with staged timeouts to improve long-horizon planning and execution.

Proposed Method

  • Build Research-Factory to automatically collect papers, construct rubrics, and distill thousands of long-horizon trajectories from Claude 4.5 Sonnet (Thinking).
  • Cold-start a base model with a comprehensive SFT recipe covering knowledge, coding, math, and search to activate basic agentic abilities.
  • Propose trajectory-splitting SFT that pins the paper-reading prefix, overlaps sub-trajectories, and truncates later context to fit the context window.
  • Introduce progressive RL that trains in multiple stages with gradually extended timeouts and trajectory-splitting for long-horizon feedback.
  • Use a unified sandbox and infrastructure optimizations (sandboxing, caching, rollout scheduling, and judge settings) to improve efficiency and robustness.
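
The trajectory-splitting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_trajectory`, its parameters, and the use of turn counts as a stand-in for a token budget are all assumptions made for illustration. It shows the three stated properties: the early prefix is pinned in every sub-trajectory, consecutive sub-trajectories overlap, and context that no longer fits is dropped.

```python
from typing import List


def split_trajectory(
    prefix: List[str],
    turns: List[str],
    window: int,
    overlap: int,
) -> List[List[str]]:
    """Split one long trajectory into sub-trajectories of at most `window` items.

    Hypothetical helper, not from the paper.
    prefix  : early context pinned into every chunk (e.g. task + paper reading)
    turns   : the remaining turns of the trajectory, in order
    window  : maximum items per chunk, counting the prefix (proxy for tokens)
    overlap : number of turns shared by consecutive chunks
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    budget = window - len(prefix)  # room left for non-prefix turns
    if budget <= overlap:
        raise ValueError("window too small for prefix plus overlap")

    step = budget - overlap  # how far the sliding window advances each time
    chunks: List[List[str]] = []
    start = 0
    while True:
        # Turns before `start` are truncated; the prefix is always kept.
        chunks.append(prefix + turns[start:start + budget])
        if start + budget >= len(turns):
            break
        start += step
    return chunks


if __name__ == "__main__":
    demo = split_trajectory(
        prefix=["task", "paper"],
        turns=[f"t{i}" for i in range(7)],
        window=5,
        overlap=1,
    )
    for chunk in demo:
        print(chunk)
```

In a real pipeline the budget would presumably be measured in tokens rather than turns, and the loss would likely be applied only to assistant turns; both details are outside what this summary states.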

Experimental Results

Research Questions

  • RQ1: Can an LLM agent be trained to solve tasks that exceed standard context windows and require long-running experimentation?
  • RQ2: Does a trajectory-splitting SFT approach improve learning of extremely long-horizon behaviors compared to baseline SFT?
  • RQ3: Can progressive RL with increasing timeouts stabilize and boost performance on long-horizon tasks?
  • RQ4: Does the Research-Factory pipeline produce high-quality, scalable data and rubrics for reproducible research tasks?

Key Findings

  • KLong (106B) achieves the strongest average performance among open-source models on PaperBench, surpassing Kimi K2 Thinking (1T) by 11.28%, and narrows the gap to some closed-source systems.
  • Trajectory-splitting SFT substantially increases the number of assistant turns and improves scores over the baseline SFT, demonstrating its effectiveness for long-horizon behaviors.
  • Progressive RL with longer timeouts yields additional gains, with RL-6H achieving the best overall performance.
  • KLong generalizes well to other long-horizon domains, including SWE-bench Verified, Terminal-Bench Hard, SEC-bench, and MLE-bench competitions.
  • Infrastructure optimizations and the Research-Factory pipeline contribute to scalable data generation and more robust evaluation signals.
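
The staged timeouts behind the progressive RL result can be pictured as a simple schedule. The sketch below is only an illustration under stated assumptions: `Stage`, `progressive_schedule`, and `run_progressive` are hypothetical names, the `train_stage` callback stands in for an RL training loop that is not described here, and the 2-hour and 4-hour values are placeholders. Only the RL-6H stage is named in this review.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

State = TypeVar("State")


@dataclass(frozen=True)
class Stage:
    name: str
    timeout_hours: float  # wall-clock limit per rollout in this stage


def progressive_schedule(timeouts_hours: Sequence[float]) -> List[Stage]:
    """Build stages whose rollout timeouts strictly increase (hypothetical helper)."""
    pairs = zip(timeouts_hours, timeouts_hours[1:])
    if any(later <= earlier for earlier, later in pairs):
        raise ValueError("timeouts must be strictly increasing")
    return [Stage(name=f"RL-{t:g}H", timeout_hours=t) for t in timeouts_hours]


def run_progressive(
    stages: Sequence[Stage],
    train_stage: Callable[[State, Stage], State],
    state: State,
) -> State:
    """Run stages in order, passing the state (e.g. a checkpoint) forward."""
    for stage in stages:
        state = train_stage(state, stage)
    return state


if __name__ == "__main__":
    # Placeholder values: only the 6-hour stage is mentioned in the review.
    stages = progressive_schedule([2, 4, 6])

    def fake_train(history: List[str], stage: Stage) -> List[str]:
        # Stand-in for RL training with rollouts capped at stage.timeout_hours.
        return history + [stage.name]

    print(run_progressive(stages, fake_train, []))
```

The design point is that each stage resumes from the previous stage's state, so the policy meets longer rollouts only after it has been trained on shorter ones.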


This review was generated by AI and checked by a human editor.