We present CodeFuse-Agent, an AI agent designed to tackle software engineering challenges. It achieves a 61.67% resolution rate on SWE-bench Lite, establishing a new state-of-the-art. Our approach comprises two stages:
(1) Multi-trajectory patch generation by our core agent framework CodeFuse-Agent
(2) Trajectory-Aware Test-Time Scaling (TTS), which performs systematic candidate selection by cross-validating patches against self-generated test cases consolidated from historical debugging trajectories.
By decoupling generation from verification and exploiting the collective debugging artifacts across trajectories, CodeFuse-Agent substantially improves patch selection accuracy.
We fully open-source CodeFuse-Agent to facilitate reproducibility and benefit the broader research community. By lowering the barrier to entry, we hope to accelerate collective progress in building more capable AI coding agents.
Automated program repair remains a critical challenge in software engineering. While LLM-based agents continue to improve, complex issues often cannot be resolved in a single attempt. Increasing the number of rollouts improves the success probability, but introduces a new bottleneck: how can we reliably select the correct patch from multiple candidates?
Current approaches predominantly rely on LLM-as-Judge methods, such as using LLMs to vote on candidate patches or even resorting to random selection when votes are tied. We argue that such approaches are inherently unstable and lack robustness—LLM judgments can be inconsistent across runs and sensitive to prompt variations. Furthermore, these approaches suffer from limited interpretability—the rationale provided by LLM judges lacks grounding in executable or testable evidence, making it hard to objectively validate the selection decision.
To explore this, we conducted a preliminary study on SWE-Bench-Verified (500 instances):
Finding 1: Self-validation in a single rollout is unreliable.
| | Count | Empty Patch | With Debugging Process | Without Debugging Process |
|---|---|---|---|---|
| Unresolved | 174 | 4 | 157 | 13 |
| Resolved | 326 | 0 | 283 | 43 |
| All Instances | 500 | 4 | 440 | 56 |

Analysis based on SWE-Bench-Verified
We observed that 88% of instances (440/500) included self-generated tests during the debugging process. However, 35.7% of these (157/440) ultimately failed ground-truth evaluation, despite the agent believing its patch had passed its own test cases. This indicates that tests from a single trajectory can be incomplete or incorrect, leading to false confidence.
Finding 2: Multiple rollouts reveal stronger collective potential.
Influence of attempt count on overall performance
A single rollout achieves a 65.2% success rate (326/500), while the best-of-N oracle reaches 82.4% (412/500), a gap of 17.2 percentage points. This demonstrates that correct patches do exist among the rollouts; the challenge lies in identifying them.
These findings motivate our approach: aggregate test cases from all rollouts to form a comprehensive test suite, and use it as executable, verifiable evidence for patch selection. By pooling self-generated tests from all trajectories and selecting the patch with the highest pass rate, we leverage the model's latent ability to produce good tests—even when individual rollouts are inconsistent.
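This selection rule can be sketched in a few lines; the `run_test` callable and the data shapes below are illustrative assumptions, not the released implementation:

```python
def select_patch(patches, pooled_tests, run_test):
    """Pick the candidate patch with the highest pass rate on the pooled suite.

    patches      : list of candidate patch identifiers
    pooled_tests : test cases aggregated from all trajectories
    run_test     : callable (patch, test) -> bool, True if the test passes
    """
    def pass_rate(patch):
        results = [run_test(patch, t) for t in pooled_tests]
        return sum(results) / max(len(results), 1)

    return max(patches, key=pass_rate)

# Toy example: patch "b" passes both pooled tests, patch "a" only one.
outcomes = {("a", "t1"): True, ("a", "t2"): False,
            ("b", "t1"): True, ("b", "t2"): True}
best = select_patch(["a", "b"], ["t1", "t2"], lambda p, t: outcomes[(p, t)])
```

In practice `run_test` would execute the patched repository against a test case inside a sandbox; the pooling itself is simply the union of tests extracted from every trajectory.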
In summary, this work presents two contributions: (1) CodeFuse-Agent (CFuse), a lightweight, research-oriented agent framework for code generation, and (2) Trajectory-Aware Test-Time Scaling (TTS), a verification mechanism that aggregates self-generated tests from all trajectories for cross-trajectory validation. Together, they achieve a 61.67% resolution rate on SWE-bench Lite.
The first stage employs our agent framework to generate diverse candidate patches through multiple independent trajectories.
CodeFuse-Agent (CFuse) is a lightweight, cleanly architected agent framework designed for research and experimentation. It is fully open-source, can be installed with a single pip install command, and provides a complete yet minimal toolset for code-related tasks. We open-source CFuse to facilitate reproducible research and encourage further exploration of LLM-based coding agents.
| Layer | Responsibility |
|---|---|
| Interaction | Terminal UI / Headless / HTTP modes |
| Agent Loop | Core lifecycle: LLM interaction, tool dispatch, iteration control |
| Context Engine | Message history, environment context, compression, prompt assembly |
| LLM Provider | Multi-LLM support (OpenAI, Anthropic, Gemini, etc.) |
| Tool Execution | 6 built-in tools + remote execution |
| Observability | Trajectory logs, execution metrics, cost tracking |
Configurable Agent Profiles
Agent behavior is defined through declarative Markdown profiles (system prompt, tools, model, etc.), enabling quick switching of system prompts and tool subsets without code changes.
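A profile might look like the following sketch; the frontmatter field names (`name`, `model`, `tools`) are illustrative assumptions, not the released schema:

```markdown
---
name: swe-debugger
model: claude-sonnet-4.5
tools: [read_file, grep, bash]
---
You are a debugging agent. Reproduce the reported issue,
locate the faulty code, and propose a minimal patch.
```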
Dual Execution Modes
- Local Mode: Execute tool calls directly in the local environment
- HTTP Mode: Serve as a tool execution backend or delegate calls to remote sandboxes
This decoupling of agent decisions from environment execution makes CFuse suitable as scaffolding for RL training pipelines.
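The decoupling can be pictured as an executor interface sitting behind the agent loop; the class and method names below are illustrative, not CFuse's actual API:

```python
import subprocess
from abc import ABC, abstractmethod


class ToolExecutor(ABC):
    """Illustrative interface: the agent loop emits tool calls; the executor
    decides where they run (local process vs. remote sandbox)."""

    @abstractmethod
    def run(self, tool: str, args: dict) -> str: ...


class LocalExecutor(ToolExecutor):
    """Local mode: execute tool calls directly in the local environment."""

    def run(self, tool: str, args: dict) -> str:
        if tool == "bash":
            out = subprocess.run(args["cmd"], shell=True, text=True,
                                 capture_output=True,
                                 timeout=args.get("timeout", 30))
            return out.stdout
        raise NotImplementedError(tool)


class HttpExecutor(ToolExecutor):
    """HTTP mode: delegate tool calls to a remote sandbox endpoint."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint  # remote sandbox URL

    def run(self, tool: str, args: dict) -> str:
        # Would POST {tool, args} to self.endpoint and return the output;
        # left unimplemented here to keep the sketch dependency-free.
        raise NotImplementedError
```

Because the agent loop only sees the `ToolExecutor` interface, an RL training pipeline can swap in a sandboxed executor without touching the decision-making code.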
Built-in Tools
The framework provides six built-in tools for code exploration and modification:
- read_file: Read file contents with optional line range selection
- write_file: Create or overwrite files
- edit_file: Perform edits via search-and-replace
- grep: Fast code search powered by ripgrep
- glob: File discovery using glob patterns
- bash: Execute shell commands with timeout control
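To illustrate the search-and-replace semantics of an edit tool, a minimal `edit_file` could look like the following (the require-exactly-one-match rule is an assumption for safety, not necessarily CFuse's behavior):

```python
from pathlib import Path


def edit_file(path: str, search: str, replace: str) -> None:
    """Minimal sketch of a search-and-replace edit tool. Assumed semantics:
    the search string must occur exactly once, so edits are unambiguous."""
    text = Path(path).read_text()
    if text.count(search) != 1:
        raise ValueError("search string must match exactly once")
    Path(path).write_text(text.replace(search, replace))
```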
Building upon the trajectories from Stage 1, this stage performs systematic verification and selection through three sequential components.
Rather than designing yet another heuristic-driven test generation agent, we reframe the problem: we introduce a Test Consolidate Agent that consolidates debugging experience from the agents' own debugging trajectories into a single executable test file.
Moreover, to mitigate excessively long context windows, we first extract only the tool invocation content from the debugging-relevant steps of each agent's execution trajectory. We then adopt a sliding-window approach, processing only N consecutive tool invocations at a time; the following formulation makes this precise:
Let there be $M$ trajectories $\{T^{(1)}, \dots, T^{(M)}\}$, where the $m$-th trajectory $T^{(m)} = (s^{(m)}_1, s^{(m)}_2, \dots)$ is a sequence of agent steps. For each trajectory we:

- Filter debugging-relevant steps: $D^{(m)} = \{\, s^{(m)}_i \in T^{(m)} \mid s^{(m)}_i \text{ is debugging-relevant} \,\}$.
- Extract tool invocation content from those steps: $C^{(m)} = (c^{(m)}_1, \dots, c^{(m)}_{K_m})$, where $K_m = |C^{(m)}|$ denotes the number of extracted tool invocations.
- Apply a sliding window of fixed size $N$ over $C^{(m)}$ to generate contextual segments: $W^{(m)}_j = (c^{(m)}_j, \dots, c^{(m)}_{j+N-1})$ for $j = 1, \dots, K_m - N + 1$. (If $K_m < N$, implementations may either skip the trajectory or treat the entire $C^{(m)}$ as a single window.)
- Enrich the corresponding test file $f^{(m)}$ using each window: $f^{(m)} \leftarrow \mathrm{Enrich}(f^{(m)}, W^{(m)}_j)$.

The final output is the union of all enriched test cases across trajectories: $\mathcal{F} = \bigcup_{m=1}^{M} f^{(m)}$.

This approach ensures that context length remains bounded (by $N$) while preserving only the most relevant tool interactions for debugging within each individual execution trace.
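The windowing step can be sketched in Python. The stride-1 overlap and the treat-as-single-window fallback for short trajectories are our assumptions, since the text fixes only the window size N:

```python
def sliding_windows(invocations, n):
    """Split a trajectory's extracted tool invocations into windows of fixed
    size n, bounding the context passed to the consolidation agent.

    invocations : list of extracted tool invocation contents (C^(m))
    n           : window size N
    Returns a list of windows; if fewer than n invocations exist, the whole
    list is treated as a single window (one of the two fallbacks in the text).
    """
    k = len(invocations)
    if k == 0:
        return []
    if k < n:
        return [invocations]
    return [invocations[j:j + n] for j in range(k - n + 1)]
```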
Each candidate patch is then executed against the consolidated test suite, and its number of passed tests is recorded.
Patches are ranked by their pass counts. The top-K candidates with the highest scores proceed to the final selection stage. Meanwhile, we employ a Test Evaluation Agent to execute unit tests and report the final pass/fail status, thereby mitigating potential test execution failures or compilation errors caused by long-tail engineering bugs.
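A minimal sketch of the ranking step, assuming pass counts over the consolidated suite have already been collected per patch:

```python
def rank_patches(pass_counts, k=3):
    """Rank candidate patches by consolidated-suite pass count (descending)
    and return the top-k identifiers for the final selection stage.

    pass_counts : dict mapping patch_id -> number of consolidated tests passed
    """
    ranked = sorted(pass_counts, key=pass_counts.get, reverse=True)
    return ranked[:k]
```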
For each issue, we generate 4 candidate trajectories using two model configurations:
- Claude Sonnet 4: 2 trajectories
- Claude Sonnet 4.5: 2 trajectories
All temperatures are set to 0. Each trajectory executes within the official SWE-bench Docker environment. The agent iteratively explores the codebase, formulates hypotheses, and produces a candidate patch.
In our implementation of TTS, all agents (the Test Consolidate Agent and the Test Evaluation Agent) are built on CFuse, varying across tasks only in the system prompt and the set of available tools.
Single Attempt Results:
| Base Model | Resolved |
|---|---|
| Claude-Sonnet-4 (run 1) | 54.67% |
| Claude-Sonnet-4 (run 2) | 54% |
| Claude-Sonnet-4.5 (run 1) | 60% |
| Claude-Sonnet-4.5 (run 2) | 61% |
Multi-Trajectory Statistics (Combined):
| Oracle | Adversary | Average@1 | Average@2 | Average@3 | TTS Rank@1 | TTS Rank@2 | TTS Rank@3 |
|---|---|---|---|---|---|---|---|
| 68.67% | 47% | 57.67% | 64% | 65.33% | 61.67% | 65% | 66.33% |
Oracle: an instance is considered passed if any of the given patches passes all test cases.
Adversary: an instance is considered passed if all of the given patches pass all test cases.
Average@K: an instance is considered passed if any of K randomly sampled patches passes all test cases.
TTS Rank@K: apply test case consolidation to all patches, rank them by pass rate on the consolidated suite, and consider an instance passed if any of the top-K ranked patches passes all test cases.
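Given a per-instance record of which candidate patches pass all ground-truth tests, the Oracle, Adversary, and Average@K metrics can be computed exactly, with Average@K taken as the expectation over random size-K subsets. This sketch assumes such a boolean matrix is available (it is not part of the released code):

```python
from itertools import combinations


def metrics(outcomes, k):
    """Compute (oracle, adversary, average@k) as fractions of instances passed.

    outcomes : list of per-instance lists of booleans, one per candidate
               patch, True if that patch passes all ground-truth tests
    k        : subset size for Average@K (must not exceed the patch count)
    """
    n = len(outcomes)
    oracle = sum(any(o) for o in outcomes) / n
    adversary = sum(all(o) for o in outcomes) / n
    # Average@K: exact expectation via enumeration of all size-k subsets.
    avg_k = 0.0
    for o in outcomes:
        subsets = list(combinations(o, k))
        avg_k += sum(any(s) for s in subsets) / len(subsets)
    return oracle, adversary, avg_k / n
```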
Single vs. Multiple attempts
- Strong Single-Attempt Performance with Inherent Variability: Claude 4.5 achieves high resolution rates (60–61%) in a single attempt, but fluctuations indicate stochasticity in its reasoning, suggesting some problems inherently require multiple tries.
- Significant Portion of Problems Are Not Solvable in One Attempt: The gap between Oracle (68.67% solved with any attempt) and Adversary (47% solved with all attempts) indicates that 21.67% of problems are solvable in principle but not reliably resolved in a single attempt, highlighting the role of randomness in successful inference.
- Multiple Attempts Substantially Boost Success Rates: Allowing up to four attempts increases overall solvability to 68.67%, with metrics like Average@k confirming a consistent positive correlation between allowed attempts and task resolution.
Insight: We require a robust and systematic approach to reliably derive a correct solution from multiple inference attempts; this necessity constitutes a primary motivation for implementing test case consolidation.
Test Case Consolidation Gains
- The top-ranked patch selected via pass-rate–based consolidation (Rank@1) significantly outperforms both the average single-attempt success rate (Average@1) and the best-known single-attempt result of Claude 4.5, demonstrating its effectiveness in identifying high-quality solutions.
- The performance gain is not limited to the top candidate—Rank@2 (65%) and Rank@3 (66.33%) also markedly exceed Average@2 (64%) and Average@3 (65.33%), indicating that test case consolidation yields more reliable and higher-quality candidate rankings across multiple positions.
- Reranking based on Test Case Consolidation narrows the gap with the oracle; however, relying solely on Rank@1 still leaves a noticeable performance gap. We leave for future work the exploration of how to further identify the best patch from Rank@2 or even Rank@3 candidates.
We presented CodeFuse-Agent, a system achieving a 61.67% resolution rate on SWE-bench Lite through Trajectory-Aware Test-Time Scaling. Our key contribution is demonstrating that agent debugging artifacts, particularly self-generated tests, provide valuable signals for patch selection that complement traditional execution-based validation. The decoupling of diverse generation from systematic verification offers a principled framework for scaling test-time compute in code repair tasks.


