We present CodeFuse-Agent, an AI agent designed to tackle software engineering challenges. It achieves a 61.67% resolution rate on SWE-bench Lite, establishing a new state-of-the-art. Our approach comprises two stages:
(1) Multi-trajectory patch generation by our core agent framework CodeFuse-Agent
(2) Trajectory-Aware Test-Time Scaling (TTS), which performs systematic candidate selection by cross-validating patches against self-generated test cases consolidated from historical debugging trajectories.
By decoupling generation from verification and exploiting the collective debugging artifacts across trajectories, CodeFuse-Agent substantially improves patch selection accuracy.
We fully open-source CodeFuse-Agent to facilitate reproducibility and benefit the broader research community. By lowering the barrier to entry, we hope to accelerate collective progress in building more capable AI coding agents.
Automated program repair remains a critical challenge in software engineering. While LLM-based agents continue to improve, complex issues often cannot be resolved in a single attempt. Increasing the number of rollouts improves the success probability, but introduces a new bottleneck: how can we reliably select the correct patch from multiple candidates?
Current approaches predominantly rely on LLM-as-Judge methods, such as using LLMs to vote on candidate patches or even resorting to random selection when votes are tied. We argue that such approaches are inherently unstable and lack robustness—LLM judgments can be inconsistent across runs and sensitive to prompt variations. Furthermore, these approaches suffer from limited interpretability—the rationale provided by LLM judges lacks grounding in executable or testable evidence, making it hard to objectively validate the selection decision.
To explore this, we conducted a preliminary study on SWE-Bench-Verified (500 instances):
Finding 1: Self-validation in a single rollout is unreliable.
| | Count | Empty Patch | With Debugging Process | Without Debugging Process |
|---|---|---|---|---|
| Unresolved | 174 | 4 | 157 | 13 |
| Resolved | 326 | 0 | 283 | 43 |
| All Instances | 500 | 4 | 440 | 56 |

Analysis based on SWE-Bench-Verified
We observed that 88% of instances (440/500) included self-generated tests during the debugging process. However, 35.7% of these (157/440) ultimately failed ground-truth evaluation, despite the agent believing its patch had passed its own test cases. This indicates that tests from a single trajectory can be incomplete or incorrect, leading to false confidence.
Finding 2: Multiple rollouts reveal stronger collective potential.
Influence of attempt count on overall performance
A single rollout achieves a 65.2% success rate (326/500), while the best-of-N oracle reaches 82.4% (412/500), a gap of 17.2 percentage points. This demonstrates that correct patches do exist among the rollouts; the challenge lies in identifying them.
These findings motivate our approach: aggregate test cases from all rollouts to form a comprehensive test suite, and use it as executable, verifiable evidence for patch selection. By pooling self-generated tests from all trajectories and selecting the patch with the highest pass rate, we leverage the model's latent ability to produce good tests—even when individual rollouts are inconsistent.
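This selection rule can be sketched in a few lines; the `run_test` callable and the data shapes below are illustrative assumptions, not the released implementation:

```python
def select_patch(patches, pooled_tests, run_test):
    """Pick the candidate patch with the highest pass rate on the pooled suite.

    patches      : list of candidate patch identifiers
    pooled_tests : test cases aggregated from all trajectories
    run_test     : callable (patch, test) -> bool, True if the test passes
    """
    def pass_rate(patch):
        results = [run_test(patch, t) for t in pooled_tests]
        return sum(results) / max(len(results), 1)

    return max(patches, key=pass_rate)

# Toy example: patch "b" passes both pooled tests, patch "a" only one.
outcomes = {("a", "t1"): True, ("a", "t2"): False,
            ("b", "t1"): True, ("b", "t2"): True}
best = select_patch(["a", "b"], ["t1", "t2"], lambda p, t: outcomes[(p, t)])
```

In practice `run_test` would execute the patched repository against a test case inside a sandbox; the pooling itself is simply the union of tests extracted from every trajectory.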
In summary, this work presents two contributions: (1) CodeFuse-Agent (CFuse), a lightweight, research-oriented agent framework for code generation, and (2) Trajectory-Aware Test-Time Scaling (TTS), a verification mechanism that aggregates self-generated tests from all trajectories for cross-trajectory validation. Together, they achieve a 61.67% resolution rate on SWE-bench Lite.
The first stage employs our agent framework to generate diverse candidate patches through multiple independent trajectories.
CodeFuse-Agent (CFuse) is a lightweight, cleanly architected agent framework designed for research and experimentation. It is fully open-source, can be installed with a single pip install command, and provides a complete yet minimal toolset for code-related tasks. We open-source CFuse to facilitate reproducible research and encourage further exploration of LLM-based coding agents.
| Layer | Responsibility |
|---|---|
| Interaction | Terminal UI / Headless / HTTP modes |
| Agent Loop | Core lifecycle: LLM interaction, tool dispatch, iteration control |
| Context Engine | Message history, environment context, compression, prompt assembly |
| LLM Provider | Multi-LLM support (OpenAI, Anthropic, Gemini, etc.) |
| Tool Execution | 6 built-in tools + remote execution |
| Observability | Trajectory logs, execution metrics, cost tracking |
Configurable Agent Profiles
Agent behavior is defined through declarative Markdown profiles (system prompt, tools, model, etc.), enabling quick switching of system prompts and tool subsets without code changes.
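A profile might look like the following sketch; the frontmatter field names (`name`, `model`, `tools`) are illustrative assumptions, not the released schema:

```markdown
---
name: swe-debugger
model: claude-sonnet-4.5
tools: [read_file, grep, bash]
---
You are a debugging agent. Reproduce the reported issue,
locate the faulty code, and propose a minimal patch.
```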
Dual Execution Modes
- Local Mode: Execute tool calls directly in the local environment
- HTTP Mode: Serve as a tool execution backend or delegate calls to remote sandboxes
This decoupling of agent decisions from environment execution makes CFuse suitable as scaffolding for RL training pipelines.
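The decoupling can be pictured as an executor interface sitting behind the agent loop; the class and method names below are illustrative, not CFuse's actual API:

```python
import subprocess
from abc import ABC, abstractmethod


class ToolExecutor(ABC):
    """Illustrative interface: the agent loop emits tool calls; the executor
    decides where they run (local process vs. remote sandbox)."""

    @abstractmethod
    def run(self, tool: str, args: dict) -> str: ...


class LocalExecutor(ToolExecutor):
    """Local mode: execute tool calls directly in the local environment."""

    def run(self, tool: str, args: dict) -> str:
        if tool == "bash":
            out = subprocess.run(args["cmd"], shell=True, text=True,
                                 capture_output=True,
                                 timeout=args.get("timeout", 30))
            return out.stdout
        raise NotImplementedError(tool)


class HttpExecutor(ToolExecutor):
    """HTTP mode: delegate tool calls to a remote sandbox endpoint."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint  # remote sandbox URL

    def run(self, tool: str, args: dict) -> str:
        # Would POST {tool, args} to self.endpoint and return the output;
        # left unimplemented here to keep the sketch dependency-free.
        raise NotImplementedError
```

Because the agent loop only sees the `ToolExecutor` interface, an RL training pipeline can swap in a sandboxed executor without touching the decision-making code.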
Built-in Tools
The framework provides six built-in tools for code exploration and modification:
- read_file: Read file contents with optional line range selection
- write_file: Create or overwrite files
- edit_file: Perform edits via search-and-replace
- grep: Fast code search powered by ripgrep
- glob: File discovery using glob patterns
- bash: Execute shell commands with timeout control
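To illustrate the search-and-replace semantics of an edit tool, a minimal `edit_file` could look like the following (the require-exactly-one-match rule is an assumption for safety, not necessarily CFuse's behavior):

```python
from pathlib import Path


def edit_file(path: str, search: str, replace: str) -> None:
    """Minimal sketch of a search-and-replace edit tool. Assumed semantics:
    the search string must occur exactly once, so edits are unambiguous."""
    text = Path(path).read_text()
    if text.count(search) != 1:
        raise ValueError("search string must match exactly once")
    Path(path).write_text(text.replace(search, replace))
```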
Building upon the trajectories from Stage 1, this stage performs systematic verification and selection through three sequential components.
Rather than designing yet another heuristic-driven test generation agent, we reframe the problem: we introduce a Test Consolidate Agent that consolidates debugging experience from the agents' own debugging trajectories into a single executable test file.
Moreover, to mitigate excessively long context windows, we first extract only the tool invocation content from the debugging-relevant steps of each agent's execution trajectory. We then adopt a sliding-window approach, processing only N consecutive tool invocations at a time; the following formulation makes this precise:
Let there be $M$ trajectories $\{T^{(1)}, \dots, T^{(M)}\}$, where the $m$-th trajectory $T^{(m)} = (s^{(m)}_1, s^{(m)}_2, \dots)$ is a sequence of agent steps. For each trajectory we:

- Filter debugging-relevant steps: $D^{(m)} = \{\, s^{(m)}_i \in T^{(m)} \mid s^{(m)}_i \text{ is debugging-relevant} \,\}$.
- Extract tool invocation content from those steps: $C^{(m)} = (c^{(m)}_1, \dots, c^{(m)}_{K_m})$, where $K_m = |C^{(m)}|$ denotes the number of extracted tool invocations.
- Apply a sliding window of fixed size $N$ over $C^{(m)}$ to generate contextual segments: $W^{(m)}_j = (c^{(m)}_j, \dots, c^{(m)}_{j+N-1})$ for $j = 1, \dots, K_m - N + 1$. (If $K_m < N$, implementations may either skip the trajectory or treat the entire $C^{(m)}$ as a single window.)
- Enrich the corresponding test file $f^{(m)}$ using each window: $f^{(m)} \leftarrow \mathrm{Enrich}(f^{(m)}, W^{(m)}_j)$.

The final output is the union of all enriched test cases across trajectories: $\mathcal{F} = \bigcup_{m=1}^{M} f^{(m)}$.

This approach ensures that context length remains bounded (by $N$) while preserving only the most relevant tool interactions for debugging within each individual execution trace.
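The windowing step can be sketched in Python. The stride-1 overlap and the treat-as-single-window fallback for short trajectories are our assumptions, since the text fixes only the window size N:

```python
def sliding_windows(invocations, n):
    """Split a trajectory's extracted tool invocations into windows of fixed
    size n, bounding the context passed to the consolidation agent.

    invocations : list of extracted tool invocation contents (C^(m))
    n           : window size N
    Returns a list of windows; if fewer than n invocations exist, the whole
    list is treated as a single window (one of the two fallbacks in the text).
    """
    k = len(invocations)
    if k == 0:
        return []
    if k < n:
        return [invocations]
    return [invocations[j:j + n] for j in range(k - n + 1)]
```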
Each candidate patch is then executed against the consolidated test suite, and its number of passed tests is recorded.
Patches are ranked by their pass counts. The top-K candidates with the highest scores proceed to the final selection stage. Meanwhile, we employ a Test Evaluation Agent to execute unit tests and report the final pass/fail status, thereby mitigating potential test execution failures or compilation errors caused by long-tail engineering bugs.
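A minimal sketch of the ranking step, assuming pass counts over the consolidated suite have already been collected per patch:

```python
def rank_patches(pass_counts, k=3):
    """Rank candidate patches by consolidated-suite pass count (descending)
    and return the top-k identifiers for the final selection stage.

    pass_counts : dict mapping patch_id -> number of consolidated tests passed
    """
    ranked = sorted(pass_counts, key=pass_counts.get, reverse=True)
    return ranked[:k]
```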
For each issue, we generate 4 candidate trajectories using two model configurations:
- Claude Sonnet 4: 2 trajectories
- Claude Sonnet 4.5: 2 trajectories
All temperatures are set to 0. Each trajectory executes within the official SWE-bench Docker environment. The agent iteratively explores the codebase, formulates hypotheses, and produces a candidate patch.
In our implementation of TTS, all agents (the Test Consolidate Agent and the Test Evaluation Agent) are built on CFuse, varying across tasks only in the system prompt and the set of available tools.
Single Attempt Results:
| Base Model | Resolved |
|---|---|
| Claude-Sonnet-4 (run 1) | 54.67% |
| Claude-Sonnet-4 (run 2) | 54% |
| Claude-Sonnet-4.5 (run 1) | 60% |
| Claude-Sonnet-4.5 (run 2) | 61% |
Multi-Trajectory Statistics (Combined):
| Oracle | Adversary | Average@1 | Average@2 | Average@3 | TTS Rank@1 | TTS Rank@2 | TTS Rank@3 |
|---|---|---|---|---|---|---|---|
| 68.67% | 47% | 57.67% | 64% | 65.33% | 61.67% | 65% | 66.33% |
Oracle: an instance is considered passed if any of the given patches passes all test cases.
Adversary: an instance is considered passed if all of the given patches pass all test cases.
Average@K: an instance is considered passed if any of K randomly sampled patches passes all test cases.
TTS Rank@K: apply test case consolidation to all patches, rank them by pass rate on the consolidated suite, and consider an instance passed if any of the top-K ranked patches passes all test cases.
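Given a per-instance record of which candidate patches pass all ground-truth tests, the Oracle, Adversary, and Average@K metrics can be computed exactly, with Average@K taken as the expectation over random size-K subsets. This sketch assumes such a boolean matrix is available (it is not part of the released code):

```python
from itertools import combinations


def metrics(outcomes, k):
    """Compute (oracle, adversary, average@k) as fractions of instances passed.

    outcomes : list of per-instance lists of booleans, one per candidate
               patch, True if that patch passes all ground-truth tests
    k        : subset size for Average@K (must not exceed the patch count)
    """
    n = len(outcomes)
    oracle = sum(any(o) for o in outcomes) / n
    adversary = sum(all(o) for o in outcomes) / n
    # Average@K: exact expectation via enumeration of all size-k subsets.
    avg_k = 0.0
    for o in outcomes:
        subsets = list(combinations(o, k))
        avg_k += sum(any(s) for s in subsets) / len(subsets)
    return oracle, adversary, avg_k / n
```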
Single vs. Multiple attempts
- Strong Single-Attempt Performance with Inherent Variability: Claude 4.5 achieves high resolution rates (60–61%) in a single attempt, but fluctuations indicate stochasticity in its reasoning, suggesting some problems inherently require multiple tries.
- Significant Portion of Problems Are Not Solvable in One Attempt: The gap between Oracle (68.67% solved with any attempt) and Adversary (47% solved with all attempts) indicates that 21.67% of problems are solvable in principle but not reliably resolved in a single attempt, highlighting the role of randomness in successful inference.
- Multiple Attempts Substantially Boost Success Rates: Allowing up to four attempts increases overall solvability to 68.67%, with metrics like Average@k confirming a consistent positive correlation between allowed attempts and task resolution.
Insight: We require a robust and systematic approach to reliably derive a correct solution from multiple inference attempts; this necessity constitutes a primary motivation for implementing test case consolidation.
Test Case Consolidation Gains
- The top-ranked patch selected via pass-rate–based consolidation (Rank@1) significantly outperforms both the average single-attempt success rate (Average@1) and the best-known single-attempt result of Claude 4.5, demonstrating its effectiveness in identifying high-quality solutions.
- The performance gain is not limited to the top candidate—Rank@2 (65%) and Rank@3 (66.33%) also markedly exceed Average@2 (64%) and Average@3 (65.33%), indicating that test case consolidation yields more reliable and higher-quality candidate rankings across multiple positions.
- Reranking based on Test Case Consolidation narrows the gap with the oracle; however, relying solely on Rank@1 still leaves a noticeable performance gap. We leave for future work the exploration of how to further identify the best patch from Rank@2 or even Rank@3 candidates.
We presented CodeFuse-Agent, a system achieving a 61.67% resolution rate on SWE-bench Lite through Trajectory-Aware Test-Time Scaling. Our key contribution is demonstrating that agent debugging artifacts, particularly self-generated tests, provide valuable signals for patch selection that complement traditional execution-based validation. The decoupling of diverse generation from systematic verification offers a principled framework for scaling test-time compute in code repair tasks.


