Skip to content

Autoresearch/mar14#511

Closed
hughdbrown wants to merge 6 commits intoroborev-dev:mainfrom
hughdbrown:autoresearch/mar14
Closed

Autoresearch/mar14#511
hughdbrown wants to merge 6 commits intoroborev-dev:mainfrom
hughdbrown:autoresearch/mar14

Conversation

@hughdbrown
Copy link
Copy Markdown
Contributor

Use approach in karpathy's autoresearch to optimize roborev-fix skill.

@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Mar 15, 2026

roborev: Combined Review (f000d0c)

Summary: The PR introduces the skill-optimizer tool
and updates the roborev-fix skill, but contains several medium-severity issues related to scoring logic, scenario expectations, and a potential path traversal security risk.

Medium

1. Potential Local File Exfiltration via Symlinked Skill Paths
Files: tools/skill-optimizer/
evaluate.py
, [tools/skill-optimizer/simulate.py](/home/roborev/repos/roborev/tools/skill-optimizer/simulate.py#L
35), tools/skill-optimizer/simulate.py
--skill is accepted as an arbitrary filesystem path, read with Path.read_text(), and then
embedded verbatim into the prompt sent to the LLM. Because this follows symlinks and there is no validation that the target is a regular SKILL.md under an approved directory, a malicious repo can plant a symlinked skill file pointing at sensitive local files (e.g., ~/.aws/credentials). If
evaluated, the tool will exfiltrate the file contents to the remote model API.
Suggested fix: Resolve and validate the input path before reading it. Reject symlinks and non-regular files, and restrict --skill to expected directories or filenames.

2. Inexact Command Matching in Scoring
File
:
scoring.py
score_completion() uses raw startswith() matching, meaning incorrect commands can score as correct (e.g., roborev show --job 420 matches expected roborev show --job 42). This allows traces to operate on the wrong review or ref and still get full completion credit.
Suggested fix: Compare shell tokens with exact boundaries, allowing extra trailing tokens only where intended.

3. Scenario and Skill Instruction
Mismatch

Files: roborev-fix.json, [SKILL.md](/home/roborev/repos/rob
orev/internal/skills/claude/roborev-fix/SKILL.md#L86)
The happy_path_single_review scenario hard-requires git show abc1234, but the updated skill instructs to run git show only when the review
output is unclear. In this scenario, the mocked review already includes necessary details, so a trace following the new skill will be penalized.
Suggested fix: Remove git show from the scenario’s expected_commands, or make the mocked review ambiguous enough that git show is required.

**
4. Misaligned Efficiency Metric**
Files: evaluate.py, [simulate.py](/home/roborev/repos/roborev/tools/skill-optimizer
/simulate.py#L58)
Efficiency is scored with actual_calls=len(trace), but the simulator prompt explicitly asks for file reads, edits, and user communication in addition to bash commands. This metric penalizes verbosity and trace formatting rather than just tool usage.
Suggested fix:
Count only executable/tool actions for efficiency, or rename/redefine the metric and scenario field so they match.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Mar 15, 2026

roborev: Combined Review (27bcada)

Verdict: The PR successfully introduces the skill-optimizer tool and refines the roborev-fix skill, but requires several medium-severity bug fixes, deterministic configuration, and a security mitigation for a test dependency before merging.

Medium Severity

  • [evaluate.py](/home/roborev/repos/roborev/tools/skill-optimizer
    /evaluate.py#L64):
    run_evaluation() converts per-scenario exceptions into zero scores and still returns success. A transient Anthropic/API failure will look like a legitimately bad skill change, so the optimizer can discard a good edit instead of retrying. Distinguish evaluator failures from scored results
    and exit non-zero when any scenario fails. Add a test for that error path.
  • roborev-fix.json and
    SKILL.md: The skill now says git show <git_ref> is conditional, but the happy-path scenario still
    requires git show abc1234 unconditionally. An agent that follows the updated skill exactly can now lose completion score for doing the intended thing. Make that command optional in the scenario, or split the scenario into “context clear” and “context unclear” variants.
  • [simulate.py](/home/
    roborev/repos/roborev/tools/skill-optimizer/simulate.py#L69)
    and judge.py: Neither Anthropic request pins sampling
    to deterministic settings. Since this harness is used for keep/discard decisions, score noise across identical runs will corrupt the optimization loop. Set temperature=0 for both simulation and judging, and keep the evaluation config fixed.
  • [test_scoring.py](/home/roborev/repos/rob
    orev/tools/skill-optimizer/tests/test_scoring.py#L39):
    test_missing_commands() expects 2/5, but score_completion() leaves trace_idx unchanged after a miss, so the later roborev comment still matches and
    the function returns 3/5. As written, this new test should fail. Either update the expected value to 3/5, or change the scoring algorithm if later matches are supposed to be invalidated by an earlier miss.
  • [tools/skill-optimizer/pyproject.toml:10](/home/rob
    orev/repos/roborev/tools/skill-optimizer/pyproject.toml#L10):
    The new dev dependency introduces pytest 9.0.2, which is affected by the disclosed tmpdir vulnerability on Unix (CVE-2025-7117 6). Pytest uses predictable /tmp/pytest-of-{user} paths and can be abused via symlink/TOCTOU attacks on shared systems. Suggested remediation: upgrade to a fixed pytest release as soon as one is available. Until then, set a private TMPDIR/PYTEST_DEBUG_ TEMPROOT in CI and developer docs so pytest uses a securely created directory instead of the shared /tmp pattern.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

hughdbrown and others added 6 commits March 15, 2026 06:47
Split the chained `roborev comment && roborev close` into two separate
commands so the simulated agent emits them as individual bash calls,
matching the expected_commands in scenarios.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ance

Bold the sort-by-severity and group-by-file instructions and add
explicit ordering (HIGH → MEDIUM → LOW) to help the quality judge
score these dimensions higher.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d findings

Reword comment instructions to emphasize referencing each finding by
severity/file and noting intentionally skipped findings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change step 3 to only run git show when the review output lacks
sufficient context, rather than automatically for every review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduce the IMPORTANT section from 5 lines to 2, removing redundant
detail while preserving the key directives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… example

Use separate bullet points for comment and close commands in
the explicit job IDs example, matching the auto-discovery format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hughdbrown
Copy link
Copy Markdown
Contributor Author

hughdbrown commented Mar 15, 2026

I rebased the commits so that changes to roborev-fix skill all come first and then I deleted all commits that added and changed tools/ directory. On my local branch, the changes look like:

2026-03-15 06:47 -0600 Hugh Brown         o [autoresearch/mar14] experiment: align explicit_job_ids example format with auto-discovery
2026-03-15 06:47 -0600 Hugh Brown         o experiment: condense IMPORTANT section for brevity
2026-03-15 06:47 -0600 Hugh Brown         o experiment: make git show conditional to reduce unnecessary tool calls
2026-03-15 06:47 -0600 Hugh Brown         o experiment: improve comment guidance to reference severity and skipped findings
2026-03-15 06:47 -0600 Hugh Brown         o experiment: strengthen severity prioritization and file grouping guidance
2026-03-15 06:47 -0600 Hugh Brown         o experiment: separate comment and close into distinct commands

All of the roborev-ci combined reviews should now be commenting on files that are effectively deleted or "never existed".

@roborev-ci
Copy link
Copy Markdown

roborev-ci bot commented Mar 15, 2026

roborev: Combined Review (72014a0)

Verdict: The updates to the roborev-fix skill instructions introduce a potential workflow regression by removing the execution failure guard between comment and close commands.

Medium

  • internal/skills/claude/rob
    orev-fix/SKILL.md:107

    Changing roborev comment ... && roborev close ... to two independent commands removes the failure guard between them. If roborev comment fails but the agent continues, the review can still be closed without the required summary comment, which is a workflow regression and loses audit trail. Suggested fix: keep the commands separate only for readability in the prose/examples, but
    explicitly require "only run roborev close after roborev comment succeeds", or keep && in the executable example.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm
Copy link
Copy Markdown
Collaborator

wesm commented Mar 15, 2026

Cool — one thing that I need to take a closer look at is that I notice that Codex will sometimes try to invoke the fix skill even when presented with code review findings in the review prompt with instructions to "address the findings".

@wesm
Copy link
Copy Markdown
Collaborator

wesm commented Mar 16, 2026

closing in favor of #515

@wesm wesm closed this Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants