Autoresearch/mar14 by hughdbrown · Pull Request #511 · roborev-dev/roborev

hughdbrown · 2026-03-15T02:27:04Z

Use approach in karpathy's autoresearch to optimize roborev-fix skill.

roborev-ci · 2026-03-15T02:31:17Z

roborev: Combined Review (`f000d0c`)

Summary: The PR introduces the skill-optimizer tool
and updates the roborev-fix skill, but contains several medium-severity issues related to scoring logic, scenario expectations, and a potential path traversal security risk.

Medium

1. Potential Local File Exfiltration via Symlinked Skill Paths
Files: tools/skill-optimizer/
evaluate.py, [tools/skill-optimizer/simulate.py](/home/roborev/repos/roborev/tools/skill-optimizer/simulate.py#L
35), tools/skill-optimizer/simulate.py
--skill is accepted as an arbitrary filesystem path, read with Path.read_text(), and then
embedded verbatim into the prompt sent to the LLM. Because this follows symlinks and there is no validation that the target is a regular SKILL.md under an approved directory, a malicious repo can plant a symlinked skill file pointing at sensitive local files (e.g., ~/.aws/credentials). If
evaluated, the tool will exfiltrate the file contents to the remote model API.
Suggested fix: Resolve and validate the input path before reading it. Reject symlinks and non-regular files, and restrict --skill to expected directories or filenames.

2. Inexact Command Matching in Scoring
File
: scoring.py
score_completion() uses raw startswith() matching, meaning incorrect commands can score as correct (e.g., roborev show --job 420 matches expected roborev show --job 42). This allows traces to operate on the wrong review or ref and still get full completion credit.
Suggested fix: Compare shell tokens with exact boundaries, allowing extra trailing tokens only where intended.

3. Scenario and Skill Instruction
Mismatch
Files: roborev-fix.json, [SKILL.md](/home/roborev/repos/rob
orev/internal/skills/claude/roborev-fix/SKILL.md#L86)
The happy_path_single_review scenario hard-requires git show abc1234, but the updated skill instructs to run git show only when the review
output is unclear. In this scenario, the mocked review already includes necessary details, so a trace following the new skill will be penalized.
Suggested fix: Remove git show from the scenario’s expected_commands, or make the mocked review ambiguous enough that git show is required.

**
4. Misaligned Efficiency Metric**
Files: evaluate.py, [simulate.py](/home/roborev/repos/roborev/tools/skill-optimizer
/simulate.py#L58)
Efficiency is scored with actual_calls=len(trace), but the simulator prompt explicitly asks for file reads, edits, and user communication in addition to bash commands. This metric penalizes verbosity and trace formatting rather than just tool usage.
Suggested fix:
Count only executable/tool actions for efficiency, or rename/redefine the metric and scenario field so they match.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

roborev-ci · 2026-03-15T03:30:31Z

roborev: Combined Review (`27bcada`)

Verdict: The PR successfully introduces the skill-optimizer tool and refines the roborev-fix skill, but requires several medium-severity bug fixes, deterministic configuration, and a security mitigation for a test dependency before merging.

Medium Severity

[evaluate.py](/home/roborev/repos/roborev/tools/skill-optimizer
/evaluate.py#L64): run_evaluation() converts per-scenario exceptions into zero scores and still returns success. A transient Anthropic/API failure will look like a legitimately bad skill change, so the optimizer can discard a good edit instead of retrying. Distinguish evaluator failures from scored results
and exit non-zero when any scenario fails. Add a test for that error path.
roborev-fix.json and
SKILL.md: The skill now says git show <git_ref> is conditional, but the happy-path scenario still
requires git show abc1234 unconditionally. An agent that follows the updated skill exactly can now lose completion score for doing the intended thing. Make that command optional in the scenario, or split the scenario into “context clear” and “context unclear” variants.
[simulate.py](/home/
roborev/repos/roborev/tools/skill-optimizer/simulate.py#L69) and judge.py: Neither Anthropic request pins sampling
to deterministic settings. Since this harness is used for keep/discard decisions, score noise across identical runs will corrupt the optimization loop. Set temperature=0 for both simulation and judging, and keep the evaluation config fixed.
[test_scoring.py](/home/roborev/repos/rob
orev/tools/skill-optimizer/tests/test_scoring.py#L39): test_missing_commands() expects 2/5, but score_completion() leaves trace_idx unchanged after a miss, so the later roborev comment still matches and
the function returns 3/5. As written, this new test should fail. Either update the expected value to 3/5, or change the scoring algorithm if later matches are supposed to be invalidated by an earlier miss.
[tools/skill-optimizer/pyproject.toml:10](/home/rob
orev/repos/roborev/tools/skill-optimizer/pyproject.toml#L10): The new dev dependency introduces pytest 9.0.2, which is affected by the disclosed tmpdir vulnerability on Unix (CVE-2025-7117 6). Pytest uses predictable /tmp/pytest-of-{user} paths and can be abused via symlink/TOCTOU attacks on shared systems. Suggested remediation: upgrade to a fixed pytest release as soon as one is available. Until then, set a private TMPDIR/PYTEST_DEBUG_ TEMPROOT in CI and developer docs so pytest uses a securely created directory instead of the shared /tmp pattern.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Split the chained `roborev comment && roborev close` into two separate commands so the simulated agent emits them as individual bash calls, matching the expected_commands in scenarios. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ance Bold the sort-by-severity and group-by-file instructions and add explicit ordering (HIGH → MEDIUM → LOW) to help the quality judge score these dimensions higher. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…d findings Reword comment instructions to emphasize referencing each finding by severity/file and noting intentionally skipped findings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change step 3 to only run git show when the review output lacks sufficient context, rather than automatically for every review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reduce the IMPORTANT section from 5 lines to 2, removing redundant detail while preserving the key directives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… example Use separate bullet points for comment and close commands in the explicit job IDs example, matching the auto-discovery format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

hughdbrown · 2026-03-15T12:55:38Z

I rebased the commits so that changes to roborev-fix skill all come first and then I deleted all commits that added and changed tools/ directory. On my local branch, the changes look like:

2026-03-15 06:47 -0600 Hugh Brown         o [autoresearch/mar14] experiment: align explicit_job_ids example format with auto-discovery
2026-03-15 06:47 -0600 Hugh Brown         o experiment: condense IMPORTANT section for brevity
2026-03-15 06:47 -0600 Hugh Brown         o experiment: make git show conditional to reduce unnecessary tool calls
2026-03-15 06:47 -0600 Hugh Brown         o experiment: improve comment guidance to reference severity and skipped findings
2026-03-15 06:47 -0600 Hugh Brown         o experiment: strengthen severity prioritization and file grouping guidance
2026-03-15 06:47 -0600 Hugh Brown         o experiment: separate comment and close into distinct commands

All of the roborev-ci combined reviews should now be commenting on files that are effectively deleted or "never existed".

roborev-ci · 2026-03-15T13:04:27Z

roborev: Combined Review (`72014a0`)

Verdict: The updates to the roborev-fix skill instructions introduce a potential workflow regression by removing the execution failure guard between comment and close commands.

Medium

internal/skills/claude/rob
orev-fix/SKILL.md:107
Changing roborev comment ... && roborev close ... to two independent commands removes the failure guard between them. If roborev comment fails but the agent continues, the review can still be closed without the required summary comment, which is a workflow regression and loses audit trail. Suggested fix: keep the commands separate only for readability in the prose/examples, but
explicitly require "only run roborev close after roborev comment succeeds", or keep && in the executable example.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm · 2026-03-15T13:25:17Z

Cool — one thing that I need to take a closer look at is that I notice that Codex will sometimes try to invoke the fix skill even when presented with code review findings in the review prompt with instructions to "address the findings".

wesm · 2026-03-16T01:17:46Z

closing in favor of #515

hughdbrown and others added 6 commits March 15, 2026 06:47

experiment: improve comment guidance to reference severity and skippe…

67d2414

…d findings Reword comment instructions to emphasize referencing each finding by severity/file and noting intentionally skipped findings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

experiment: make git show conditional to reduce unnecessary tool calls

0378b74

Change step 3 to only run git show when the review output lacks sufficient context, rather than automatically for every review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

experiment: condense IMPORTANT section for brevity

9d5f8dd

Reduce the IMPORTANT section from 5 lines to 2, removing redundant detail while preserving the key directives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

experiment: align explicit_job_ids example format with auto-discovery…

72014a0

… example Use separate bullet points for comment and close commands in the explicit job IDs example, matching the auto-discovery format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

hughdbrown force-pushed the autoresearch/mar14 branch from 27bcada to 72014a0 Compare March 15, 2026 13:01

wesm closed this Mar 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autoresearch/mar14#511

Autoresearch/mar14#511
hughdbrown wants to merge 6 commits intoroborev-dev:mainfrom
hughdbrown:autoresearch/mar14

hughdbrown commented Mar 15, 2026

Uh oh!

roborev-ci bot commented Mar 15, 2026

Uh oh!

roborev-ci bot commented Mar 15, 2026

Uh oh!

hughdbrown commented Mar 15, 2026 •

edited

Loading

Uh oh!

roborev-ci bot commented Mar 15, 2026

Uh oh!

wesm commented Mar 15, 2026

Uh oh!

wesm commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hughdbrown commented Mar 15, 2026

Uh oh!

roborev-ci bot commented Mar 15, 2026

roborev: Combined Review (f000d0c)

Medium

Uh oh!

roborev-ci bot commented Mar 15, 2026

roborev: Combined Review (27bcada)

Medium Severity

Uh oh!

hughdbrown commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

roborev-ci bot commented Mar 15, 2026

roborev: Combined Review (72014a0)

Medium

Uh oh!

wesm commented Mar 15, 2026

Uh oh!

wesm commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

roborev: Combined Review (`f000d0c`)

roborev: Combined Review (`27bcada`)

hughdbrown commented Mar 15, 2026 •

edited

Loading

roborev: Combined Review (`72014a0`)