Commit b181ca0

Copilot and kdinev committed
feat: add copilot-cli and gemini-cli agent modes to eval runner
Co-authored-by: kdinev <1472513+kdinev@users.noreply.github.com>
1 parent 5da6711 commit b181ca0

5 files changed

Lines changed: 521 additions & 37 deletions

.github/workflows/skill-eval.yml

Lines changed: 83 additions & 12 deletions
@@ -5,15 +5,31 @@ on:
     paths:
       - 'skills/**'
       - 'evals/**'
+  workflow_dispatch:
+    inputs:
+      agent:
+        description: 'Agent to run evals against (copilot or gemini)'
+        required: true
+        default: 'copilot'
+        type: choice
+        options:
+          - copilot
+          - gemini
+      trials:
+        description: 'Number of trials per task'
+        required: false
+        default: '1'
+        type: string
 
 permissions:
   contents: read
   pull-requests: write
 
 jobs:
-  skill_eval:
+  # Job 1: Always validate graders against reference solutions
+  validate_graders:
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    timeout-minutes: 10
 
     steps:
       - name: Checkout repository
@@ -28,16 +44,70 @@ jobs:
         working-directory: evals
         run: bash run-eval.sh --all --validate
 
-      - name: Upload results
+      - name: Upload validation results
         if: always()
         uses: actions/upload-artifact@v4
         with:
-          name: skill-eval-results
+          name: skill-eval-validation-results
           path: evals/results/
           retention-days: 30
 
+  # Job 2: Run evals against an AI agent (copilot or gemini)
+  # Triggered manually via workflow_dispatch, or can be called from other workflows
+  agent_eval:
+    if: github.event_name == 'workflow_dispatch'
+    runs-on: ubuntu-latest
+    timeout-minutes: 60
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '22'
+
+      - name: Install Copilot CLI
+        if: inputs.agent == 'copilot'
+        run: npm install -g @github/copilot
+
+      - name: Install Gemini CLI
+        if: inputs.agent == 'gemini'
+        run: npm install -g @google/gemini-cli
+
+      - name: Run agent-based eval
+        working-directory: evals
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+        run: |
+          bash run-eval.sh --all \
+            --agent ${{ inputs.agent }} \
+            --trials ${{ inputs.trials || '1' }}
+
+      - name: Upload agent eval results
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: skill-eval-agent-${{ inputs.agent }}-results
+          path: evals/results/
+          retention-days: 30
+
+  # Job 3: Post summary comment on PRs
+  post_summary:
+    if: always() && github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork == false
+    needs: [validate_graders]
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Download validation results
+        uses: actions/download-artifact@v4
+        with:
+          name: skill-eval-validation-results
+          path: evals/results/
+
       - name: Post summary comment
-        if: always() && github.event_name == 'pull_request' && github.event.pull_request.head.repo.fork == false
         uses: actions/github-script@v7
         with:
           script: |
@@ -52,26 +122,27 @@ jobs:
               if (files.length === 0) {
                 summary += '> ⚠️ No eval results found. The eval run may have failed.\n';
               } else {
-                summary += '| Task | Pass Rate | pass@5 | Status |\n';
-                summary += '|---|---|---|---|\n';
+                summary += '| Task | Agent | Pass Rate | pass@k | Status |\n';
+                summary += '|---|---|---|---|---|\n';
 
                 for (const file of files) {
                   try {
                     const data = JSON.parse(fs.readFileSync(path.join(resultsDir, file), 'utf8'));
                     const taskName = data.task || file.replace('.json', '');
+                    const agent = data.agent || 'reference';
                     const passRate = data.passRate != null ? `${(data.passRate * 100).toFixed(0)}%` : 'N/A';
                     const passAtK = data.passAtK != null ? `${(data.passAtK * 100).toFixed(0)}%` : 'N/A';
                     const status = data.passAtK >= 0.8 ? '✅' : data.passAtK >= 0.6 ? '⚠️' : '❌';
-                    summary += `| ${taskName} | ${passRate} | ${passAtK} | ${status} |\n`;
+                    summary += `| ${taskName} | ${agent} | ${passRate} | ${passAtK} | ${status} |\n`;
                   } catch (e) {
-                    summary += `| ${file} | Error | Error | ❌ |\n`;
+                    summary += `| ${file} | — | Error | Error | ❌ |\n`;
                   }
                 }
 
                 summary += '\n### Thresholds\n';
-                summary += '- ✅ `pass@5 ≥ 80%` — merge gate passed\n';
-                summary += '- ⚠️ `pass@5 ≥ 60%` — needs investigation\n';
-                summary += '- ❌ `pass@5 < 60%` — blocks merge for affected skill\n';
+                summary += '- ✅ `pass@k ≥ 80%` — merge gate passed\n';
+                summary += '- ⚠️ `pass@k ≥ 60%` — needs investigation\n';
+                summary += '- ❌ `pass@k < 60%` — blocks merge for affected skill\n';
               }
             } catch (e) {
               summary += `> ⚠️ Could not read results: ${e.message}\n`;
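The per-file row logic in the hunk above can be read as a small pure function. The sketch below is a standalone restatement for illustration only, not code from the commit: `renderRow` and `formatPercent` are hypothetical names, and the thresholds are copied from the hunk.

```javascript
// Hypothetical standalone restatement of the row-building logic in the hunk above.
// `renderRow` and `formatPercent` are illustrative names, not part of the workflow.
function formatPercent(value) {
  // null or undefined means the result file did not report the metric.
  return value != null ? `${(value * 100).toFixed(0)}%` : 'N/A';
}

function renderRow(data, fallbackName) {
  const taskName = data.task || fallbackName;
  const agent = data.agent || 'reference'; // validation runs carry no agent field
  // Same thresholds as the workflow: >= 0.8 passes, >= 0.6 warns, otherwise fails.
  // A missing passAtK compares false both times, so it ends up with the failing status.
  const status = data.passAtK >= 0.8 ? '✅' : data.passAtK >= 0.6 ? '⚠️' : '❌';
  return `| ${taskName} | ${agent} | ${formatPercent(data.passRate)} | ${formatPercent(data.passAtK)} | ${status} |`;
}

console.log(renderRow({ task: 'grid-basic-setup', agent: 'copilot', passRate: 1, passAtK: 1 }, 'grid-basic-setup'));
console.log(renderRow({ passRate: 0.5 }, 'theming-palette-generation'));
```

Note that a result file without `passAtK` renders `N/A` in the pass@k column but still gets the failing status, which matches how the workflow expression evaluates.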

evals/README.md

Lines changed: 99 additions & 10 deletions
@@ -7,7 +7,9 @@ architecture and extended with patterns from
 
 The infrastructure is **self-contained** — there are no external eval-framework
 dependencies. A lightweight shell runner (`run-eval.sh`) executes each task's
-reference solution and deterministic grader.
+reference solution and deterministic grader, and can also dispatch tasks to
+AI coding agents (GitHub Copilot CLI or Google Gemini CLI) for end-to-end
+evaluation.
 
 ## Overview
 
@@ -32,6 +34,14 @@ Each task includes:
 
 - Bash 4+
 - `bc` (installed by default on most Linux / macOS systems)
+- Node.js 20+ (for config parsing and agent CLI installation)
+
+**For agent-based evaluation (optional):**
+
+| Agent | Install | Auth |
+|---|---|---|
+| GitHub Copilot | `npm install -g @github/copilot` | Active Copilot subscription; `GITHUB_TOKEN` env var |
+| Google Gemini | `npm install -g @google/gemini-cli` | `GEMINI_API_KEY` env var |
 
 ## Running Evals Locally
 
@@ -50,16 +60,73 @@ bash run-eval.sh --all --validate
 bash run-eval.sh grid-basic-setup --validate
 ```
 
+### Run evals against an AI agent
+
+Send the `instruction.md` to a coding agent CLI, let the agent generate code
+in an isolated workspace, then run the deterministic grader on the output.
+
+```bash
+cd evals
+
+# Run all tasks with GitHub Copilot CLI
+bash run-eval.sh --all --agent copilot
+
+# Run a single task with Gemini CLI
+bash run-eval.sh grid-basic-setup --agent gemini
+
+# Run 3 trials per task for statistical robustness
+bash run-eval.sh --all --agent copilot --trials 3
+```
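The three-step flow described above (read `instruction.md`, let the agent write into an isolated workspace, grade the output) can be sketched in Node.js. This is an illustration under assumptions, not the logic of `run-eval.sh`, which is a shell script not shown in this diff: `runTrial`, `runAgent`, and `grade` are hypothetical names, and the agent and grader are passed in as callbacks so the sketch stays independent of any real CLI.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical sketch of one agent trial. `runAgent(instruction, workspace)` stands in
// for invoking an agent CLI, and `grade(workspace)` for the deterministic grader;
// both are assumptions made for illustration.
function runTrial(taskDir, runAgent, grade) {
  // A fresh temporary directory keeps each trial isolated from the repo and other trials.
  const workspace = fs.mkdtempSync(path.join(os.tmpdir(), 'eval-'));
  try {
    const instruction = fs.readFileSync(path.join(taskDir, 'instruction.md'), 'utf8');
    runAgent(instruction, workspace);
    return grade(workspace);
  } finally {
    // Always clean up, even if the agent or grader throws.
    fs.rmSync(workspace, { recursive: true, force: true });
  }
}
```

Passing the agent and grader as functions makes it possible to exercise the flow with stubs before wiring in a real CLI call.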
+
 ### npm scripts (convenience wrappers)
 
 ```bash
 cd evals
+
+# Validation (reference solutions)
 npm run validate # all tasks
 npm run validate:grid # grid-basic-setup only
 npm run validate:combo # component-combo-reactive-form only
 npm run validate:theming # theming-palette-generation only
+
+# Agent-based evaluation
+npm run agent:copilot # all tasks with Copilot
+npm run agent:copilot:grid # grid task with Copilot
+npm run agent:gemini # all tasks with Gemini
+npm run agent:gemini:theming # theming task with Gemini
+```
+
+## Agent Configuration
+
+Agent settings are stored in `eval-config.json`:
+
+```json
+{
+  "defaultAgent": "copilot",
+  "agents": {
+    "copilot": {
+      "command": "copilot",
+      "installCommand": "npm install -g @github/copilot",
+      "promptArgs": ["-p"],
+      "autoApproveArgs": ["--yes"],
+      "envAuth": "GITHUB_TOKEN"
+    },
+    "gemini": {
+      "command": "gemini",
+      "installCommand": "npm install -g @google/gemini-cli",
+      "promptArgs": ["-p"],
+      "autoApproveArgs": ["--sandbox"],
+      "envAuth": "GEMINI_API_KEY"
+    }
+  },
+  "trialCount": 1,
+  "timeoutSec": 600
+}
 ```
 
+You can customize the agent command, flags, and timeouts by editing this file.
+To switch the default agent, change `defaultAgent`.
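As a rough illustration of how these fields could combine into a command line, the sketch below builds an argv array from an agent entry. It assumes the flags are passed in the order auto-approve, prompt flag, prompt text; the actual assembly lives in `run-eval.sh`, which this diff does not show, and `buildAgentCommand` is a hypothetical name.

```javascript
// Hypothetical helper: turn an eval-config.json agent entry into a command and argv.
// The flag ordering here is an assumption; run-eval.sh may assemble it differently.
function buildAgentCommand(config, agentName, prompt) {
  const name = agentName || config.defaultAgent; // fall back to the configured default
  const agent = config.agents[name];
  if (!agent) {
    const known = Object.keys(config.agents).join(', ');
    throw new Error(`Unknown agent "${name}"; expected one of: ${known}`);
  }
  return {
    command: agent.command,
    args: [...agent.autoApproveArgs, ...agent.promptArgs, prompt],
    timeoutMs: config.timeoutSec * 1000, // timeoutSec is given in seconds
  };
}

// Values copied from the configuration shown above.
const exampleConfig = {
  defaultAgent: 'copilot',
  agents: {
    copilot: { command: 'copilot', promptArgs: ['-p'], autoApproveArgs: ['--yes'], envAuth: 'GITHUB_TOKEN' },
    gemini: { command: 'gemini', promptArgs: ['-p'], autoApproveArgs: ['--sandbox'], envAuth: 'GEMINI_API_KEY' },
  },
  trialCount: 1,
  timeoutSec: 600,
};

console.log(buildAgentCommand(exampleConfig, undefined, 'Build a basic grid'));
console.log(buildAgentCommand(exampleConfig, 'gemini', 'Generate a palette'));
```

Keeping the arguments as an array rather than a single string avoids shell-quoting problems when the prompt contains spaces or quotes.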
+
 ## Adding a New Task
 
 1. Create a directory under `evals/tasks/<task-id>/` with the standard structure:
@@ -95,25 +162,43 @@ npm run validate:theming # theming-palette-generation only
 bash run-eval.sh <task-id> --validate
 ```
 
+7. Test against at least one agent:
+
+```bash
+bash run-eval.sh <task-id> --agent copilot
+```
+
 ## Pass / Fail Thresholds
 
 Following [Anthropic's recommendations](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents):
 
 | Metric | Threshold | Effect |
 |---|---|---|
-| `pass@5 ≥ 80%` | **Merge gate** | At least 1 success in 5 trials required |
-| `pass^5 ≥ 60%` | **Tracked** | Flags flaky skills for investigation |
-| `pass@5 < 60%` | **Blocks merge** | On PRs touching the relevant skill |
+| `pass@k ≥ 80%` | **Merge gate** | At least 1 success in k trials required |
+| `pass@k ≥ 60%` | **Tracked** | Flags flaky skills for investigation |
+| `pass@k < 60%` | **Blocks merge** | On PRs touching the relevant skill |
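Read top to bottom, the three rows in the updated table form a simple cascade. The sketch below shows one way to express that cascade; `classifyPassAtK` and its return labels are hypothetical, and it assumes `pass@k` is stored as a fraction between 0 and 1, as the workflow's summary script does.

```javascript
// Hypothetical classifier mirroring the threshold table above.
// Assumes passAtK is a fraction in [0, 1]; the labels are illustrative only.
function classifyPassAtK(passAtK) {
  if (typeof passAtK !== 'number' || Number.isNaN(passAtK)) {
    return 'unknown'; // no usable metric reported
  }
  if (passAtK >= 0.8) return 'merge-gate-passed';
  if (passAtK >= 0.6) return 'needs-investigation';
  return 'blocks-merge';
}

for (const value of [1, 0.8, 0.75, 0.6, 0.2, undefined]) {
  console.log(value, classifyPassAtK(value));
}
```

Checking the higher threshold first matters: any value at or above 80% also satisfies the 60% condition, so the order of the checks decides which label wins.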
 
 ## CI Integration
 
-The GitHub Actions workflow at `.github/workflows/skill-eval.yml` runs
-automatically on PRs that modify `skills/**` or `evals/**`. It:
+The GitHub Actions workflow at `.github/workflows/skill-eval.yml` provides two
+evaluation modes:
 
-1. Checks out the repo
-2. Validates all graders against their reference solutions
-3. Uploads results as an artifact
-4. Posts a summary comment on the PR
+### Automatic (on PR)
+Runs on every PR that modifies `skills/**` or `evals/**`:
+1. Validates all graders against their reference solutions
+2. Uploads results as an artifact
+3. Posts a summary comment on the PR
+
+### Manual (workflow_dispatch)
+Triggered manually from the Actions tab to run agent-based evaluation:
+1. Select the agent (`copilot` or `gemini`) and number of trials
+2. Installs the selected agent CLI
+3. Runs all tasks against the agent
+4. Uploads results as an artifact
+
+**Secrets required for agent-based CI:**
+- `GITHUB_TOKEN` — automatically available (for Copilot)
+- `GEMINI_API_KEY` — must be added as a repository secret (for Gemini)
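Because each agent names its credential through `envAuth`, a runner could fail fast when the variable is absent instead of waiting for the CLI to reject the request. The following is a minimal sketch of such a pre-flight check; `missingAuthVar` is a hypothetical helper and nothing in this diff confirms that `run-eval.sh` performs an equivalent check.

```javascript
// Hypothetical pre-flight check: returns the name of the missing auth variable,
// or null when the variable named by `envAuth` is set to a non-blank value.
function missingAuthVar(agentConfig, env = process.env) {
  const name = agentConfig.envAuth;
  const value = env[name];
  const isSet = typeof value === 'string' && value.trim() !== '';
  return isSet ? null : name;
}

// Example with an explicit environment object instead of process.env.
const geminiAgent = { command: 'gemini', envAuth: 'GEMINI_API_KEY' };
console.log(missingAuthVar(geminiAgent, {}));
console.log(missingAuthVar(geminiAgent, { GEMINI_API_KEY: 'example-key' }));
```

Treating a blank value as missing is deliberate: an unset repository secret typically expands to an empty string in a workflow's `env` block.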
 
 ## Grading Strategy
 
@@ -135,3 +220,7 @@ automatically on PRs that modify `skills/**` or `evals/**`. It:
 Baseline results are stored in `evals/results/baseline.json` and used for
 regression comparison on PRs. The CI workflow uploads per-run results as
 GitHub Actions artifacts.
+
+Agent-based results are suffixed with the agent name (e.g.,
+`grid-basic-setup-copilot.json`) to distinguish them from reference
+validation results.
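The naming rule above is simple enough to state as a function. In this sketch, `resultFileName` is a hypothetical name, and the assumption that reference validation results use the bare task ID (no suffix) is inferred from the README wording rather than confirmed by the runner's source.

```javascript
// Hypothetical helper for the naming convention described above.
// Assumption: reference validation results are written as "<task-id>.json".
function resultFileName(taskId, agent) {
  return agent ? `${taskId}-${agent}.json` : `${taskId}.json`;
}

console.log(resultFileName('grid-basic-setup', 'copilot'));
console.log(resultFileName('grid-basic-setup'));
```

Since results from both modes land in `evals/results/`, distinct file names keep an agent run from overwriting a validation result for the same task.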

evals/eval-config.json

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+{
+  "defaultAgent": "copilot",
+  "agents": {
+    "copilot": {
+      "command": "copilot",
+      "installCommand": "npm install -g @github/copilot",
+      "promptArgs": ["-p"],
+      "autoApproveArgs": ["--yes"],
+      "envAuth": "GITHUB_TOKEN",
+      "description": "GitHub Copilot CLI (requires active Copilot subscription)"
+    },
+    "gemini": {
+      "command": "gemini",
+      "installCommand": "npm install -g @google/gemini-cli",
+      "promptArgs": ["-p"],
+      "autoApproveArgs": ["--sandbox"],
+      "envAuth": "GEMINI_API_KEY",
+      "description": "Google Gemini CLI (requires GEMINI_API_KEY)"
+    }
+  },
+  "trialCount": 1,
+  "timeoutSec": 600
+}
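Since this file is meant to be edited by hand, a small shape check can catch mistakes such as a `defaultAgent` that has no matching entry. The sketch below is one possible validator; `validateEvalConfig` is a hypothetical name, and the rules it enforces are inferred from the fields shown above, not taken from the runner.

```javascript
// Hypothetical validator for eval-config.json. Returns a list of human-readable
// problems; an empty list means the config matches the shape shown above.
function validateEvalConfig(config) {
  const problems = [];
  const agents = config.agents || {};

  if (!Object.prototype.hasOwnProperty.call(agents, config.defaultAgent)) {
    problems.push(`defaultAgent "${config.defaultAgent}" is not defined under agents`);
  }

  for (const [name, agent] of Object.entries(agents)) {
    for (const key of ['command', 'envAuth']) {
      if (typeof agent[key] !== 'string' || agent[key] === '') {
        problems.push(`${name}: ${key} must be a non-empty string`);
      }
    }
    for (const key of ['promptArgs', 'autoApproveArgs']) {
      if (!Array.isArray(agent[key])) {
        problems.push(`${name}: ${key} must be an array`);
      }
    }
  }

  for (const key of ['trialCount', 'timeoutSec']) {
    if (!Number.isInteger(config[key]) || config[key] < 1) {
      problems.push(`${key} must be a positive integer`);
    }
  }
  return problems;
}

// Usage against a deliberately broken config.
console.log(validateEvalConfig({
  defaultAgent: 'claude',
  agents: { gemini: { command: 'gemini', promptArgs: ['-p'], autoApproveArgs: '--sandbox', envAuth: 'GEMINI_API_KEY' } },
  trialCount: 0,
  timeoutSec: 600,
}));
```

Returning every problem at once, rather than throwing on the first, lets someone editing the file fix all issues in a single pass.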

evals/package.json

Lines changed: 9 additions & 1 deletion
@@ -12,7 +12,15 @@
     "validate": "bash run-eval.sh --all --validate",
     "validate:grid": "bash run-eval.sh grid-basic-setup --validate",
     "validate:combo": "bash run-eval.sh component-combo-reactive-form --validate",
-    "validate:theming": "bash run-eval.sh theming-palette-generation --validate"
+    "validate:theming": "bash run-eval.sh theming-palette-generation --validate",
+    "agent:copilot": "bash run-eval.sh --all --agent copilot",
+    "agent:copilot:grid": "bash run-eval.sh grid-basic-setup --agent copilot",
+    "agent:copilot:combo": "bash run-eval.sh component-combo-reactive-form --agent copilot",
+    "agent:copilot:theming": "bash run-eval.sh theming-palette-generation --agent copilot",
+    "agent:gemini": "bash run-eval.sh --all --agent gemini",
+    "agent:gemini:grid": "bash run-eval.sh grid-basic-setup --agent gemini",
+    "agent:gemini:combo": "bash run-eval.sh component-combo-reactive-form --agent gemini",
+    "agent:gemini:theming": "bash run-eval.sh theming-palette-generation --agent gemini"
   },
   "engines": {
     "node": ">=20.0.0"
