@@ -7,7 +7,9 @@ architecture and extended with patterns from
 
 The infrastructure is **self-contained** — there are no external eval-framework
 dependencies. A lightweight shell runner (`run-eval.sh`) executes each task's
-reference solution and deterministic grader.
+reference solution and deterministic grader, and can also dispatch tasks to
+AI coding agents (GitHub Copilot CLI or Google Gemini CLI) for end-to-end
+evaluation.
 
 ## Overview
 
@@ -32,6 +34,14 @@ Each task includes:
 
 - Bash 4+
 - `bc` (installed by default on most Linux / macOS systems)
+- Node.js 20+ (for config parsing and agent CLI installation)
+
+**For agent-based evaluation (optional):**
+
+| Agent | Install | Auth |
+|---|---|---|
+| GitHub Copilot | `npm install -g @github/copilot` | Active Copilot subscription; `GITHUB_TOKEN` env var |
+| Google Gemini | `npm install -g @google/gemini-cli` | `GEMINI_API_KEY` env var |
 
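Before a first run, the prerequisites above can be sanity-checked from a shell (a quick illustrative check, not part of `run-eval.sh`):

```bash
# Confirm the prerequisites listed above; all flags used here are standard.
bash --version | head -n 1          # needs Bash 4+
command -v node && node --version   # needs Node.js 20+ for agent runs
echo "2 + 2" | bc                   # prints 4 when bc is available
```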
 ## Running Evals Locally
 
@@ -50,16 +60,73 @@ bash run-eval.sh --all --validate
 bash run-eval.sh grid-basic-setup --validate
 ```
 
+### Run evals against an AI agent
+
+This mode sends the task's `instruction.md` to a coding agent CLI, lets the
+agent generate code in an isolated workspace, then runs the deterministic
+grader on the output.
+
+```bash
+cd evals
+
+# Run all tasks with GitHub Copilot CLI
+bash run-eval.sh --all --agent copilot
+
+# Run a single task with Gemini CLI
+bash run-eval.sh grid-basic-setup --agent gemini
+
+# Run 3 trials per task for statistical robustness
+bash run-eval.sh --all --agent copilot --trials 3
+```
+
 ### npm scripts (convenience wrappers)
 
 ```bash
 cd evals
+
+# Validation (reference solutions)
 npm run validate          # all tasks
 npm run validate:grid     # grid-basic-setup only
 npm run validate:combo    # component-combo-reactive-form only
 npm run validate:theming  # theming-palette-generation only
+
+# Agent-based evaluation
+npm run agent:copilot          # all tasks with Copilot
+npm run agent:copilot:grid     # grid task with Copilot
+npm run agent:gemini           # all tasks with Gemini
+npm run agent:gemini:theming   # theming task with Gemini
+```
+
+## Agent Configuration
+
+Agent settings are stored in `eval-config.json`:
+
+```json
+{
+  "defaultAgent": "copilot",
+  "agents": {
+    "copilot": {
+      "command": "copilot",
+      "installCommand": "npm install -g @github/copilot",
+      "promptArgs": ["-p"],
+      "autoApproveArgs": ["--yes"],
+      "envAuth": "GITHUB_TOKEN"
+    },
+    "gemini": {
+      "command": "gemini",
+      "installCommand": "npm install -g @google/gemini-cli",
+      "promptArgs": ["-p"],
+      "autoApproveArgs": ["--sandbox"],
+      "envAuth": "GEMINI_API_KEY"
+    }
+  },
+  "trialCount": 1,
+  "timeoutSec": 600
+}
 ```
 
+You can customize the agent command, flags, and timeouts by editing this file.
+To switch the default agent, change `defaultAgent`.
+
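Because Node.js is already a prerequisite, a hand-edited config can be sanity-checked inline before a run. A minimal sketch; the field names mirror `eval-config.json`, but the check itself is hypothetical and the values are inlined so the snippet is self-contained (the runner does not perform this validation):

```bash
# Hypothetical pre-flight check for an edited config. The object is inlined
# here for self-containment; in practice you would read eval-config.json.
node -e '
const cfg = { defaultAgent: "gemini", trialCount: 3, timeoutSec: 600 };
if (!["copilot", "gemini"].includes(cfg.defaultAgent)) {
  throw new Error("unknown agent: " + cfg.defaultAgent);
}
console.log("ok:", cfg.defaultAgent + ",", cfg.trialCount, "trial(s)");
'
```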
 ## Adding a New Task
 
 1. Create a directory under `evals/tasks/<task-id>/` with the standard structure:
@@ -95,25 +162,43 @@ npm run validate:theming # theming-palette-generation only
    bash run-eval.sh <task-id> --validate
    ```
 
+7. Test against at least one agent:
+
+   ```bash
+   bash run-eval.sh <task-id> --agent copilot
+   ```
+
 ## Pass / Fail Thresholds
 
 Following [Anthropic's recommendations](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents):
 
 | Metric | Threshold | Effect |
 |---|---|---|
-| `pass@5 ≥ 80%` | **Merge gate** | At least 1 success in 5 trials required |
-| `pass^5 ≥ 60%` | **Tracked** | Flags flaky skills for investigation |
-| `pass@5 < 60%` | **Blocks merge** | On PRs touching the relevant skill |
+| `pass@k ≥ 80%` | **Merge gate** | At least 1 success in k trials required |
+| `pass@k ≥ 60%` | **Tracked** | Flags flaky skills for investigation |
+| `pass@k < 60%` | **Blocks merge** | On PRs touching the relevant skill |
 
 ## CI Integration
 
-The GitHub Actions workflow at `.github/workflows/skill-eval.yml` runs
-automatically on PRs that modify `skills/**` or `evals/**`. It:
+The GitHub Actions workflow at `.github/workflows/skill-eval.yml` provides two
+evaluation modes:
 
-1. Checks out the repo
-2. Validates all graders against their reference solutions
-3. Uploads results as an artifact
-4. Posts a summary comment on the PR
+### Automatic (on PR)
+Runs on every PR that modifies `skills/**` or `evals/**`:
+1. Validates all graders against their reference solutions
+2. Uploads results as an artifact
+3. Posts a summary comment on the PR
+
+### Manual (workflow_dispatch)
+Triggered manually from the Actions tab to run agent-based evaluation:
+1. Accepts the agent (`copilot` or `gemini`) and the number of trials as inputs
+2. Installs the selected agent CLI
+3. Runs all tasks against the agent
+4. Uploads results as an artifact
+
+**Secrets required for agent-based CI:**
+- `GITHUB_TOKEN` — automatically available (for Copilot)
+- `GEMINI_API_KEY` — must be added as a repository secret (for Gemini)
 
 ## Grading Strategy
 
@@ -135,3 +220,7 @@ automatically on PRs that modify `skills/**` or `evals/**`. It:
 Baseline results are stored in `evals/results/baseline.json` and used for
 regression comparison on PRs. The CI workflow uploads per-run results as
 GitHub Actions artifacts.
+
+Agent-based results are suffixed with the agent name (e.g.,
+`grid-basic-setup-copilot.json`) to distinguish them from reference
+validation results.
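Concretely, the suffix rule gives filenames like these (the `evals/results/` paths are assembled here for illustration; the doc only fixes the suffix):

```bash
# Result filenames for one task under the agent-suffix convention above.
task="grid-basic-setup"
echo "evals/results/${task}-copilot.json"   # Copilot agent run
echo "evals/results/${task}-gemini.json"    # Gemini agent run
```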