We have three Skills (igniteui-angular-components, igniteui-angular-grids, igniteui-angular-theming) that teach coding agents how to correctly select, configure, and compose Ignite UI for Angular components. As these skills grow in complexity and more developers rely on them, silent regressions become a real risk: rewording a step, reordering routing logic, or removing a "verify" clause can quietly degrade agent behavior with no signal until a user reports a wrong output.
This work item establishes a structured eval process for these skills, directly inspired by Minko Gechev's Skill Eval framework and extended with patterns from Anthropic's agent eval research and the Skills Best Practices guide.
Goals
- Produce a measurable, repeatable quality score for each skill.
- Detect regressions automatically when a skill file is modified in a PR.
- Provide a feedback loop during skill authoring (edit → eval → score delta).
- Establish pass/fail thresholds that gate merges to main.
Approach
Tooling: Adopt the skill-eval TypeScript framework as the eval runner. It supports Docker-isolated agent execution, deterministic shell graders, LLM rubric graders, multi-trial runs, and JSON result persistence — all the properties needed here.
Task Structure
Create an evals/ directory at the repo root. Each eval task is a self-contained directory, for example:
evals/
├── tasks/
│   ├── grid-basic-setup/
│   │   ├── task.toml                 # timeouts, grader weights, trial count
│   │   ├── instruction.md            # what the agent is asked to do
│   │   ├── environment/Dockerfile    # clean Angular project baseline
│   │   ├── tests/test.sh             # deterministic grader (file checks, compile, lint)
│   │   ├── prompts/quality.md        # LLM rubric grader questions
│   │   ├── solution/solve.sh         # reference solution for baseline
│   │   └── skills/                   # symlinks or copies of the skills under test
│   │       └── igniteui-angular-grids/SKILL.md
│   ├── grid-sorting-remote-data/
│   ├── grid-hierarchical-setup/
│   ├── grid-pivot-config/
│   ├── component-combo-reactive-form/
│   ├── component-date-picker-validation/
│   ├── component-dialog-service/
│   ├── theming-palette-generation/
│   ├── theming-component-override/
│   └── skill-routing-intent-detection/   # tests the SKILL.md router logic itself
├── package.json
└── README.md
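As a rough illustration of what a task's settings might carry, the TypeScript sketch below models the values the layout attributes to task.toml (timeouts, grader weights, trial count) and checks them for basic sanity. The field names (`timeoutSec`, `trials`, `weights`) and the timeout value are hypothetical; they are not taken from the skill-eval framework's actual schema.

```typescript
// Hypothetical shape for the settings kept in task.toml.
// Field names are illustrative assumptions, not the skill-eval schema.
interface TaskConfig {
  id: string;
  timeoutSec: number;
  trials: number;
  weights: { deterministic: number; rubric: number };
}

// Returns a list of human-readable problems; an empty list means the config looks sane.
function validateTaskConfig(cfg: TaskConfig): string[] {
  const problems: string[] = [];
  if (cfg.timeoutSec <= 0) {
    problems.push(`${cfg.id}: timeoutSec must be positive`);
  }
  if (Math.floor(cfg.trials) !== cfg.trials || cfg.trials < 1) {
    problems.push(`${cfg.id}: trials must be a positive integer`);
  }
  const { deterministic, rubric } = cfg.weights;
  if (deterministic < 0 || rubric < 0) {
    problems.push(`${cfg.id}: weights must be non-negative`);
  }
  // Allow a small tolerance for floating-point rounding.
  if (Math.abs(deterministic + rubric - 1) > 1e-9) {
    problems.push(`${cfg.id}: weights must sum to 1`);
  }
  return problems;
}

// Example values: 5 trials and a 60/40 split; the timeout is an arbitrary placeholder.
const gridBasicSetup: TaskConfig = {
  id: "grid-basic-setup",
  timeoutSec: 600,
  trials: 5,
  weights: { deterministic: 0.6, rubric: 0.4 },
};
```

Validating the settings up front keeps a mistyped weight or trial count from silently skewing scores across a whole eval run.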
Tasks to Implement (per Skill)
igniteui-angular-grids skill (highest priority — most complex routing)
| Task ID | Instruction given to agent | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| grid-basic-setup | "Add a data grid showing employee data with sorting and pagination" | Project compiles; <igx-grid> present in template; correct module imported | Did agent choose IgxGrid (not Tree/Hierarchical) for flat data? Did it configure [data] binding correctly? |
| grid-tree-vs-flat | "Display department data with nested child rows" | <igx-tree-grid> present; childDataKey configured | Did skill routing correctly select Tree Grid over flat Grid? |
| grid-hierarchical-setup | "Build a master-detail grid where clicking a row expands child orders" | <igx-hierarchical-grid> + <igx-row-island> present | Did agent configure load-on-demand vs inline data correctly based on instructions? |
| grid-remote-filtering | "Add server-side filtering and sorting to the grid" | [filterMode]="'externalFilterMode'" set; remote service stub present | Did agent wire onDataPreLoad/sortingExpressionsChange instead of local filtering? |
| grid-pivot-config | "Create a pivot grid with row/column/value dimensions" | <igx-pivot-grid> + IgxPivotConfiguration present | Did agent define rows, columns, values correctly vs a flat grid with groupBy? |
| grid-state-persistence | "Persist grid sorting and filtering state to localStorage" | IgxGridStateDirective present; serialize/restore calls present | Did agent use the state directive vs manually serializing expressions? |
igniteui-angular-components skill
| Task ID | Instruction | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| component-combo-reactive-form | "Add a multi-select combo bound to a reactive form control" | <igx-combo> present; [formControlName] wired; module imported | Did agent use IgxCombo (not IgxSelect or native <select>) for multi-select? |
| component-date-picker-validation | "Add a date picker with min/max date validation" | <igx-date-picker> present; minValue/maxValue inputs set | Did agent avoid using native <input type=date>? Did it correctly set validators? |
| component-dialog-service | "Show a confirmation dialog when the user clicks Delete" | IgxDialogComponent or service open call present | Did agent use the Dialog component/service vs a custom modal div? |
| component-chart-selection | "Display monthly sales as a bar chart" | <igx-category-chart> or <igx-bar-chart> present | Did agent pick the correct chart type (Bar vs Column vs Line) per the skill's intent detection? |
igniteui-angular-theming skill
| Task ID | Instruction | Deterministic check | LLM rubric check |
| --- | --- | --- | --- |
| theming-palette-generation | "Create a custom blue/orange branded theme" | palette() call with $primary/$secondary; @include theme() present | Did agent use palette() correctly vs hardcoding CSS variables? Did it call core() before theme()? |
| theming-component-override | "Change only the IgxButton background color without affecting the rest of the theme" | button-theme() mixin call present; scoped to component | Did agent use a component-level theme override vs overriding the global palette? |
| theming-mcp-tool-invocation | "Use the MCP server to generate a palette and scaffold a grid theme" | MCP tool call in transcript | Did agent invoke the MCP tool rather than writing SCSS manually? |
Cross-skill / routing tasks
| Task ID | Instruction | What's tested |
| --- | --- | --- |
| skill-routing-intent-detection | Various ambiguous prompts ("add a table", "style my app", "show nested data") | Tests whether the SKILL.md router in each skill fires the correct sub-skill path rather than hallucinating a generic Angular solution |
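One way to picture the routing task is as a small table of prompts and the skill each is expected to trigger, scored against what the agent actually loaded. The sketch below is only an illustration: the expected routes are assumptions inferred from the prompt wording, and `observed` is a hypothetical input that a harness would have to extract from the agent transcript.

```typescript
type Skill =
  | "igniteui-angular-components"
  | "igniteui-angular-grids"
  | "igniteui-angular-theming";

interface RoutingCase {
  prompt: string;
  expectedSkill: Skill;
}

// Expected routes are assumptions inferred from the prompt wording.
const routingCases: RoutingCase[] = [
  { prompt: "add a table", expectedSkill: "igniteui-angular-grids" },
  { prompt: "style my app", expectedSkill: "igniteui-angular-theming" },
  { prompt: "show nested data", expectedSkill: "igniteui-angular-grids" },
];

// Fraction of cases where the skill observed in the transcript matches the expectation.
// A prompt missing from `observed` counts as a miss (for example, a generic Angular answer).
function routingAccuracy(
  cases: RoutingCase[],
  observed: Record<string, Skill | undefined>,
): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) => observed[c.prompt] === c.expectedSkill).length;
  return hits / cases.length;
}
```

Treating a missing entry as a miss means an agent that ignores the skills entirely scores zero rather than being skipped.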
Grading Strategy
Deterministic grader (tests/test.sh) — runs after the agent finishes and checks:
- Project builds without errors (ng build)
- Correct Ignite UI selector is present in the generated template
- Required module or standalone import exists
- No use of forbidden alternatives (e.g., native <table> or <select> when the skill mandates an Ignite UI component)
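The selector and forbidden-alternative checks are simple string tests. The actual grader is the shell script tests/test.sh; the TypeScript sketch below only mirrors that logic for illustration and leaves out the build and import checks. `TemplateCheck`, `hasTag`, and `checkTemplate` are hypothetical names.

```typescript
interface TemplateCheck {
  requiredSelector: string; // e.g. "igx-grid"
  forbiddenTags: string[];  // e.g. ["table", "select"]
}

interface CheckResult {
  passed: boolean;
  failures: string[];
}

// Matches an opening tag such as <igx-grid ...> or <table>, case-insensitively.
// Requiring whitespace, ">" or "/" after the name avoids matching <igx-grid-toolbar>.
function hasTag(template: string, tag: string): boolean {
  return new RegExp(`<${tag}(\\s|>|/)`, "i").test(template);
}

function checkTemplate(template: string, check: TemplateCheck): CheckResult {
  const failures: string[] = [];
  if (!hasTag(template, check.requiredSelector)) {
    failures.push(`missing <${check.requiredSelector}>`);
  }
  for (const tag of check.forbiddenTags) {
    if (hasTag(template, tag)) {
      failures.push(`forbidden <${tag}> used`);
    }
  }
  return { passed: failures.length === 0, failures };
}
```

Returning the list of failures, not just a boolean, makes it easier to see in the results which clause of the skill the agent ignored.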
LLM rubric grader (prompts/quality.md) — evaluates the agent transcript for:
- Correct intent routing (did the skill's decision logic fire?)
- Idiomatic API usage (inputs, outputs, bindings as documented)
- Absence of hallucinated APIs (wrong input names, non-existent outputs)
- Following the skill's "prefer X over Y" guidance
Combined score: each task uses a weighted average, e.g. 60% deterministic + 40% rubric. Weights are configurable per task.toml.
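The weighted average can be sketched as follows. This is a minimal illustration that assumes both graders report a score between 0 and 1; the type and function names are hypothetical and not part of the skill-eval framework.

```typescript
interface GraderScores {
  deterministic: number; // assumed to be in the range 0..1
  rubric: number;        // assumed to be in the range 0..1
}

interface GraderWeights {
  deterministic: number;
  rubric: number;
}

// Weighted average of the two grader scores, normalised by the total weight
// so that weights such as 60/40 and 0.6/0.4 give the same result.
function combinedScore(scores: GraderScores, weights: GraderWeights): number {
  const total = weights.deterministic + weights.rubric;
  if (total <= 0) {
    throw new Error("weights must sum to a positive number");
  }
  return (
    (scores.deterministic * weights.deterministic + scores.rubric * weights.rubric) / total
  );
}

// A task that passes every deterministic check but earns half the rubric points
// lands at roughly 0.8 with the 60/40 split.
const exampleScore = combinedScore(
  { deterministic: 1, rubric: 0.5 },
  { deterministic: 0.6, rubric: 0.4 },
);
```

Normalising by the total weight keeps the result on the same 0..1 scale even if a task's weights do not sum exactly to one.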
Eval Execution & Pass/Fail Thresholds
Following Anthropic's recommendations on agent evals:
- Minimum 5 trials per task: agent behavior is non-deterministic, so a single run is meaningless.
- pass@5 ≥ 80% is the gate for merging skill changes (can the agent solve it at least once in 5 tries?).
- pass^5 ≥ 60% is tracked but not blocking (does the agent solve it in all 5 tries?); it is used to flag flaky skills that need clarification.
- A task scoring below pass@5 = 60% on a PR that touches the relevant skill blocks merge.
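The two metrics can be sketched as follows, under one stated assumption: each percentage is read as the share of tasks in the suite that meet the per-task condition (at least one passing trial for pass@5, all trials passing for pass^5). The thresholds are exposed as parameters, the per-task 60% rule is not modelled, and all names are hypothetical.

```typescript
// One boolean per trial: true when the trial passed.
type TrialOutcomes = boolean[];

// pass@k for one task: at least one of the k trials succeeded.
function passAtK(trials: TrialOutcomes): boolean {
  return trials.some((ok) => ok);
}

// pass^k for one task: every one of the k trials succeeded.
function passHatK(trials: TrialOutcomes): boolean {
  return trials.length > 0 && trials.every((ok) => ok);
}

// Share of tasks (0..1) that satisfy a per-task predicate.
function rate(
  tasks: Record<string, TrialOutcomes>,
  pred: (t: TrialOutcomes) => boolean,
): number {
  const all = Object.keys(tasks).map((id) => tasks[id]);
  return all.length === 0 ? 0 : all.filter(pred).length / all.length;
}

interface GateResult {
  passAt: number;
  passHat: number;
  mergeAllowed: boolean; // blocking gate on pass@k
  flaky: boolean;        // non-blocking signal from pass^k
}

function evaluateGate(
  tasks: Record<string, TrialOutcomes>,
  passAtThreshold = 0.8,
  passHatThreshold = 0.6,
): GateResult {
  const passAt = rate(tasks, passAtK);
  const passHat = rate(tasks, passHatK);
  return {
    passAt,
    passHat,
    mergeAllowed: passAt >= passAtThreshold,
    flaky: passHat < passHatThreshold,
  };
}
```

Keeping the blocking decision and the flakiness flag as separate fields mirrors the split between the merge gate and the tracked-only metric.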
CI Integration
Add a GitHub Actions workflow triggered on PRs that touch skills/** or evals/**:
name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'
      - 'evals/**'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      # Each step starts in the repo root, so set the directory explicitly.
      - run: npm install
        working-directory: evals
      - run: npm run eval -- --trials=5 --provider=docker
        working-directory: evals
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: skill-eval-results
          path: evals/results/
A result summary comment is posted on the PR showing per-task pass rates and any regressions relative to the main branch baseline.
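The comparison behind such a summary could look roughly like the sketch below. It assumes per-task pass rates (0 to 1) are already available for both the PR branch and the main baseline; how they are loaded from the persisted JSON results is left out, and the names and output format are hypothetical.

```typescript
type PassRates = Record<string, number>; // task id -> pass rate in 0..1

interface TaskDelta {
  task: string;
  baseline: number | null; // null when the task does not exist on the baseline
  current: number;
  regressed: boolean;
}

// Compares current pass rates with the baseline; a task regresses when its
// rate drops by more than `tolerance` below the baseline value.
function compareToBaseline(
  baseline: PassRates,
  current: PassRates,
  tolerance = 0,
): TaskDelta[] {
  return Object.keys(current)
    .sort()
    .map((task) => {
      const base = task in baseline ? baseline[task] : null;
      const cur = current[task];
      return {
        task,
        baseline: base,
        current: cur,
        regressed: base !== null && cur < base - tolerance,
      };
    });
}

// Renders one line per task, suitable for the body of a PR comment.
function renderSummary(deltas: TaskDelta[]): string {
  const pct = (x: number) => `${Math.round(x * 100)}%`;
  return deltas
    .map((d) => {
      const base = d.baseline === null ? "new" : pct(d.baseline);
      const flag = d.regressed ? " (regression)" : "";
      return `${d.task}: ${base} -> ${pct(d.current)}${flag}`;
    })
    .join("\n");
}
```

A non-zero `tolerance` may be worth considering in practice, since pass rates from five trials are noisy and a single unlucky run would otherwise be reported as a regression.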
Acceptance Criteria
- evals/ directory scaffolded with at least one task per skill (minimum 3 tasks total as a first pass).
- pass@5 ≥ 80% on main at the time of merging the initial suite.
- README.md in evals/ documents how to run evals locally and how to add a new task.
Out of Scope (future work)
- Eval coverage for the ng update migration schematic that installs skills into consumer projects.
- Evals for the igniteui-theming MCP server tools themselves (separate harness needed).
- Multi-skill composition tasks (e.g., build a themed hierarchical grid with a custom palette), tracked separately once per-skill coverage is stable.