Implement Automated Eval Test Suite for the Angular Skills

We have three Skills (`igniteui-angular-components`, `igniteui-angular-grids`, `igniteui-angular-theming`) that teach coding agents how to correctly select, configure, and compose Ignite UI for Angular components. As these skills grow in complexity and more developers rely on them, silent regressions become a real risk: rewording a step, reordering routing logic, or removing a "verify" clause can quietly degrade agent behavior with no signal until a user reports a wrong output.

This work item establishes a structured eval process for these skills, directly inspired by [Minko Gechev's Skill Eval framework](https://github.com/mgechev/skill-eval) (see the [accompanying blog post](https://blog.mgechev.com/2026/02/26/skill-eval/)) and extended with patterns from [Anthropic's agent eval research](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) and the [Skills Best Practices guide](https://github.com/mgechev/skills-best-practices).

### Goals

- Produce a measurable, repeatable quality score for each skill.
- Detect regressions automatically when a skill file is modified in a PR.
- Provide a feedback loop during skill authoring (edit → eval → score delta).
- Establish pass/fail thresholds that gate merges to `main`.


### Approach

**Tooling:** Adopt the [`skill-eval`](https://github.com/mgechev/skill-eval) TypeScript framework as the eval runner. It supports Docker-isolated agent execution, deterministic shell graders, LLM rubric graders, multi-trial runs, and JSON result persistence — all the properties needed here.

### Task Structure

Create an `evals/` directory at the repo root. Each eval task is a self-contained directory, for example:

```
evals/
├── tasks/
│   ├── grid-basic-setup/
│   │   ├── task.toml               # timeouts, grader weights, trial count
│   │   ├── instruction.md          # what the agent is asked to do
│   │   ├── environment/Dockerfile  # clean Angular project baseline
│   │   ├── tests/test.sh           # deterministic grader (file checks, compile, lint)
│   │   ├── prompts/quality.md      # LLM rubric grader questions
│   │   ├── solution/solve.sh       # reference solution for baseline
│   │   └── skills/                 # symlinks or copies of the skills under test
│   │       └── igniteui-angular-grids/SKILL.md
│   ├── grid-sorting-remote-data/
│   ├── grid-hierarchical-setup/
│   ├── grid-pivot-config/
│   ├── component-combo-reactive-form/
│   ├── component-date-picker-validation/
│   ├── component-dialog-service/
│   ├── theming-palette-generation/
│   ├── theming-component-override/
│   └── skill-routing-intent-detection/  # tests the SKILL.md router logic itself
├── package.json
└── README.md
```
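
A small pre-flight check can catch incomplete task directories before any agent run is paid for. The sketch below takes the file list from the example layout above; the function name `missingTaskFiles` and the idea of validating against a fixed list are assumptions for illustration, not part of the `skill-eval` framework.

```typescript
// Files every task directory is expected to contain, per the example layout.
// The skills/ folder is omitted because its contents vary per task.
const REQUIRED_TASK_FILES: readonly string[] = [
  "task.toml",
  "instruction.md",
  "environment/Dockerfile",
  "tests/test.sh",
  "prompts/quality.md",
  "solution/solve.sh",
];

/**
 * Returns the required files that are absent from a task directory.
 * `presentFiles` holds paths relative to the task directory root.
 */
function missingTaskFiles(presentFiles: Iterable<string>): string[] {
  const have = new Set(presentFiles);
  return REQUIRED_TASK_FILES.filter((file) => !have.has(file));
}
```

An empty result means the task is structurally complete; anything else can be reported as a scaffold error before the eval starts.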

### Tasks to Implement (per Skill)

#### `igniteui-angular-grids` skill (highest priority — most complex routing)

| Task ID | Instruction given to agent | Deterministic check | LLM rubric check |
|---|---|---|---|
| `grid-basic-setup` | "Add a data grid showing employee data with sorting and pagination" | Project compiles; `<igx-grid>` present in template; correct module imported | Did agent choose IgxGrid (not Tree/Hierarchical) for flat data? Did it configure `[data]` binding correctly? |
| `grid-tree-vs-flat` | "Display department data with nested child rows" | `<igx-tree-grid>` present; `childDataKey` configured | Did skill routing correctly select Tree Grid over flat Grid? |
| `grid-hierarchical-setup` | "Build a master-detail grid where clicking a row expands child orders" | `<igx-hierarchical-grid>` + `<igx-row-island>` present | Did agent configure load-on-demand vs inline data correctly based on instructions? |
| `grid-remote-filtering` | "Add server-side filtering and sorting to the grid" | `[filterMode]="'externalFilterMode'"` set; remote service stub present | Did agent wire `onDataPreLoad`/`sortingExpressionsChange` instead of local filtering? |
| `grid-pivot-config` | "Create a pivot grid with row/column/value dimensions" | `<igx-pivot-grid>` + `IgxPivotConfiguration` present | Did agent define `rows`, `columns`, `values` correctly vs a flat grid with groupBy? |
| `grid-state-persistence` | "Persist grid sorting and filtering state to localStorage" | `IgxGridStateDirective` present; serialize/restore calls present | Did agent use the state directive vs manually serializing expressions? |

#### `igniteui-angular-components` skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
|---|---|---|---|
| `component-combo-reactive-form` | "Add a multi-select combo bound to a reactive form control" | `<igx-combo>` present; `[formControlName]` wired; module imported | Did agent use IgxCombo (not IgxSelect or native `<select>`) for multi-select? |
| `component-date-picker-validation` | "Add a date picker with min/max date validation" | `<igx-date-picker>` present; `minValue`/`maxValue` inputs set | Did agent avoid using native `<input type=date>`? Did it correctly set validators? |
| `component-dialog-service` | "Show a confirmation dialog when the user clicks Delete" | `IgxDialogComponent` or service open call present | Did agent use the Dialog component/service vs a custom modal `div`? |
| `component-chart-selection` | "Display monthly sales as a bar chart" | `<igx-category-chart>` or `<igx-bar-chart>` present | Did agent pick the correct chart type (Bar vs Column vs Line) per the skill's intent detection? |

#### `igniteui-angular-theming` skill

| Task ID | Instruction | Deterministic check | LLM rubric check |
|---|---|---|---|
| `theming-palette-generation` | "Create a custom blue/orange branded theme" | `palette()` call with `$primary`/`$secondary`; `@include theme()` present | Did agent use `palette()` correctly vs hardcoding CSS variables? Did it call `core()` before `theme()`? |
| `theming-component-override` | "Change only the IgxButton background color without affecting the rest of the theme" | `button-theme()` mixin call present; scoped to component | Did agent use a component-level theme override vs overriding the global palette? |
| `theming-mcp-tool-invocation` | "Use the MCP server to generate a palette and scaffold a grid theme" | MCP tool call in transcript | Did agent invoke the MCP tool rather than writing SCSS manually? |

#### Cross-skill / routing tasks

| Task ID | Instruction | What's tested |
|---|---|---|
| `skill-routing-intent-detection` | Various ambiguous prompts ("add a table", "style my app", "show nested data") | Tests whether the SKILL.md router in each skill fires the correct sub-skill path rather than hallucinating a generic Angular solution |

### Grading Strategy

**Deterministic grader (`tests/test.sh`)** — runs after the agent finishes and checks:
- Project builds without errors (`ng build`)
- Correct Ignite UI selector is present in the generated template
- Required module or standalone import exists
- No use of forbidden alternatives (e.g., native `<table>` or `<select>` when the skill mandates an Ignite UI component)
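
The selector and forbidden-alternative checks above amount to simple tag matching on the generated template. The grader itself is planned as `tests/test.sh`; the TypeScript below only sketches the same logic, and the names `TemplateCheck`, `hasTag`, and `checkTemplate` are hypothetical.

```typescript
interface TemplateCheck {
  requiredSelector: string; // e.g. "igx-grid"
  forbiddenTags: string[];  // e.g. ["table", "select"]
}

/** True if the template opens an element with exactly this tag name. */
function hasTag(template: string, tag: string): boolean {
  // Escape regex metacharacters so the tag is matched literally.
  const escaped = tag.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Require whitespace, ">" or "/" after the name so that "igx-grid"
  // does not match "<igx-grid-toolbar>".
  return new RegExp(`<${escaped}(\\s|>|/)`, "i").test(template);
}

/** Returns a list of human-readable failures; empty means the check passed. */
function checkTemplate(template: string, check: TemplateCheck): string[] {
  const failures: string[] = [];
  if (!hasTag(template, check.requiredSelector)) {
    failures.push(`missing <${check.requiredSelector}>`);
  }
  for (const tag of check.forbiddenTags) {
    if (hasTag(template, tag)) {
      failures.push(`forbidden <${tag}> found`);
    }
  }
  return failures;
}
```

Tag matching of this kind is intentionally shallow: it confirms presence, not correct configuration, which is why the build step and the rubric grader remain necessary.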

**LLM rubric grader (`prompts/quality.md`)** — evaluates the agent transcript for:
- Correct intent routing (did the skill's decision logic fire?)
- Idiomatic API usage (inputs, outputs, bindings as documented)
- Absence of hallucinated APIs (wrong input names, non-existent outputs)
- Following the skill's "prefer X over Y" guidance

**Combined score:** each task uses a weighted average, e.g. 60% deterministic + 40% rubric. Weights are configurable per `task.toml`.
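
As an illustration of the weighted average, the sketch below normalizes by the total weight so the weights do not have to sum to exactly 1. The `GraderResult` shape is an assumption made for this example and is not taken from the `skill-eval` framework.

```typescript
// Hypothetical result shape: one entry per grader configured for a task.
interface GraderResult {
  name: string;
  score: number;  // normalized to the range [0, 1]
  weight: number; // relative weight, e.g. 0.6 and 0.4
}

/** Weighted average of grader scores, normalized by the total weight. */
function combinedScore(results: GraderResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("At least one grader must have a positive weight");
  }
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weighted / totalWeight;
}
```

With a deterministic score of 1.0 at weight 0.6 and a rubric score of 0.5 at weight 0.4, the combined score is 0.6 + 0.2 = 0.8.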

### Eval Execution & Pass/Fail Thresholds

Following Anthropic's recommendations on agent evals:

- **Minimum 5 trials per task** — agent behavior is non-deterministic; one run is meaningless.
- **`pass@5 ≥ 80%`** is the gate for merging skill changes (can the agent solve it at least once in 5 tries?).
- **`pass^5 ≥ 60%`** is tracked but not blocking — used to flag flaky skills that need clarification.
- A task scoring below the `pass@5` gate (80%) on a PR that touches the relevant skill **blocks merge**.
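
To make the two metrics concrete, the sketch below estimates both from a per-trial success rate. It assumes trials are independent and share one success probability, which is a simplification; the eval runner may compute these values differently, and the function names are hypothetical.

```typescript
/** Fraction of trials that passed. */
function passRate(trials: boolean[]): number {
  if (trials.length === 0) {
    throw new Error("At least one trial is required");
  }
  return trials.filter(Boolean).length / trials.length;
}

/** pass@k: probability that at least one of k trials succeeds. */
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

/** pass^k: probability that all k trials succeed. */
function passPowK(p: number, k: number): number {
  return Math.pow(p, k);
}
```

For a task where 3 of 5 trials pass, the per-trial rate is 0.6, giving `pass@5` of about 0.990 and `pass^5` of about 0.078. The gap between the two is what makes `pass^5` useful for spotting flaky skills even when the `pass@5` gate is met.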

### CI Integration

Add a GitHub Actions workflow triggered on PRs that touch `skills/**` or `evals/**`:

```yaml
name: Skill Eval
on:
  pull_request:
    paths:
      - 'skills/**'
      - 'evals/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: cd evals && npm install
      # Each step starts in the repo root, so change into evals/ again here
      - run: cd evals && npm run eval -- --trials=5 --provider=docker
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: skill-eval-results
          path: evals/results/
```

A result summary comment is posted on the PR showing per-task pass rates and any regressions relative to the `main` branch baseline.
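
One way to build that summary is to compare per-task pass rates from the PR run against the committed baseline. The sketch below assumes a simple map from task ID to pass rate; the actual layout of the results JSON is not specified here, so `PassRates`, `findRegressions`, and `renderSummary` are illustrative names only.

```typescript
// Hypothetical result shape: task ID -> pass rate in [0, 1].
type PassRates = Record<string, number>;

interface Regression {
  task: string;
  baseline: number;
  current: number;
}

/** Tasks whose pass rate dropped by more than `tolerance` versus the baseline. */
function findRegressions(baseline: PassRates, current: PassRates, tolerance = 0): Regression[] {
  return Object.keys(current)
    .filter((task) => task in baseline && current[task] < baseline[task] - tolerance)
    .sort()
    .map((task) => ({ task, baseline: baseline[task], current: current[task] }));
}

function percent(rate: number): string {
  return `${Math.round(rate * 100)}%`;
}

/** Renders a Markdown table suitable for a PR comment. */
function renderSummary(baseline: PassRates, current: PassRates): string {
  const regressed = new Set(findRegressions(baseline, current).map((r) => r.task));
  const rows = Object.keys(current).sort().map((task) => {
    const base = task in baseline ? percent(baseline[task]) : "n/a";
    const status = regressed.has(task) ? "regressed" : "ok";
    return `| \`${task}\` | ${base} | ${percent(current[task])} | ${status} |`;
  });
  return ["| Task | main | PR | Status |", "|---|---|---|---|", ...rows].join("\n");
}
```

Tasks that exist only in the PR run are listed with `n/a` for the baseline, so newly added tasks do not count as regressions.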


### Acceptance Criteria

- [ ] `evals/` directory scaffolded with at least one task per skill (minimum 3 tasks total as a first pass).
- [ ] Each task has both a deterministic grader and an LLM rubric grader.
- [ ] All tasks pass `pass@5 ≥ 80%` on `main` at the time of merging the initial suite.
- [ ] GitHub Actions workflow runs on skill-touching PRs and posts a summary comment.
- [ ] `README.md` in `evals/` documents how to run evals locally and how to add a new task.
- [ ] Baseline results JSON is committed to the repo for regression comparison.

### Out of Scope (future work)

- Eval coverage for the `ng update` migration schematic that installs skills into consumer projects.
- Evals for the `igniteui-theming` MCP server tools themselves (separate harness needed).
- Multi-skill composition tasks (e.g., build a themed hierarchical grid with a custom palette) — tracked separately once per-skill coverage is stable.

