This repository hosts public evaluation suites used by Clerk to test how LLMs perform at writing Clerk code (primarily in Next.js). If an AI contributor is asked to "create a new eval suite for the Waitlist feature", it should add a new folder under src/evals/ with a PROMPT.md and graders.ts, then register it in src/index.ts.
Install Bun >=1.3.0, then gather the required API keys. See `.env.example`:

```bash
cp .env.example .env
```

Run the eval suite (might take about 50s):

```bash
bun i
bun start
```

For detailed, copy-pastable steps see docs/ADDING_EVALS.md. In short:
- Create `src/evals/your-eval/` with `PROMPT.md` and `graders.ts`.
- Implement graders that return booleans using `defineGraders(...)` and shared judges in `@/src/graders/catalog`.
- Append an entry to the `evaluations` array in `src/index.ts` with `framework`, `category`, and `path` (e.g., `evals/waitlist`); see the sketch after this list.
- Run `bun run start:eval src/evals/your-eval` (optionally `--debug`).
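The registration step might then look roughly like this (a sketch only; the field names follow the step above, and the exact shape of the `evaluations` array in `src/index.ts` may differ):

```ts
// src/index.ts (sketch): registering a hypothetical Waitlist eval.
export const evaluations = [
  // ...existing entries
  {
    framework: 'Next.js',
    category: 'Waitlist',
    path: 'evals/waitlist',
  },
]
```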
Example scores:

```json
[
  {
    "model": "claude-sonnet-4-5",
    "framework": "Next.js",
    "category": "Auth",
    "value": 0.8333333333333334,
    "updatedAt": "2026-01-06T17:51:27.901Z"
  },
  {
    "model": "gpt-5-chat-latest",
    "framework": "Next.js",
    "category": "Auth",
    "value": 0.6666666666666666,
    "updatedAt": "2026-01-06T17:51:30.871Z"
  },
  {
    "model": "claude-opus-4-5",
    "framework": "Next.js",
    "category": "Billing",
    "value": 1.0,
    "updatedAt": "2026-01-06T17:51:56.370Z"
  }
]
```

Debugging
```bash
# Run a single evaluation
bun run start:eval evals/auth/routes

# Run in debug mode
bun run start --debug

# Run a single evaluation in debug mode
bun run start:eval evals/auth/routes --debug
```

`bun start [options]`

| Flag | Description |
|---|---|
| `--mcp` | Enable MCP tools (uses mcp.clerk.dev by default) |
| `--model "claude-sonnet-4-0"` | Filter by exact model name (case-insensitive) |
| `--eval "protect"` | Filter evals by category or path |
| `--debug` | Save outputs to `debug-runs/` |
```bash
# Baseline (no tools)
bun start --model "claude-sonnet-4-0" --eval "protect"

# With MCP tools
bun start --mcp --model "claude-sonnet-4-0" --eval "protect"

# Local MCP server
MCP_SERVER_URL_OVERRIDE=http://localhost:8787/mcp bun start --mcp
```

Run evaluations using AI coding agents (Claude Code, Cursor) instead of direct LLM calls:
```bash
bun start:agent --agent claude-code [options]
```

| Flag | Description |
|---|---|
| `--agent, -a` | Agent type (required): `claude-code`, `cursor` |
| `--mcp` | Enable MCP tools |
| `--eval, -e` | Filter evals by path |
| `--debug, -d` | Save outputs to `debug-runs/` |
| `--timeout, -t` | Timeout per eval (ms) |
Shortcuts:
```bash
bun agent:claude       # claude-code baseline
bun agent:claude:mcp   # claude-code with MCP
```

Examples:
```bash
# Run all evals with Claude Code
bun start:agent --agent claude-code

# Run specific eval with debug output
bun start:agent -a claude-code -e auth/protect -d

# Run with MCP tools enabled
bun start:agent --agent claude-code --mcp
```

| Runner | Output | Description |
|---|---|---|
| `bun start` | `scores.json` | Baseline scores (no tools) |
| `bun start:mcp` | `scores-mcp.json` | MCP scores (with tools) |
| `bun start:agent` | `agent-scores.json` | Agent evaluation scores |
| `bun merge-scores` | `llm-scores.json` | Combined for llm-leaderboard |
```bash
bun start          # 1. Baseline -> scores.json
bun start --mcp    # 2. MCP      -> scores-mcp.json
bun merge-scores   # 3. Merge    -> llm-scores.json
```

The merge script combines both score files and calculates improvement metrics:
```json
{
  "model": "claude-sonnet-4-5",
  "label": "Claude Sonnet 4.5",
  "framework": "Next.js",
  "category": "Auth",
  "value": 0.83,
  "provider": "anthropic",
  "mcpScore": 0.95,
  "improvement": 0.12
}
```
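A minimal sketch of what that merge could compute, assuming both score files share the record shape from the earlier example (the repo's actual merge script may differ):

```ts
// Sketch: pair baseline and MCP scores and compute the improvement delta.
type Score = { model: string; framework: string; category: string; value: number }

function mergeScores(baseline: Score[], mcp: Score[]) {
  const key = (s: Score) => `${s.model}|${s.framework}|${s.category}`
  const mcpByKey = new Map(mcp.map((s) => [key(s), s.value] as const))

  return baseline.map((s) => {
    const mcpScore = mcpByKey.get(key(s))
    return {
      ...s,
      mcpScore,
      // e.g. 0.95 - 0.83 = 0.12, as in the example above
      improvement: mcpScore === undefined ? undefined : mcpScore - s.value,
    }
  })
}
```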
This project is broken up into a few core pieces:

- `src/index.ts`: The main entrypoint of the project. Evaluations, models, reporters, and the runner are registered here and executed.
- `/evals`: Folders that contain a prompt and grading expectations. Runners currently assume that eval folders contain two files: `graders.ts` and `PROMPT.md`.
- `/runners`: The primary logic responsible for loading evaluations, calling provider LLMs, and outputting scores.
- `/reporters`: The primary logic responsible for sending scores somewhere — stdout, a file, etc.
A runner takes a simple object as an argument:

- It will resolve the provider and model to the respective SDK.
- It will load the designated evaluation, generate LLM text from the prompt, and pass the result to graders.
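As a rough illustration of that flow (a sketch assuming the Vercel AI SDK and Bun's file API; the repo's actual runner lives in `/runners` and may differ):

```ts
import { generateText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { openai } from '@ai-sdk/openai'

// Sketch of a runner: resolve the model, load the eval, generate text, grade it.
async function runEval(args: { provider: string; model: string; evalPath: string }) {
  // 1. Resolve the provider/model to the respective SDK
  const model = args.provider === 'anthropic' ? anthropic(args.model) : openai(args.model)

  // 2. Load the evaluation's prompt and graders
  const prompt = await Bun.file(`${args.evalPath}/PROMPT.md`).text()
  const { graders } = await import(`${args.evalPath}/graders.ts`)

  // 3. Generate the answer and pass it to every grader
  const { text } = await generateText({ model, prompt })
  const results = await Promise.all(
    Object.values(graders).map((grade) => (grade as (output: string) => boolean | Promise<boolean>)(text)),
  )

  // Score = fraction of graders that passed
  return results.filter(Boolean).length / results.length
}
```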
At the moment, evaluations are simply folders that contain:
- `PROMPT.md`: the instruction we're evaluating the model's output against
- `graders.ts`: a module containing grader functions which return `true`/`false`, signalling whether the model's output passed or failed. This is essentially our acceptance criteria.
Shared grader primitives live in `src/graders/index.ts`. Use them to declare new checks with a consistent, terse shape:

```ts
import { contains, defineGraders, judge } from '@/src/graders'
import { llmChecks } from '@/src/graders/catalog'

export const graders = defineGraders({
  references_middleware: contains('middleware.ts'),
  package_json: llmChecks.packageJsonClerkVersion,
  custom_flow_description: judge(
    'Does the answer walk through protecting a Next.js API route with Clerk auth() and explain the response states?',
  ),
})
```

- `contains`/`containsAny`: case-insensitive substring checks by default
- `matches`: regex checks
- `judge`: thin wrappers around the LLM-as-judge scorer. Shared prompts live in `src/graders/catalog.ts`; add new reusable prompts there.
- `defineGraders`: preserves type inference for the exported `graders` record.
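For illustration, the other primitives might be used like this; the exact signatures of `containsAny` and `matches` are assumptions here, so check `src/graders/index.ts` for the real ones:

```ts
import { containsAny, defineGraders, matches } from '@/src/graders'

export const graders = defineGraders({
  // passes if the output mentions either key name (case-insensitive substring check)
  mentions_publishable_key: containsAny(['NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY', 'CLERK_PUBLISHABLE_KEY']),
  // passes if the output calls clerkMiddleware(...)
  uses_clerk_middleware: matches(/clerkMiddleware\s*\(/),
})
```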
For a given model and evaluation, we'll retrieve a score from 0..1, which is the percentage of grader functions that passed. For example, if five of six graders pass, the score is 0.8333…, as in the example scores above.
At the moment, we employ two minimal reporters.
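For illustration only, a stdout reporter could be as small as this sketch (not the repo's actual implementation):

```ts
// Hypothetical reporter shape: take the computed scores and send them somewhere.
interface Reporter {
  report(scores: Array<Record<string, unknown>>): void | Promise<void>
}

const stdoutReporter: Reporter = {
  report: (scores) => console.log(JSON.stringify(scores, null, 2)),
}
```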
For the notable interfaces, see /interfaces. For example, the object passed to a runner looks like:

```json
{
  "provider": "openai",
  "model": "gpt-5",
  "evalPath": "/absolute/path/to/clerk-evals/src/evals/auth/protect"
}
```