# Nightly Research Report — 2026-03-18 (Report #12)

## Executive Summary

Six consecutive days with zero code fixes. The backlog has now grown to ~100 open issues across 12 reports, yet the remediation pipeline remains stalled. This report surfaces a new critical path bug in `claude_baseline_agent.py` (LOCOBENCH template path hardcoded to a non-existent machine), a stale GitHub org URL in the export pipeline, a repeated falsy-value bug on an additional line not previously noted, and an architectural finding that the project's analysis layer is fundamentally broken: the missing `verification_modes` and `use_case_category` fields silently invalidate all filtered queries. The recommended next feature is a **DuckDB-based result analytics store** to replace the flat-file scan pattern and enable queryable, cross-run analysis.

---

## 1. Code & Architecture Review

### 1.1 NEW CRITICAL: Second Hardcoded Path in claude_baseline_agent.py

`agents/claude_baseline_agent.py:31` contains a second hardcoded personal path not previously catalogued (the `/tmp` race condition at lines 1134-1141 was noted in report #11, but this path is separate):

```python
LOCOBENCH_CLAUDE_MD_TEMPLATE = Path(
    "/home/stephanie_jarmak/CodeScaleBench/benchmarks/locobench_agent/templates/CLAUDE.md"
)
```

This path is doubly wrong:
- Points to `/home/stephanie_jarmak/` (not `/Users/sjarmak/`)
- References `CodeScaleBench` (not `CodeContextBench`)

The template is used when spawning locobench evaluation tasks. Any locobench run on any machine other than the original dev machine will crash with `FileNotFoundError` at task launch. This is distinct from the `apply_verifier_fixes.py:9` path noted in report #11.

**Fix:** Replace with `Path(__file__).resolve().parent.parent / "benchmarks/locobench_agent/templates/CLAUDE.md"`.
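
A minimal sketch of that fix, assuming the module sits in `agents/` one level below the repo root (`load_template` is a hypothetical helper added for illustration, not a function in the current file):

```python
from pathlib import Path

# Derive the template path relative to the repo root instead of a home
# directory, assuming this module lives at <repo>/agents/claude_baseline_agent.py.
REPO_ROOT = Path(__file__).resolve().parent.parent
LOCOBENCH_CLAUDE_MD_TEMPLATE = (
    REPO_ROOT / "benchmarks" / "locobench_agent" / "templates" / "CLAUDE.md"
)

def load_template() -> str:
    # Fail fast with the offending path in the message instead of a bare
    # FileNotFoundError deep inside task launch.
    if not LOCOBENCH_CLAUDE_MD_TEMPLATE.is_file():
        raise FileNotFoundError(
            f"LOCOBENCH template not found: {LOCOBENCH_CLAUDE_MD_TEMPLATE}"
        )
    return LOCOBENCH_CLAUDE_MD_TEMPLATE.read_text()
```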

---

### 1.2 NEW HIGH: Stale GitHub Org URL in Export Pipeline

`scripts/export_official_results.py:45` embeds a hardcoded GitHub URL:

```python
DEFAULT_REPO_BLOB_BASE = "https://github.com/sourcegraph/CodeScaleBench/blob/main"
```

The repository has been renamed to `CodeContextBench`. Every task link generated in exported HTML reports points to a 404 page (or GitHub redirect at best). Given that the export is the primary artifact shared with external reviewers, stale links undermine credibility.

**Fix:** Replace with `"https://github.com/sourcegraph/CodeContextBench/blob/main"` or derive from `git remote get-url origin`.
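
The derivation option can be sketched as a pure helper fed the output of `git remote get-url origin` (`blob_base_from_remote` is a hypothetical name; it assumes the remote is a GitHub URL in SSH or HTTPS form):

```python
import re

def blob_base_from_remote(url: str, branch: str = "main") -> str:
    # Normalize the SSH form git@github.com:org/repo(.git) to an https blob base.
    url = url.strip()
    m = re.match(r"git@github\.com:(.+?)(?:\.git)?$", url)
    if m:
        return f"https://github.com/{m.group(1)}/blob/{branch}"
    # HTTPS form: drop a trailing .git and append the blob path.
    return f"{url.removesuffix('.git')}/blob/{branch}"
```

Wiring this to `subprocess.run(["git", "remote", "get-url", "origin"], ...)`, with the current constant as a fallback, would keep exports working even outside a git checkout.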

---

### 1.3 NEW MEDIUM: Falsy Bug Repeated at Line 1005 of generate_eval_report.py

Report #11 noted the falsy bug at `generate_eval_report.py:147`:

```python
mcp_mode = hc.get("mcp_mode") or r.config_name
```

The same pattern reappears at **line 1005** in the same file (a second code path used when building the comparison table). If `mcp_mode` is a valid empty string, both sites would incorrectly fall back to `config_name`. This means fixes to line 147 alone are incomplete — both sites must be patched.
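
The distinction only bites when empty string is a legal `mcp_mode` value; a minimal repro (the dict and `config_name` stand in for the real objects in `generate_eval_report.py`):

```python
# Present-but-empty mcp_mode: `or` conflates it with a missing key.
hc = {"mcp_mode": ""}
config_name = "baseline"

buggy = hc.get("mcp_mode") or config_name   # empty string is falsy -> wrong fallback
fixed = hc.get("mcp_mode")
if fixed is None:                           # fall back only when the key is absent
    fixed = config_name
```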

---

### 1.4 Analysis Layer Is Structurally Broken: Missing Metadata in All 274 Tasks

Previous reports noted that `verification_modes` and `use_case_category` are missing from all 274 task definitions. The full impact hasn't been articulated until now:

- **`--use-case-category` filter**: silently returns 0 tasks (no error, no warning)
- **`--verification-mode` filter**: silently returns 0 tasks
- **Auto-detection of verifier type**: falls back to heuristics that are known to misclassify
- **Weighted scoring**: weight lookup fails silently (falls back to 1.0 for all tasks)

These fields are not cosmetic — they gate the correctness of every analysis run. Until all 274 tasks have these fields populated, no filtered query can be trusted. There is currently no script to detect this gap at runtime (the `repo_health.py` check validates schema structure but not field completeness).

**Scale**: 274 tasks × 2 missing fields = 548 missing metadata values.

---

### 1.5 Result Storage Does Not Scale

Benchmark results are stored as flat files (`result.json`, `task_metrics.json` per task per run). The `results/` directory currently contains only `repo_cloc_counts.json` — no run outputs. The `export_official_results.py` pipeline must scan the entire run directory tree to build the HTML export, and the HTML output truncates at 1200 rows (noted in report #9).

Meanwhile, `data/contextbench/` already contains parquet files (test/hard/verified subsets), indicating that the project has partially explored columnar storage. However, there is no queryable interface — no DuckDB schema, no SQL interface, no way to ask "what is the average score by suite across the last 5 runs?" without writing a custom script.

The current architecture requires a full file-system scan for every analysis, which will degrade as the task count grows beyond 283.

---

### 1.6 Dashboard Referenced in MEMORY.md Does Not Exist

`MEMORY.md` states: "Dashboard at `dashboard/app.py` (Streamlit), DB at `data/codecontextbench.db`"

Neither exists:
- `/Users/sjarmak/CodeContextBench/dashboard/` — directory absent
- `data/codecontextbench.db` — file absent (data/ contains only parquet files and CSVs)

Report #10 flagged this as "Dashboard Reference in Docs Does Not Exist" but the memory entry was never corrected. The stale MEMORY.md entry is misleading to agents starting new sessions.

**Fix:** Update `MEMORY.md` to remove the dashboard/DB references until those components are built.

---

### 1.7 Cost Report Default Guard Pattern Incorrect

`scripts/cost_report.py:262`:

```python
pct = ((cost / tasks) / (baseline_cost / config_tasks.get("baseline", 1)) - 1) * 100
```

If the "baseline" config is absent from the run, `get("baseline", 1)` returns integer `1`, producing a nonsensical cost comparison (dividing by 1 as a proxy for baseline cost). The correct guard is `get("baseline") or 1`, which at least documents the intent of the fallback. But more importantly, the absence of a baseline config should probably surface as an error rather than silently compute a bogus percentage. This is related to the `defaultdict(int)` issue noted in report #9's `cost_report` entry but is a different code path.
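
A sketch of the loud-failure variant suggested above (values are illustrative; variable names mirror the snippet):

```python
# Illustrative inputs standing in for the real cost_report aggregates.
cost, tasks = 12.0, 4
baseline_cost = 30.0
config_tasks = {"baseline": 10}

baseline_n = config_tasks.get("baseline")
if not baseline_n:
    # Absent (or zero) baseline: refuse to fabricate a comparison.
    raise ValueError("run has no 'baseline' config; cost delta is undefined")
pct = ((cost / tasks) / (baseline_cost / baseline_n) - 1) * 100
```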

---

## 2. Feature & UX Improvements

### 2.1 No Way to Query Results Across Runs

Currently, answering "how did scores change between run A and run B for suite csb_sdlc_fix?" requires:
1. Writing a custom script
2. Scanning thousands of JSON files
3. Joining across multiple data structures manually

There is no interactive or repeatable query interface. The HTML export at 1200 rows is the only cross-run view, and it's static.

**Sketch:** A `ccb query` CLI command backed by DuckDB could answer:
```sql
SELECT suite, AVG(score) AS avg_score, COUNT(*) AS n
FROM results
WHERE run_id IN ('run-a', 'run-b')
GROUP BY suite
ORDER BY avg_score DESC;
```

### 2.2 No Diff View Between Config Modes

The core value proposition of the benchmark is measuring MCP vs. baseline performance. Currently, comparing two config modes requires running `generate_eval_report.py` and reading through the HTML. There is no concise "delta report" showing which tasks improved, degraded, or changed behavior between baseline and MCP configs.

**Sketch:** `python3 scripts/compare_configs.py --run-a baseline-run-001 --run-b mcp-run-001 --output delta.md` producing a table sorted by score delta.

### 2.3 MEMORY.md Has Stale Entries

The `MEMORY.md` dashboard/DB entries are actively misleading new agent sessions. Agents reading MEMORY.md at session start will attempt to verify dashboard state or DB queries that cannot succeed.

**Fix:** Prune stale entries immediately; add a note that the dashboard is "planned but not yet built."

---

## 3. Research Recommendations

### 3.1 DuckDB for Benchmark Result Storage

The `data/contextbench/` directory already contains parquet files, confirming that columnar storage has been considered. [DuckDB](https://duckdb.org/) is the natural next step:

- Zero-server setup (embedded library, single `.duckdb` file)
- Reads parquet natively: `SELECT * FROM 'data/contextbench/*.parquet'`
- Handles 283+ tasks × N runs without performance degradation
- Enables ad-hoc SQL queries from CLI, Jupyter, and Python scripts
- Integrates with existing Python toolchain: `pip install duckdb`

A schema like:
```sql
CREATE TABLE results (
  run_id TEXT, task_id TEXT, suite TEXT, config_mode TEXT,
  score FLOAT, max_score FLOAT, duration_s FLOAT,
  mcp_type TEXT, timestamp TIMESTAMP
);
```
would immediately enable the comparison queries the project currently cannot answer.
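
Seeding could start from the parquet subsets already on disk; a sketch (paths are from the report, and it assumes the parquet columns are compatible with the schema above):

```sql
-- Seed the table from the existing parquet subsets (assumed-compatible columns).
CREATE TABLE IF NOT EXISTS results AS
SELECT * FROM read_parquet('data/contextbench/*.parquet');

-- The kind of cross-run question that is currently unanswerable:
SELECT suite, config_mode, AVG(score) AS avg_score, COUNT(*) AS n
FROM results
GROUP BY suite, config_mode
ORDER BY avg_score DESC;
```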

### 3.2 Task Metadata Inference from Static Analysis

Given that 274 tasks lack `verification_modes` and `use_case_category`, a static analysis pass over `test.sh` files could auto-infer these fields:

- If `test.sh` sources `answer_json_verifier_lib.sh` → `verification_mode: answer_json`
- If `test.sh` sources `dual_score_lib.sh` → `verification_mode: dual_score`
- If parent directory is `csb_sdlc_fix` → `use_case_category: fix`
- If parent directory is `csb_org_crossrepo` → `use_case_category: crossrepo`

This could be implemented as a 100-line `scripts/repair_task_metadata.py` with a `--dry-run` flag. Once run, all 274 tasks would have valid metadata, unblocking filtered queries, weighted scoring, and auto-detection.
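
The inference rules above reduce to two small pure functions; a sketch (the library filenames and the last-token suite-name convention are taken from the examples and are assumptions about the real layout):

```python
from pathlib import Path

def infer_verification_mode(test_sh_text: str) -> str:
    # Match on which verifier library the test script sources.
    if "answer_json_verifier_lib.sh" in test_sh_text:
        return "answer_json"
    if "dual_score_lib.sh" in test_sh_text:
        return "dual_score"
    return "custom"   # neither library sourced

def infer_use_case_category(task_dir: Path) -> str:
    # Suite directories look like csb_sdlc_fix / csb_org_crossrepo; assume
    # the last underscore-separated token is the category.
    return task_dir.parent.name.rsplit("_", 1)[-1]
```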

### 3.3 Pre-commit with Targeted Ruff Rules (Follow-up from Report #11)

Report #11 recommended adding `pyproject.toml` with Ruff. The specific configuration needed (not yet in place):

```toml
[tool.ruff.lint]
select = ["S603", "S604", "SIM115", "BLE001", "PTH", "UP"]
per-file-ignores = {"scripts/sanitize_secrets.py" = ["S105", "S106"]}
```

This catches 4 of the 5 bug categories found in reports #10–#12 automatically. The report #11 PRD covers this in detail; it remains unimplemented.

---

## 4. Recommended Next Feature

### Task Metadata Auto-Repair + DuckDB Result Store

The project's analysis layer has two compounding gaps:
1. All 274 tasks lack `verification_modes` and `use_case_category` → every filtered query returns 0 results
2. No queryable result store → answering "did MCP help?" requires custom scripts every time

These two gaps together mean the benchmark's core output — the MCP vs. baseline comparison — cannot be reliably queried, filtered, or trended. This is the single highest-leverage improvement because it makes all previous and future benchmark runs actually useful for analysis.

**Feature: `scripts/repair_task_metadata.py` + `scripts/init_result_db.py`**

**Part 1 — Task Metadata Repair (2 hours)**

`scripts/repair_task_metadata.py`:
- Iterates over all 274 task directories in `benchmarks/`
- Inspects `tests/test.sh` for `source` statements to infer `verification_mode`
  - `answer_json_verifier_lib.sh` → `answer_json`
  - `dual_score_lib.sh` → `dual_score`
  - Neither → `custom`
- Infers `use_case_category` from parent suite directory name (e.g., `csb_sdlc_fix` → `fix`)
- Writes inferred values to each task's `task.json` (atomic write: temp + rename)
- `--dry-run` mode prints diff without writing
- `--validate` mode checks all tasks have both fields and exits non-zero if any are missing
- Adds a `repo_health.py` check calling `--validate` to prevent regression
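
The atomic-write step above can be sketched as follows (`write_task_json_atomic` is a hypothetical helper name):

```python
import json
import os
import tempfile
from pathlib import Path

def write_task_json_atomic(task_json: Path, data: dict) -> None:
    # Write to a temp file in the same directory, then rename over the
    # target; os.replace is atomic on POSIX, so readers never observe a
    # half-written task.json.
    fd, tmp = tempfile.mkstemp(dir=task_json.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.write("\n")
        os.replace(tmp, task_json)
    except BaseException:
        os.unlink(tmp)   # clean up the temp file on any failure
        raise
```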

**Part 2 — DuckDB Result Store (3 hours)**

`scripts/init_result_db.py`:
- Creates `data/results.duckdb` with normalized schema (runs, tasks, metrics tables)
- `--ingest-run <run-dir>` scans a run directory and inserts all `result.json` files
- `--query "SELECT ..."` for ad-hoc SQL from CLI
- Reads existing parquet files in `data/contextbench/` as initial seed data

`scripts/compare_configs.py`:
- Queries DuckDB to produce a markdown delta report between two run IDs
- Sorted by score delta descending
- Flags regressions (score drop > 0.1) with a warning marker
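
The delta core is a small pure function over per-task scores; a sketch (the dict-of-scores shape is illustrative, not the real result schema):

```python
def score_deltas(run_a: dict[str, float], run_b: dict[str, float],
                 threshold: float = 0.1) -> list[tuple[str, float, bool]]:
    # Pair scores by task_id across the two runs, compute run_b - run_a,
    # and flag tasks whose score dropped by more than `threshold`.
    rows = [
        (task_id, run_b[task_id] - run_a[task_id])
        for task_id in run_a.keys() & run_b.keys()
    ]
    rows.sort(key=lambda r: r[1], reverse=True)   # score delta descending
    return [(task_id, delta, delta < -threshold) for task_id, delta in rows]
```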

**Acceptance Criteria:**
1. All 274 tasks have non-empty `verification_modes` and `use_case_category`
2. `repo_health.py --quick` fails if any task is missing either field
3. `python3 scripts/repair_task_metadata.py --validate` exits 0
4. `python3 scripts/init_result_db.py --ingest-run <dir>` succeeds without error
5. `python3 scripts/compare_configs.py --run-a X --run-b Y` produces valid markdown output
6. All new code passes Ruff with rules S603, SIM115, BLE001

**Why this over the alternatives:**
- The code quality gate (report #11) is infrastructure — important but doesn't produce analysis output
- Verifier consolidation (report #10) is maintenance — reduces toil but doesn't unlock new capability
- This feature directly answers "did MCP help?" in a reliable, repeatable way, which is the project's core purpose

---

## Issues Added to CLAUDE.md This Session

- `agents/claude_baseline_agent.py:31`: LOCOBENCH_CLAUDE_MD_TEMPLATE hardcoded to `/home/stephanie_jarmak/CodeScaleBench` (second hardcoded path in this file, distinct from /tmp race at lines 1134-1141)
- `scripts/export_official_results.py:45`: `DEFAULT_REPO_BLOB_BASE` points to stale GitHub org `CodeScaleBench`
- `scripts/generate_eval_report.py:1005`: Falsy bug repeats (previously only line 147 was catalogued)
- `data/contextbench/*.parquet`: Parquet files exist; DuckDB integration is the natural next step
- MEMORY.md: Dashboard/DB entries are stale (dashboard/app.py and data/codecontextbench.db do not exist)
- 274 tasks × 2 missing metadata fields = 548 values; no runtime detection in repo_health.py

*Remediation velocity: 6 consecutive days without a code fix (Mar 12 → Mar 18). ~100 open issues across 12 reports.*