Commit 10fb0fb

docs: nightly research report 2026-03-18
Report #12. New findings:
- agents/claude_baseline_agent.py:31: second hardcoded path (LOCOBENCH_CLAUDE_MD_TEMPLATE)
- scripts/export_official_results.py:45: DEFAULT_REPO_BLOB_BASE points to stale CodeScaleBench org
- scripts/generate_eval_report.py:1005: falsy bug repeats (line 147 already catalogued)
- 274 tasks × 2 missing metadata fields (verification_modes, use_case_category) = 548 missing values; no runtime check
- data/contextbench/ parquet files confirm DuckDB is natural next step

Recommended next feature: task metadata auto-repair + DuckDB result analytics store.
Stale MEMORY.md dashboard entry corrected (dashboard does not exist).
1 parent 140f20f commit 10fb0fb

1 file changed: 250 additions, 0 deletions
@@ -0,0 +1,250 @@
# Nightly Research Report: 2026-03-18 (Report #12)

## Executive Summary

Six consecutive days have passed with zero code fixes. The backlog has grown to ~100 open issues across 12 reports, and the remediation pipeline remains stalled. This report surfaces four findings: a new critical path bug in `claude_baseline_agent.py` (the LOCOBENCH template path is hardcoded to a personal home directory), a stale repository URL in the export pipeline, a second occurrence of a previously reported falsy-value bug, and an architectural finding that the project's analysis layer is fundamentally broken because the missing `verification_modes` and `use_case_category` fields silently invalidate all filtered queries. The recommended next feature is task metadata auto-repair plus a **DuckDB-based result analytics store** that replaces the flat-file scan pattern and enables queryable, cross-run analysis.

---

## 1. Code & Architecture Review

### 1.1 NEW CRITICAL: Second Hardcoded Path in claude_baseline_agent.py

`agents/claude_baseline_agent.py:31` contains a second hardcoded personal path that was not previously catalogued. (The `/tmp` race condition at lines 1134-1141 was noted in report #11; this path is a separate issue.)

```python
LOCOBENCH_CLAUDE_MD_TEMPLATE = Path(
    "/home/stephanie_jarmak/CodeScaleBench/benchmarks/locobench_agent/templates/CLAUDE.md"
)
```

This path is doubly wrong:
- It points to `/home/stephanie_jarmak/` (not `/Users/sjarmak/`)
- It references `CodeScaleBench` (not `CodeContextBench`)

The template is used when spawning locobench evaluation tasks. Any locobench run on a machine other than the original dev machine will crash with `FileNotFoundError` at task launch. This is distinct from the `apply_verifier_fixes.py:9` path noted in report #11.

**Fix:** Replace with `Path(__file__).resolve().parent.parent / "benchmarks/locobench_agent/templates/CLAUDE.md"`.
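
The fix above can be sketched as follows. This is a minimal illustration that assumes the agent module sits one directory below the repo root; `locobench_template_path` and `require_template` are hypothetical helper names, not functions that exist in the codebase. The second helper shows one way to fail with a clear message at import or startup rather than at task launch.

```python
from pathlib import Path

# Location of the template relative to the repo root (path taken from the report).
TEMPLATE_RELPATH = Path("benchmarks/locobench_agent/templates/CLAUDE.md")


def locobench_template_path(module_file: str) -> Path:
    """Resolve the template relative to the repo root instead of a home directory.

    Assumes the calling module lives one level below the repo root,
    for example agents/claude_baseline_agent.py.
    """
    repo_root = Path(module_file).resolve().parent.parent
    return repo_root / TEMPLATE_RELPATH


def require_template(path: Path) -> Path:
    """Raise early, with the offending path in the message, if the template is missing."""
    if not path.is_file():
        raise FileNotFoundError(f"LoCoBench CLAUDE.md template not found: {path}")
    return path


# Hypothetical usage inside the agent module:
# LOCOBENCH_CLAUDE_MD_TEMPLATE = locobench_template_path(__file__)
```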

---

### 1.2 NEW HIGH: Stale GitHub Repo URL in Export Pipeline

`scripts/export_official_results.py:45` embeds a hardcoded GitHub URL:

```python
DEFAULT_REPO_BLOB_BASE = "https://github.com/sourcegraph/CodeScaleBench/blob/main"
```

The repository has been renamed to `CodeContextBench`. Every task link in the exported HTML reports therefore depends on GitHub's rename redirect at best and returns a 404 at worst. Because the export is the primary artifact shared with external reviewers, stale links undermine credibility.

**Fix:** Replace with `"https://github.com/sourcegraph/CodeContextBench/blob/main"` or derive from `git remote get-url origin`.
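
If the URL is derived from the remote rather than hardcoded, the conversion could look roughly like this. It is a sketch under assumptions: `blob_base_from_remote` is a hypothetical helper, it handles only the common `git@github.com:` and `https://github.com/` remote forms, and the branch defaults to `main` as in the existing constant. The remote string itself would come from running `git remote get-url origin`, for example via `subprocess.run`.

```python
import re

# Accepts git@github.com:org/repo(.git) and https://github.com/org/repo(.git).
_GITHUB_REMOTE = re.compile(
    r"^(?:git@github\.com:|https://github\.com/)"
    r"(?P<org>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def blob_base_from_remote(remote_url: str, branch: str = "main") -> str:
    """Turn a GitHub remote URL into the blob base used for task links."""
    match = _GITHUB_REMOTE.match(remote_url.strip())
    if match is None:
        raise ValueError(f"Not a recognised GitHub remote: {remote_url!r}")
    return f"https://github.com/{match['org']}/{match['repo']}/blob/{branch}"
```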

---

### 1.3 NEW MEDIUM: Falsy Bug Repeated at Line 1005 of generate_eval_report.py

Report #11 noted the falsy bug at `generate_eval_report.py:147`:

```python
mcp_mode = hc.get("mcp_mode") or r.config_name
```

The same pattern reappears at **line 1005** of the same file, in a second code path used when building the comparison table. If `mcp_mode` is a valid empty string, both sites incorrectly fall back to `config_name`. A fix to line 147 alone is therefore incomplete; both sites must be patched.
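
One way to patch both sites consistently is a shared helper that falls back only when the key is absent or `None`. The sketch below is illustrative: `resolve_mcp_mode` is a hypothetical name, and it assumes `hc` behaves like a mapping, as the `.get` call in the original line suggests.

```python
from typing import Any, Mapping


def resolve_mcp_mode(hc: Mapping[str, Any], config_name: str) -> str:
    """Return hc["mcp_mode"] unless it is missing or None.

    Unlike `hc.get("mcp_mode") or config_name`, an empty string is
    treated as a real value and is returned unchanged.
    """
    value = hc.get("mcp_mode")
    return config_name if value is None else value
```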

---

### 1.4 Analysis Layer Is Structurally Broken: Missing Metadata in All 274 Tasks

Previous reports noted that `verification_modes` and `use_case_category` are missing from all 274 task definitions. The full impact had not been spelled out until now:

- **`--use-case-category` filter**: silently returns 0 tasks (no error, no warning)
- **`--verification-mode` filter**: silently returns 0 tasks
- **Auto-detection of verifier type**: falls back to heuristics that are known to misclassify
- **Weighted scoring**: weight lookup fails silently (falls back to 1.0 for all tasks)

These fields are not cosmetic: they gate the correctness of every analysis run. Until all 274 tasks have these fields populated, no filtered query can be trusted. There is currently no script that detects this gap at runtime (the `repo_health.py` check validates schema structure but not field completeness).

**Scale**: 274 tasks × 2 missing fields = 548 missing metadata values.
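
A runtime completeness check of the kind that is missing could be as small as the sketch below. It assumes each task definition has already been loaded into a dict and carries an `id` key; both the key name and the `find_incomplete_tasks` helper are hypothetical. A filter command could call it first and refuse to run, instead of silently returning 0 tasks.

```python
from typing import Any, Iterable, Mapping

REQUIRED_FIELDS = ("verification_modes", "use_case_category")


def find_incomplete_tasks(tasks: Iterable[Mapping[str, Any]]) -> dict[str, list[str]]:
    """Map task id -> required fields that are absent or empty."""
    problems: dict[str, list[str]] = {}
    for task in tasks:
        missing = [field for field in REQUIRED_FIELDS if not task.get(field)]
        if missing:
            problems[str(task.get("id", "<unknown>"))] = missing
    return problems
```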

---

### 1.5 Result Storage Does Not Scale

Benchmark results are stored as flat files (`result.json` and `task_metrics.json`, one pair per task per run). The `results/` directory currently contains only `repo_cloc_counts.json` and no run outputs. The `export_official_results.py` pipeline must scan the entire run directory tree to build the HTML export, and the HTML output truncates at 1200 rows (noted in report #9).

Meanwhile, `data/contextbench/` already contains parquet files (test/hard/verified subsets), which indicates that the project has partially explored columnar storage. However, there is no queryable interface: no DuckDB schema, no SQL interface, and no way to ask "what is the average score by suite across the last 5 runs?" without writing a custom script.

The current architecture requires a full file-system scan for every analysis, which will degrade as the task count grows beyond 283.

---

### 1.6 Dashboard Referenced in MEMORY.md Does Not Exist

`MEMORY.md` states: "Dashboard at `dashboard/app.py` (Streamlit), DB at `data/codecontextbench.db`"

Neither exists:
- `/Users/sjarmak/CodeContextBench/dashboard/`: directory absent
- `data/codecontextbench.db`: file absent (`data/` contains only parquet files and CSVs)

Report #10 flagged this as "Dashboard Reference in Docs Does Not Exist", but the memory entry was never corrected. The stale `MEMORY.md` entry misleads agents starting new sessions.

**Fix:** Update `MEMORY.md` to remove the dashboard/DB references until those components are built.

---

### 1.7 Cost Report Default Guard Pattern Incorrect

`scripts/cost_report.py:262`:

```python
pct = ((cost / tasks) / (baseline_cost / config_tasks.get("baseline", 1)) - 1) * 100
```

If the "baseline" config is absent from the run, `get("baseline", 1)` returns the integer `1`, so the expression divides `baseline_cost` by a made-up task count of 1 and produces a nonsensical cost comparison. Writing the guard as `get("baseline") or 1` would at least document the intent of the fallback (and would also cover a task count of zero), but it still hides the problem. More importantly, the absence of a baseline config should probably surface as an error rather than silently compute a bogus percentage. This is related to the `defaultdict(int)` issue noted in report #9's `cost_report` entry but is a different code path.
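
A version that surfaces the missing baseline as an error might look like the following. This is a sketch, not the script's actual code: `cost_delta_pct` is a hypothetical function, and the caller is assumed to pass `config_tasks.get("baseline")` (which is `None` when the config is absent) as `baseline_tasks`.

```python
from typing import Optional


def cost_delta_pct(
    cost: float, tasks: int, baseline_cost: float, baseline_tasks: Optional[int]
) -> float:
    """Percent change in per-task cost relative to the baseline config."""
    if not baseline_tasks:
        # Covers both a missing baseline config (None) and a zero task count.
        raise ValueError("no baseline tasks in this run; cannot compute a cost delta")
    if tasks <= 0:
        raise ValueError("config has no tasks; cannot compute a per-task cost")
    baseline_per_task = baseline_cost / baseline_tasks
    if baseline_per_task == 0:
        raise ValueError("baseline cost per task is zero; percentage is undefined")
    return ((cost / tasks) / baseline_per_task - 1) * 100
```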

---

## 2. Feature & UX Improvements

### 2.1 No Way to Query Results Across Runs

Currently, answering "how did scores change between run A and run B for suite csb_sdlc_fix?" requires:
1. Writing a custom script
2. Scanning thousands of JSON files
3. Joining across multiple data structures manually

There is no interactive or repeatable query interface. The HTML export, capped at 1200 rows, is the only cross-run view, and it is static.

**Sketch:** A `ccb query` CLI command backed by DuckDB could answer:
```sql
SELECT suite, run_id, AVG(score) AS avg_score, COUNT(*) AS n_tasks FROM results
WHERE run_id IN ('run-a', 'run-b')
GROUP BY suite, run_id ORDER BY suite, avg_score DESC;
```

### 2.2 No Diff View Between Config Modes

The core value proposition of the benchmark is measuring MCP vs. baseline performance. Currently, comparing two config modes requires running `generate_eval_report.py` and reading through the HTML. There is no concise "delta report" showing which tasks improved, degraded, or changed behavior between baseline and MCP configs.

**Sketch:** `python3 scripts/compare_configs.py --run-a baseline-run-001 --run-b mcp-run-001 --output delta.md` producing a table sorted by score delta.

### 2.3 MEMORY.md Has Stale Entries

The `MEMORY.md` dashboard/DB entries are actively misleading new agent sessions. Agents reading `MEMORY.md` at session start will attempt to verify dashboard state or run DB queries that cannot succeed.

**Fix:** Prune stale entries immediately; add a note that the dashboard is "planned but not yet built."

---

## 3. Research Recommendations

### 3.1 DuckDB for Benchmark Result Storage

The `data/contextbench/` directory already contains parquet files, confirming that columnar storage has been considered. [DuckDB](https://duckdb.org/) is the natural next step:

- Zero-server setup (embedded library, single `.duckdb` file)
- Reads parquet natively: `SELECT * FROM 'data/contextbench/*.parquet'`
- Handles 283+ tasks × N runs without performance degradation
- Enables ad-hoc SQL queries from CLI, Jupyter, and Python scripts
- Integrates with existing Python toolchain: `pip install duckdb`

A schema like:
```sql
CREATE TABLE results (
    run_id TEXT, task_id TEXT, suite TEXT, config_mode TEXT,
    score FLOAT, max_score FLOAT, duration_s FLOAT,
    mcp_type TEXT, timestamp TIMESTAMP
);
```
would immediately enable the comparison queries the project currently cannot answer.
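
To show the kind of query this schema supports, the sketch below loads a few rows and computes the average score per suite and run. Two caveats apply. First, it uses Python's built-in `sqlite3` as a stand-in so that it runs without extra dependencies; it is not DuckDB, although this plain SQL is expected to carry over with little or no change. Second, the rows are made-up illustration data, not real benchmark results.

```python
import sqlite3

SCHEMA = """
CREATE TABLE results (
    run_id TEXT, task_id TEXT, suite TEXT, config_mode TEXT,
    score FLOAT, max_score FLOAT, duration_s FLOAT,
    mcp_type TEXT, timestamp TIMESTAMP
)
"""

# Made-up rows for illustration only; not real benchmark results.
ROWS = [
    ("run-a", "t1", "csb_sdlc_fix", "baseline", 0.50, 1.0, 10.0, None, "2026-03-01 00:00:00"),
    ("run-a", "t2", "csb_sdlc_fix", "baseline", 0.70, 1.0, 12.0, None, "2026-03-01 00:00:00"),
    ("run-b", "t1", "csb_sdlc_fix", "mcp", 0.75, 1.0, 11.0, "example", "2026-03-02 00:00:00"),
    ("run-b", "t2", "csb_sdlc_fix", "mcp", 0.75, 1.0, 13.0, "example", "2026-03-02 00:00:00"),
]


def avg_score_by_suite_and_run(conn: sqlite3.Connection) -> list[tuple]:
    """Average score and task count for every (suite, run_id) pair."""
    return conn.execute(
        "SELECT suite, run_id, AVG(score) AS avg_score, COUNT(*) AS n_tasks "
        "FROM results GROUP BY suite, run_id ORDER BY suite, run_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", ROWS)
summary = avg_score_by_suite_and_run(conn)
```

Grouping by `run_id` as well as `suite` keeps the two runs separate, which is what a before/after comparison needs.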

### 3.2 Task Metadata Inference from Static Analysis

Given that 274 tasks lack `verification_modes` and `use_case_category`, a static analysis pass over `test.sh` files could auto-infer these fields:

- If `test.sh` sources `answer_json_verifier_lib.sh` → `verification_mode: answer_json`
- If `test.sh` sources `dual_score_lib.sh` → `verification_mode: dual_score`
- If parent directory is `csb_sdlc_fix` → `use_case_category: fix`
- If parent directory is `csb_org_crossrepo` → `use_case_category: crossrepo`

This could be implemented as a roughly 100-line `scripts/repair_task_metadata.py` with a `--dry-run` flag. Once run, all 274 tasks would have valid metadata, unblocking filtered queries, weighted scoring, and auto-detection.
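
The inference rules above can be sketched as two pure functions. The function names are hypothetical, and two parts of the behavior are assumptions rather than facts from the repository: that a sourced library always appears on a line starting with `source` or `.`, and that the category is the last underscore-separated token of a `csb_*` suite name (which matches both examples listed). The `custom` fallback follows the rule given in section 4.

```python
from typing import Optional

# Verifier library -> verification mode, as listed in the report.
VERIFIER_LIBS = {
    "answer_json_verifier_lib.sh": "answer_json",
    "dual_score_lib.sh": "dual_score",
}


def infer_verification_mode(test_sh_text: str) -> str:
    """Infer the mode from `source` / `.` lines in a test.sh script."""
    for line in test_sh_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith(("source ", ". ")):
            continue
        for lib, mode in VERIFIER_LIBS.items():
            if lib in stripped:
                return mode
    return "custom"


def infer_use_case_category(suite_dir: str) -> Optional[str]:
    """Last token of a csb_<group>_<category> suite name, else None (assumed rule)."""
    if not suite_dir.startswith("csb_") or suite_dir.count("_") < 2:
        return None
    return suite_dir.rsplit("_", 1)[-1]
```

Returning `None` for an unrecognised suite name, rather than guessing, lets a `--dry-run` report list the tasks that still need manual review.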

### 3.3 Pre-commit with Targeted Ruff Rules (Follow-up from Report #11)

Report #11 recommended adding `pyproject.toml` with Ruff. The specific configuration needed (not yet in place):

```toml
[tool.ruff.lint]
select = ["S603", "S604", "SIM115", "BLE001", "PTH", "UP"]
per-file-ignores = {"scripts/sanitize_secrets.py" = ["S105", "S106"]}
```

This automatically catches 4 of the 5 bug categories found in reports #10–#12. The report #11 PRD covers this in detail; it remains unimplemented.

---

## 4. Recommended Next Feature

### Task Metadata Auto-Repair + DuckDB Result Store

The project's analysis layer has two compounding gaps:
1. All 274 tasks lack `verification_modes` and `use_case_category` → every filtered query returns 0 results
2. No queryable result store → answering "did MCP help?" requires custom scripts every time

Together, these gaps mean the benchmark's core output, the MCP vs. baseline comparison, cannot be reliably queried, filtered, or trended. Closing them is the single highest-leverage improvement because it makes all previous and future benchmark runs usable for analysis.

**Feature: `scripts/repair_task_metadata.py` + `scripts/init_result_db.py`**

**Part 1: Task Metadata Repair (2 hours)**

`scripts/repair_task_metadata.py`:
- Iterates over all 274 task directories in `benchmarks/`
- Inspects `tests/test.sh` for `source` statements to infer `verification_mode`
  - `answer_json_verifier_lib.sh` → `answer_json`
  - `dual_score_lib.sh` → `dual_score`
  - Neither → `custom`
- Infers `use_case_category` from parent suite directory name (e.g., `csb_sdlc_fix` → `fix`)
- Writes inferred values to each task's `task.json` (atomic write: temp + rename)
- `--dry-run` mode prints diff without writing
- `--validate` mode checks all tasks have both fields and exits non-zero if any are missing
- Adds a `repo_health.py` check calling `--validate` to prevent regression
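
The "atomic write: temp + rename" step could be implemented along these lines. This is a sketch: `update_json_atomically` is a hypothetical helper, and it assumes each `task.json` holds a single JSON object.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Mapping


def update_json_atomically(path: Path, updates: Mapping[str, Any]) -> dict:
    """Merge `updates` into the JSON object at `path` via temp file + rename."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data.update(updates)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.write("\n")
        os.replace(tmp_name, path)  # readers see either the old or the new file
    finally:
        # After a successful replace the temp name no longer exists.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
    return data
```

Writing to a sibling temp file first means an interrupted run leaves the original `task.json` untouched instead of half-written.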

**Part 2: DuckDB Result Store (3 hours)**

`scripts/init_result_db.py`:
- Creates `data/results.duckdb` with normalized schema (runs, tasks, metrics tables)
- `--ingest-run <run-dir>` scans a run directory and inserts all `result.json` files
- `--query "SELECT ..."` for ad-hoc SQL from CLI
- Reads existing parquet files in `data/contextbench/` as initial seed data

`scripts/compare_configs.py`:
- Queries DuckDB to produce a markdown delta report between two run IDs
- Sorted by score delta descending
- Flags regressions (score drop > 0.1) with a warning marker
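
The report-building half of `compare_configs.py` could be sketched independently of the database layer. The version below assumes the per-task scores for each run have already been fetched into dicts keyed by task id; `delta_report`, the column layout, and the `WARNING` marker text are illustrative choices, while the 0.1 threshold comes from the description above.

```python
from typing import Mapping

REGRESSION_THRESHOLD = 0.1  # a score drop larger than this gets the warning marker


def delta_report(scores_a: Mapping[str, float], scores_b: Mapping[str, float]) -> str:
    """Markdown table of per-task score deltas (run B minus run A), largest gain first."""
    shared = sorted(set(scores_a) & set(scores_b))
    rows = sorted(
        ((task, scores_a[task], scores_b[task], scores_b[task] - scores_a[task]) for task in shared),
        key=lambda row: row[3],
        reverse=True,
    )
    lines = ["| task | run A | run B | delta | flag |", "| --- | --- | --- | --- | --- |"]
    for task, a, b, delta in rows:
        flag = "WARNING" if delta < -REGRESSION_THRESHOLD else ""
        lines.append(f"| {task} | {a:.2f} | {b:.2f} | {delta:+.2f} | {flag} |")
    return "\n".join(lines)
```

Only tasks present in both runs are compared; a fuller implementation would probably also list tasks that appear in just one run.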

**Acceptance Criteria:**
1. All 274 tasks have non-empty `verification_modes` and `use_case_category`
2. `repo_health.py --quick` fails if any task is missing either field
3. `python3 scripts/repair_task_metadata.py --validate` exits 0
4. `python3 scripts/init_result_db.py --ingest-run <dir>` succeeds without error
5. `python3 scripts/compare_configs.py --run-a X --run-b Y` produces valid markdown output
6. All new code passes Ruff with rules S603, SIM115, BLE001

**Why this over the alternatives:**
- The code quality gate (report #11) is infrastructure: important, but it doesn't produce analysis output
- Verifier consolidation (report #10) is maintenance: it reduces toil but doesn't unlock new capability
- This feature directly answers "did MCP help?" in a reliable, repeatable way, which is the project's core purpose

---

## Issues Added to CLAUDE.md This Session

- `agents/claude_baseline_agent.py:31`: `LOCOBENCH_CLAUDE_MD_TEMPLATE` hardcoded to `/home/stephanie_jarmak/CodeScaleBench` (second hardcoded path in this file, distinct from the `/tmp` race at lines 1134-1141)
- `scripts/export_official_results.py:45`: `DEFAULT_REPO_BLOB_BASE` points to the stale repository name `CodeScaleBench`
- `scripts/generate_eval_report.py:1005`: Falsy bug repeats (previously only line 147 was catalogued)
- `data/contextbench/*.parquet`: Parquet files exist; DuckDB integration is the natural next step
- `MEMORY.md`: Dashboard/DB entries are stale (`dashboard/app.py` and `data/codecontextbench.db` do not exist)
- 274 tasks × 2 missing metadata fields = 548 values; no runtime detection in `repo_health.py`

*Remediation velocity: 6 consecutive days without a code fix (Mar 12 → Mar 18). ~100 open issues across 12 reports.*
