fix(core): exclude stale entity rows from embedding coverage stats#675
fix(core): exclude stale entity rows from embedding coverage stats#675phernandez merged 3 commits intomainfrom
Conversation
When entities are deleted, their rows in search_index and search_vector_chunks can remain as orphans. This caused `bm project info` to overcount total_indexed_entities and report false "missing embeddings" recommendations after a successful reindex. Two-pronged fix: 1. Filter embedding status queries to only count entity_ids that still exist in the entity table (fixes reporting) 2. Purge stale search rows during reindex (fixes root cause) Fixes #670 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
|
Claude finished @phernandez's task in 2m 9s —— View job PR Review: fix(core): exclude stale entity rows from embedding coverage statsTasks:
SummaryThe fix correctly addresses the root cause: stale rows in
The overall approach is sound. A few issues to address: Issues Found🔴 Dead parameter in
|
| Item | Status |
|---|---|
| Follows Basic Memory coding conventions (CLAUDE.md) | ✅ |
| Python 3.12+ type annotations and async patterns | ✅ |
SQLAlchemy 2.0 best practices (text(), parameterized queries) |
✅ |
| 100-character line length | ✅ |
| No commented-out code | ✅ |
| Literate programming style (decision point comments) | ✅ |
| Unit tests for new functions/methods | _purge_stale_search_rows itself |
| 100% test coverage | |
| Database changes include Alembic migrations | ✅ N/A (no schema changes) |
| No hardcoded secrets | ✅ |
| Input validation | ✅ |
| Windows-compatible file paths | ✅ N/A |
| No emoji/unicode in console/log output | ✅ |
Minor Notes
- The SQLite-specific embedding deletion before chunk deletion (lines 452–460) is well-commented and correct — good handling of the missing CASCADE in sqlite-vec.
- The
entity_existsstring interpolation approach (f-string intotext()) is acceptable here since both fragments are hardcoded strings, not user input. No SQL injection risk. - The import inside the method (
from basic_memory.repository.sqlite_search_repository import SQLiteSearchRepository) is unusual — consider moving to top of file unless there's a circular import issue.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6bf5343faa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Clean up stale rows in search_index and search_vector_chunks | ||
| # that reference entity_ids no longer in the entity table | ||
| await self._purge_stale_search_rows(set(entity_ids)) |
There was a problem hiding this comment.
Ensure vector tables exist before purging stale rows
reindex_vectors now calls _purge_stale_search_rows before sync_entity_vectors_batch, but table creation happens inside sync_entity_vectors_batch (_ensure_vector_tables). On a first-time embeddings run (for example bm reindex --embeddings on a project that has never built vectors), the new purge issues DELETE statements against search_vector_chunks/search_vector_embeddings before those tables exist, which raises a DB error and aborts reindexing.
Useful? React with 👍 / 👎.
1. Remove unused valid_entity_ids parameter from _purge_stale_search_rows 2. Add entity_exists filter to total_embeddings and orphaned_chunks queries for consistency (previously only applied to 3 of 5 queries) 3. Strengthen test assertion to exact match instead of <= Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
|
Claude finished @phernandez's task in 2m 44s —— View job PR Review: fix(core): exclude stale entity rows from embedding coverage statsTasks:
SummaryThe follow-up commit ( Previous Issues — Resolved ✅
Codex P1 — False PositiveThe Codex review flagged that This concern is not valid: both Remaining Observations🟡 Local import inside
|
| Item | Status |
|---|---|
| Follows Basic Memory coding conventions (CLAUDE.md) | ✅ |
| Python 3.12+ type annotations and async patterns | ✅ |
SQLAlchemy 2.0 best practices (text(), parameterized queries) |
✅ |
| 100-character line length | ✅ |
| No commented-out code | ✅ |
| Literate programming style (decision point comments) | ✅ |
| Unit tests for new functions/methods | _purge_stale_search_rows is exercised through test_reindex_vectors but its stale-row-removal behavior is not directly asserted |
| 100% test coverage | ✅ Lines covered via reindex_vectors test path |
| Database changes include Alembic migrations | ✅ N/A |
| No hardcoded secrets | ✅ |
| Input validation | ✅ |
| Windows-compatible file paths | ✅ N/A |
The core fix is correct and well-structured. The two observations above are non-blocking but would improve defensive coverage for future regressions.
Postgres search_index is a regular table with NOT NULL id constraint, unlike SQLite's FTS5 virtual table where id is UNINDEXED. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: phernandez <paul@basicmachines.co>
|
Claude finished @phernandez's task in 1m 13s —— View job PR Review: fix(core): exclude stale entity rows from embedding coverage statsTasks:
SummaryThe latest commit ( Two minor observations remain from the previous review that are still unaddressed: Remaining Observations🟡 Redundant
|
| Item | Status |
|---|---|
| Follows Basic Memory coding conventions (CLAUDE.md) | ✅ |
| Python 3.12+ type annotations and async patterns | ✅ |
SQLAlchemy 2.0 best practices (text(), parameterized queries) |
✅ |
| 100-character line length | ✅ |
| No commented-out code | ✅ |
| Literate programming style (decision point comments) | ✅ |
| Unit tests for new functions/methods | ✅ New test covers stale search_index exclusion; _purge_stale_search_rows exercised via reindex_vectors path |
| 100% test coverage | ✅ |
| Database changes include Alembic migrations | ✅ N/A (no schema changes) |
| No hardcoded secrets | ✅ |
| Input validation | ✅ |
| Windows-compatible file paths | ✅ N/A |
| No emoji/unicode in console/log output | ✅ |
The core fix is correct and all blocking issues are resolved. The two observations above are non-blocking but would tighten defensive coverage.
Summary
bm project infowas reporting false "missing embeddings" after a successfulbm reindexbecause stale rows insearch_indexandsearch_vector_chunks(from previously deleted entities) inflated thetotal_indexed_entitiescountentity_id IN (SELECT id FROM entity ...))bm reindexto purge orphaned search rows for deleted entitiesFixes #670
Test plan
pytest tests/services/test_project_service_embedding_status.py -v— new test for stale entity exclusionpytest tests/services/test_search_service.py -v— existing search service testsbm reindex,bm project infoshould not recommend reindex when all current entities have embeddings🤖 Generated with Claude Code