Commit 716ab7b

phernandez and claude committed
feat: Add streaming checksum computation for large files (Phase 1)
Implement chunked checksum computation to prevent OOM on large files.
Part of SPEC-19 Phase 1 memory optimization.

Implementation:
- Add _compute_checksum_streaming() with 64KB chunk reading
- Update _compute_checksum_async() to auto-select streaming for files >1MB
- Maintains constant memory usage regardless of file size
- 16MB PDF now uses 64KB memory instead of 16MB

Testing (4 new tests):
- test_compute_checksum_streaming_equivalence: Verify streaming produces same result
- test_compute_checksum_large_file_uses_streaming: Confirm >1MB uses streaming
- test_compute_checksum_small_file_uses_fast_path: Confirm <1MB uses fast path
- test_compute_checksum_streaming_binary_files: Verify binary file handling

Benefits:
- Prevents OOM on projects with large PDFs and images
- Constant memory footprint for checksumming
- Works well with network filesystems (TigrisFS)
- Backward compatible - transparent to callers
- Semaphore already limits concurrent operations

All tests passing (10 total: 6 streaming scan + 4 checksum tests)

Related: #382 (Optimize memory for large file syncs)
Part of: SPEC-19 Phase 1 Core Fixes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
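The chunked-hashing technique the commit describes can be shown in isolation: read the file in fixed-size chunks so peak memory stays near the chunk size rather than the file size. This is an illustrative standalone sketch, not the method from the diff below; `streaming_sha256` is a hypothetical name.

```python
import hashlib
import tempfile
from pathlib import Path


def streaming_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks; peak memory stays near chunk_size."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):  # walrus loop stops at EOF
            hasher.update(chunk)
    return hasher.hexdigest()


# Chunked and whole-file hashing produce identical digests.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_000_000)  # ~2 MB, above the commit's 1 MB threshold
    tmp_path = Path(tmp.name)

assert streaming_sha256(tmp_path) == hashlib.sha256(tmp_path.read_bytes()).hexdigest()
tmp_path.unlink()
```

SHA-256 is incremental by design, which is why feeding it 64KB at a time yields exactly the same digest as hashing the whole buffer at once.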
1 parent 7a16908 commit 716ab7b

3 files changed: 193 additions & 14 deletions

specs/SPEC-19 Sync Performance and Memory Optimization.md

Lines changed: 13 additions & 13 deletions
@@ -402,32 +402,32 @@ ALTER TABLE entity ADD COLUMN size INTEGER;
 ### Phase 1: Core Fixes
 
 **mtime-based scanning**:
-- [ ] Add mtime/size columns to Entity model
-- [ ] Database migration (alembic)
-- [ ] Update `scan_directory()` to use stat()
-- [ ] Update `scan()` to compare mtime/size
-- [ ] Only compute checksums for changed files
-- [ ] Unit tests for mtime comparison
+- [x] Add mtime/size columns to Entity model (completed in Phase 0.5)
+- [x] Database migration (alembic) (completed in Phase 0.5)
+- [ ] Refactor `scan()` to use streaming architecture with mtime/size comparison
+- [ ] Update `_process_file()` to store mtime/size in database on upsert
+- [ ] Only compute checksums for changed files (mtime/size differ)
+- [ ] Unit tests for mtime comparison logic
 - [ ] Integration test with 1,000 files
 
 **Streaming checksums**:
-- [ ] Implement `_compute_checksum_streaming()`
+- [ ] Implement `_compute_checksum_streaming()` with chunked reading
 - [ ] Add file size threshold logic (1MB)
 - [ ] Test with large files (16MB PDF)
 - [ ] Verify memory usage stays constant
-- [ ] Test checksum equivalence
+- [ ] Test checksum equivalence (streaming vs non-streaming)
 
 **Bounded concurrency**:
-- [ ] Add semaphore (10 concurrent)
+- [ ] Add semaphore (10 concurrent) to `_read_file_async()`
 - [ ] Add LRU cache for failures (100 max)
 - [ ] Review thread pool size configuration
 - [ ] Load test with 2,000+ files
 - [ ] Verify <500MB peak memory
 
-**cleanup**
-- [ ] remove sync status service
-- [ ] db state - don't select all entities in project - basic_memory.sync.sync_service.SyncService.get_db_file_state
-- [ ] use aiofiles in file_service for file io. Don't block reading files in async loop
+**Cleanup & Optimization**:
+- [ ] Eliminate `get_db_file_state()` - no upfront SELECT all entities
+- [ ] Remove sync status service (if unused)
+- [ ] Consider aiofiles for non-blocking I/O (future enhancement)
 
 ### Phase 2: Cloud Fixes
 
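The "Bounded concurrency" items in the checklist above can be sketched with an `asyncio.Semaphore`. The limit of 10 mirrors the checklist, but the function names and structure here are hypothetical, not the repository's code:

```python
import asyncio
import hashlib
from pathlib import Path

# A semaphore caps how many file reads are in flight at once
# (the checklist's "10 concurrent"), bounding peak memory.
_file_semaphore = asyncio.Semaphore(10)


async def bounded_checksum(path: Path) -> str:
    """Compute a checksum while holding a semaphore slot, so at most
    10 reads run concurrently no matter how many tasks are scheduled."""
    async with _file_semaphore:
        # Offload the blocking read/hash to the default thread pool.
        return await asyncio.to_thread(
            lambda: hashlib.sha256(path.read_bytes()).hexdigest()
        )


async def checksum_many(paths: list[Path]) -> dict[Path, str]:
    """Fan out over many files; the semaphore throttles the fan-out."""
    results = await asyncio.gather(*(bounded_checksum(p) for p in paths))
    return dict(zip(paths, results))
```

The point of the pattern is that `gather()` may schedule thousands of tasks, but only 10 ever hold file contents in memory at the same time.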

src/basic_memory/sync/sync_service.py

Lines changed: 55 additions & 1 deletion
@@ -150,12 +150,66 @@ async def _read_file_async(self, file_path: Path) -> str:
         loop = asyncio.get_event_loop()
         return await loop.run_in_executor(self._thread_pool, file_path.read_text, "utf-8")
 
+    async def _compute_checksum_streaming(self, path: str, chunk_size: int = 65536) -> str:
+        """Compute file checksum using chunked reading for large files.
+
+        Reads file in 64KB chunks to maintain constant memory usage regardless of file size.
+        Critical for handling large PDFs and images without causing OOM.
+
+        Args:
+            path: Relative file path
+            chunk_size: Size of chunks to read (default 64KB)
+
+        Returns:
+            SHA256 hexdigest of file content
+        """
+
+        def _sync_compute_checksum_streaming(path_str: str) -> str:
+            """Synchronous streaming checksum computation for thread pool."""
+            import hashlib
+
+            path_obj = self.file_service.base_path / path_str
+            hasher = hashlib.sha256()
+
+            # Always use binary mode for streaming to handle all file types
+            with open(path_obj, "rb") as f:
+                while chunk := f.read(chunk_size):
+                    hasher.update(chunk)
+
+            return hasher.hexdigest()
+
+        async with self._file_semaphore:
+            loop = asyncio.get_event_loop()
+            return await loop.run_in_executor(
+                self._thread_pool, _sync_compute_checksum_streaming, path
+            )
+
     async def _compute_checksum_async(self, path: str) -> str:
-        """Compute file checksum in thread pool to avoid blocking the event loop.
+        """Compute file checksum with automatic streaming for large files.
 
         Uses semaphore to limit concurrent file reads and prevent OOM on large projects.
+        For files >1MB, uses chunked streaming to maintain constant memory usage.
+
+        Args:
+            path: Relative file path
+
+        Returns:
+            SHA256 hexdigest of file content
         """
+        # Check file size to decide whether to stream
+        path_obj = self.file_service.base_path / path
+        try:
+            file_stat = path_obj.stat()
+            # Use streaming for files larger than 1MB
+            if file_stat.st_size > 1_048_576:  # 1MB threshold
+                logger.trace(
+                    f"Using streaming checksum for large file: {path}, size={file_stat.st_size}"
+                )
+                return await self._compute_checksum_streaming(path)
+        except OSError as e:
+            logger.warning(f"Could not stat file {path}: {e}, falling back to non-streaming")
 
+        # Small files: use existing fast path
         def _sync_compute_checksum(path_str: str) -> str:
             # Synchronous version for thread pool execution
             path_obj = self.file_service.base_path / path_str
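The size-based dispatch in the diff above (whole-file read under 1MB, chunked read over it) can be sketched outside the class. Note that on Python ≥3.11 the standard library offers `hashlib.file_digest()`, which performs the chunked reading internally; `compute_checksum` and `STREAMING_THRESHOLD` below are illustrative names, not the repository's API:

```python
import hashlib
import sys
from pathlib import Path

STREAMING_THRESHOLD = 1_048_576  # 1 MB, matching the threshold in the diff


def compute_checksum(path: Path) -> str:
    """Size-based dispatch: hash small files in one read, large files in chunks."""
    if path.stat().st_size > STREAMING_THRESHOLD:
        with open(path, "rb") as f:
            if sys.version_info >= (3, 11):
                # Stdlib helper that reads the file in chunks internally.
                return hashlib.file_digest(f, "sha256").hexdigest()
            hasher = hashlib.sha256()
            while chunk := f.read(65536):
                hasher.update(chunk)
            return hasher.hexdigest()
    # Fast path: single read is cheaper for small files.
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Both branches return the same digest for the same bytes; the split exists purely to trade one syscall-heavy chunked loop against one large allocation.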

tests/sync/test_sync_service.py

Lines changed: 125 additions & 0 deletions
@@ -1905,3 +1905,128 @@ async def test_scan_directory_streaming_non_markdown_files(
     assert "image.png" in results
     assert "data.json" in results
     assert "script.py" in results
+
+
+@pytest.mark.asyncio
+async def test_compute_checksum_streaming_equivalence(
+    sync_service: SyncService, project_config: ProjectConfig
+):
+    """Test that streaming and non-streaming checksums produce identical results."""
+    project_dir = project_config.home
+
+    # Create test file with known content
+    test_content = "Test content for checksum validation" * 100  # Multi-line content
+    test_file = project_dir / "checksum_test.md"
+    await create_test_file(test_file, test_content)
+
+    rel_path = test_file.relative_to(project_dir).as_posix()
+
+    # Compute checksum using streaming method
+    streaming_checksum = await sync_service._compute_checksum_streaming(rel_path)
+
+    # Compute checksum using the unified method (which will use non-streaming for small files)
+    unified_checksum = await sync_service._compute_checksum_async(rel_path)
+
+    # Both should produce identical results
+    assert streaming_checksum == unified_checksum
+    assert len(streaming_checksum) == 64  # SHA256 hex digest length
+
+
+@pytest.mark.asyncio
+async def test_compute_checksum_large_file_uses_streaming(
+    sync_service: SyncService, project_config: ProjectConfig
+):
+    """Test that files >1MB automatically use streaming checksum computation."""
+    from unittest.mock import patch
+
+    project_dir = project_config.home
+
+    # Create a file larger than 1MB threshold
+    large_content = "x" * (1_048_577)  # Just over 1MB
+    large_file = project_dir / "large_file.pdf"
+    large_file.write_bytes(large_content.encode())
+
+    rel_path = large_file.relative_to(project_dir).as_posix()
+
+    # Track whether streaming method was called
+    streaming_called = False
+    original_streaming = sync_service._compute_checksum_streaming
+
+    async def mock_streaming(*args, **kwargs):
+        nonlocal streaming_called
+        streaming_called = True
+        return await original_streaming(*args, **kwargs)
+
+    with patch.object(
+        sync_service, "_compute_checksum_streaming", side_effect=mock_streaming
+    ):
+        checksum = await sync_service._compute_checksum_async(rel_path)
+
+    # Verify streaming was used
+    assert streaming_called, "Large file should use streaming checksum"
+    assert checksum is not None
+    assert len(checksum) == 64
+
+
+@pytest.mark.asyncio
+async def test_compute_checksum_small_file_uses_fast_path(
+    sync_service: SyncService, project_config: ProjectConfig
+):
+    """Test that files <1MB use fast non-streaming path."""
+    from unittest.mock import patch
+
+    project_dir = project_config.home
+
+    # Create a small file (under 1MB)
+    small_content = "Small file content"
+    small_file = project_dir / "small_file.md"
+    await create_test_file(small_file, small_content)
+
+    rel_path = small_file.relative_to(project_dir).as_posix()
+
+    # Track whether streaming method was called
+    streaming_called = False
+    original_streaming = sync_service._compute_checksum_streaming
+
+    async def mock_streaming(*args, **kwargs):
+        nonlocal streaming_called
+        streaming_called = True
+        return await original_streaming(*args, **kwargs)
+
+    with patch.object(
+        sync_service, "_compute_checksum_streaming", side_effect=mock_streaming
+    ):
+        checksum = await sync_service._compute_checksum_async(rel_path)
+
+    # Verify streaming was NOT used for small file
+    assert not streaming_called, "Small file should use fast non-streaming path"
+    assert checksum is not None
+    assert len(checksum) == 64
+
+
+@pytest.mark.asyncio
+async def test_compute_checksum_streaming_binary_files(
+    sync_service: SyncService, project_config: ProjectConfig
+):
+    """Test that streaming checksum works correctly with binary files."""
+    project_dir = project_config.home
+
+    # Create a binary file with specific byte pattern
+    binary_content = bytes(range(256)) * 100  # 25.6KB of binary data
+    binary_file = project_dir / "binary_test.bin"
+    binary_file.write_bytes(binary_content)
+
+    rel_path = binary_file.relative_to(project_dir).as_posix()
+
+    # Compute checksum using streaming
+    streaming_checksum = await sync_service._compute_checksum_streaming(rel_path)
+
+    # Verify checksum is valid
+    assert streaming_checksum is not None
+    assert len(streaming_checksum) == 64
+
+    # Compute expected checksum manually for verification
+    import hashlib
+
+    expected_checksum = hashlib.sha256(binary_content).hexdigest()
+    assert streaming_checksum == expected_checksum
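The spec's remaining item, "Verify memory usage stays constant", can be checked at the Python level with `tracemalloc`. This is a hedged standalone sketch (it measures only Python-heap allocations, not OS buffer cache, and `chunked_sha256` is an illustrative name, not the repository's function):

```python
import hashlib
import tempfile
import tracemalloc
from pathlib import Path


def chunked_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in 64KB chunks, as in the streaming implementation."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


# Write an 8 MB file, then measure peak Python allocations while hashing it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 8_000_000)
    p = Path(tmp.name)

tracemalloc.start()
chunked_sha256(p)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
p.unlink()

# Peak allocation stays near one chunk (~64 KB), far below the 8 MB file size.
assert peak < 1_000_000
```

A whole-file `read_bytes()` variant measured the same way would peak near the full file size, which is the difference the commit's 16MB-PDF claim is about.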
