[BUG] Code indexing chunker drops function names when body is large #10715

@rossdonald

Description

Problem (one or two sentences)

The code indexing parser is dropping class and function names when parsing large code blocks. This occurs because the parser splits large nodes into their children, but the individual signature components (like export, class keyword, identifier, implements clause) are each smaller than MIN_BLOCK_CHARS (50 characters) and are therefore discarded.

Function and class names are important for semantic search and should not be dropped from the index.

Analysis

Constants

From src/services/code-index/constants/index.ts:

  • MIN_BLOCK_CHARS = 50 - Minimum characters required for a code block to be included
  • MAX_BLOCK_CHARS = 1000 - Maximum characters per block before splitting
  • MAX_CHARS_TOLERANCE_FACTOR = 1.15 - 15% tolerance (effective max: 1150 chars)
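The three constants above split node sizes into three regimes. The following is a minimal sketch of that classification; `classifyBySize` and `SizeClass` are hypothetical names used for illustration only and do not exist in the codebase.

```typescript
// Values as quoted above from src/services/code-index/constants/index.ts.
const MIN_BLOCK_CHARS = 50
const MAX_BLOCK_CHARS = 1000
const MAX_CHARS_TOLERANCE_FACTOR = 1.15

// Hypothetical helper, not part of the parser: which regime does a node fall in?
type SizeClass = "too-small" | "fits" | "needs-split"

function classifyBySize(textLength: number): SizeClass {
    if (textLength < MIN_BLOCK_CHARS) return "too-small" // currently discarded
    if (textLength > MAX_BLOCK_CHARS * MAX_CHARS_TOLERANCE_FACTOR) return "needs-split"
    return "fits"
}
```

A 22-character `implements ITestParser` clause lands in "too-small", which is why it never reaches the index once its parent is split.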

The Bug Flow

In src/services/code-index/processors/parser.ts, the parseContent method:

  1. Initial Capture: Tree-sitter queries capture class declarations, method definitions, etc.

  2. Queue Processing: Nodes are processed from a queue:

    const queue: Node[] = Array.from(captures).map((capture) => capture.node)

  3. Size Check: Each node is checked against MIN_BLOCK_CHARS (line 180):

    if (currentNode.text.length >= MIN_BLOCK_CHARS) {

    A comment under this if statement confirms that smaller nodes are discarded:

    // Nodes smaller than minBlockChars are ignored

  4. Splitting Logic: If a node exceeds the max size, it's split:

    if (currentNode.text.length > MAX_BLOCK_CHARS * MAX_CHARS_TOLERANCE_FACTOR) {
        if (currentNode.children.filter((child) => child !== null).length > 0) {
            // If it has children, process them instead
            queue.push(...currentNode.children.filter((child) => child !== null))
        } else {

  5. The Problem: When a large function or class declaration is split:

Example Input:

export class TestParser implements ITestParser {
    // ... large body over 1150 chars ...
}

  • The class node (e.g., "export class TestParser implements ITestParser { ... }") exceeds 1150 chars
  • Its children are pushed to the queue:
    • export node (6 chars) - DISCARDED (< 50)
    • class keyword node (5 chars) - DISCARDED (< 50)
    • TestParser identifier node (10 chars) - DISCARDED (< 50)
    • implements ITestParser node (22 chars) - DISCARDED (< 50)
    • Class body node (large) - KEPT (≥ 50)
  • The signature information is lost because each individual part is below the minimum
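The flow above can be reproduced without tree-sitter. The sketch below is a simplified simulation, not the parser's actual code: `FakeNode`, `simulateQueue`, and the node type names are hypothetical stand-ins, and the class is modeled with flat children as in the list above.

```typescript
// Thresholds as quoted in this issue.
const MIN_BLOCK_CHARS = 50
const MAX_BLOCK_CHARS = 1000
const MAX_CHARS_TOLERANCE_FACTOR = 1.15

// Hypothetical stand-in for a tree-sitter node; only the fields used here.
interface FakeNode {
    type: string
    text: string
    children: FakeNode[]
}

function simulateQueue(roots: FakeNode[]): { kept: string[]; dropped: string[] } {
    const queue: FakeNode[] = [...roots]
    const kept: string[] = []
    const dropped: string[] = []
    while (queue.length > 0) {
        const node = queue.shift() as FakeNode
        if (node.text.length < MIN_BLOCK_CHARS) {
            dropped.push(node.type) // this is where the signature parts vanish
            continue
        }
        const tooLarge = node.text.length > MAX_BLOCK_CHARS * MAX_CHARS_TOLERANCE_FACTOR
        if (tooLarge && node.children.length > 0) {
            queue.push(...node.children) // split: children replace the parent
            continue
        }
        // Fits (or is an oversized leaf, whose handling is elided in the excerpt above).
        kept.push(node.type)
    }
    return { kept, dropped }
}

const leaf = (type: string, text: string): FakeNode => ({ type, text, children: [] })
const body = "{" + "x".repeat(1118) + "}" // 1120 chars: within limits on its own
const parts: FakeNode[] = [
    leaf("export", "export"),
    leaf("class", "class"),
    leaf("identifier", "TestParser"),
    leaf("implements_clause", "implements ITestParser"),
    leaf("class_body", body),
]
const classNode: FakeNode = {
    type: "class_declaration",
    text: parts.map((p) => p.text).join(" "), // 1167 chars, over the 1150 limit
    children: parts,
}

const result = simulateQueue([classNode])
// result.kept    -> ["class_body"]
// result.dropped -> ["export", "class", "identifier", "implements_clause"]
```

Only the body survives; every token that identifies the class lands in `dropped`.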

Impact

This bug affects:

  1. Class/Method/Function declarations with large bodies
  2. Any code structure where the signature is small but the body is large

Solution

No node should be discarded. Nodes smaller than MIN_BLOCK_CHARS should instead be appended to a shared chunk, which keeps accumulating small nodes until it reaches the minimum block size or the small nodes are exhausted.
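One possible reading of this proposal is sketched below. It is an assumption about how the merge could work, not the project's implementation: `Piece` and `mergeSmallSiblings` are hypothetical names, siblings are assumed to arrive in source order, and a pending chunk is emitted even if it never reaches the minimum so that nothing is lost.

```typescript
// Hypothetical shape for a sibling node's text plus its line range.
interface Piece {
    text: string
    startLine: number
    endLine: number
}

// Merge runs of adjacent small siblings instead of discarding them.
function mergeSmallSiblings(siblings: Piece[], minChars: number): Piece[] {
    const out: Piece[] = []
    let pending: Piece | null = null
    for (const s of siblings) {
        if (s.text.length >= minChars) {
            // A regular-sized node ends the current run of small nodes.
            if (pending !== null) out.push(pending)
            pending = null
            out.push(s)
            continue
        }
        pending =
            pending === null
                ? { ...s }
                : { text: pending.text + " " + s.text, startLine: pending.startLine, endLine: s.endLine }
        if (pending.text.length >= minChars) {
            out.push(pending) // minimum reached: emit the merged chunk
            pending = null
        }
    }
    if (pending !== null) out.push(pending) // small nodes exhausted: keep what is left
    return out
}

const merged = mergeSmallSiblings(
    [
        { text: "export", startLine: 1, endLine: 1 },
        { text: "class", startLine: 1, endLine: 1 },
        { text: "TestParser", startLine: 1, endLine: 1 },
        { text: "implements ITestParser", startLine: 1, endLine: 1 },
        { text: "{" + "x".repeat(1118) + "}", startLine: 1, endLine: 40 },
    ],
    50,
)
// merged[0].text is "export class TestParser implements ITestParser"; merged[1] is the body.
```

In this example the merged signature is 46 characters, still under the minimum, yet it is retained. An alternative design would attach the signature to the first chunk of the body so that each method chunk carries its enclosing class name.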

Context (who is affected and when)

Users who use codebase indexing.

Reproduction steps

Enable codebase indexing and index a file containing a function or class whose body pushes the declaration over the effective maximum block size (1150 characters). Example:

export class TestParser implements ITestParser {
    // ... large body over 1150 chars ...
}

To observe the missing chunks, developers can add console log statements after the chunks are created, sort the chunks by start line, print them all, and compare the output with the original file.
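The comparison can also be automated. The helper below is a rough sketch under assumptions: `ChunkRange`, `Gap`, and `findGaps` are hypothetical names, and line numbers are assumed to be 1-based and inclusive, which may not match the fields on the parser's actual chunk objects.

```typescript
// Hypothetical minimal view of a chunk: just its line range.
interface ChunkRange {
    startLine: number
    endLine: number
}

interface Gap {
    fromLine: number
    toLine: number
}

// Return the line ranges of the file that no chunk covers.
function findGaps(chunks: ChunkRange[], totalLines: number): Gap[] {
    const sorted = [...chunks].sort((a, b) => a.startLine - b.startLine)
    const gaps: Gap[] = []
    let nextExpected = 1
    for (const c of sorted) {
        if (c.startLine > nextExpected) {
            gaps.push({ fromLine: nextExpected, toLine: c.startLine - 1 })
        }
        nextExpected = Math.max(nextExpected, c.endLine + 1)
    }
    if (nextExpected <= totalLines) {
        gaps.push({ fromLine: nextExpected, toLine: totalLines })
    }
    return gaps
}

// A 30-line file where only two method bodies were indexed.
const gaps = findGaps(
    [
        { startLine: 15, endLine: 29 },
        { startLine: 2, endLine: 14 },
    ],
    30,
)
// gaps covers line 1 (the class signature) and line 30 (the closing brace).
```

This check works at line granularity only. It will not notice tokens dropped from a line that another chunk partially covers, and gaps on blank lines or closing braces are expected.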

Expected result

No node should be dropped, even if it is small. Class and function names should be included in the index.

Actual result

The class or function signature, including its name, is missing from the indexed chunks.

Variations tried (optional)

No response

App Version

3.40.0

API Provider (optional)

None

Model Used (optional)

No response

Roo Code Task Links (optional)

No response

Relevant logs or errors (optional)
