Skip to content

Conversation

@nleroy917
Copy link
Member

This directly addresses #597

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

New models submission:

  • Have you added an explanation of why it's important to include this model?
  • Have you added tests for the new model? Were canonical values for tests computed via the original model?
  • Have you added the code snippet for how canonical values were computed?
  • Have you successfully ran tests with your changes locally?

@coderabbitai
Copy link

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a new protein embedding feature to the fastembed library. A new module fastembed/bio/protein_embedding.py implements ONNX-based protein sequence embeddings with support for the ESM-2 model (facebook/esm2_t12_35M_UR50D). The implementation includes a tokenizer for protein sequences, abstract and concrete embedding classes for ONNX model loading and inference, output post-processing with mean pooling and normalization, multi-worker orchestration support, and a high-level public API. The ProteinEmbedding class is exported through the module hierarchy by updating fastembed/bio/__init__.py and fastembed/__init__.py.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: adding ESM2 model support for protein embeddings, which aligns with the new ProteinEmbedding implementation and model introduction.
Description check ✅ Passed The description references issue #597 and includes relevant contributor checklist items for new feature/model submissions that relate to the protein embedding implementation being added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @fastembed/bio/protein_embedding.py:
- Around line 65-66: Truncation currently slices input_ids to self.max_length
and can drop the EOS token; change the logic so when len(input_ids) >
self.max_length you preserve the EOS: if you have a known EOS id (use
self.eos_token_id) truncate to self.max_length - 1 and set input_ids[-1] =
self.eos_token_id, or if the original input_ids contained an EOS token, move or
copy that EOS into the final position instead of simply slicing; update the
block around input_ids and self.max_length in protein_embedding.py to implement
this EOS-preserving truncation.
🧹 Nitpick comments (2)
fastembed/bio/protein_embedding.py (2)

204-204: Consider replacing assert with explicit exception.

assert statements are stripped in optimized mode (python -O). While this is internal code, a proper check prevents silent failures if the method is called before _load_onnx_model.

Suggested fix
-        assert self.tokenizer is not None
+        if self.tokenizer is None:
+            raise RuntimeError("Tokenizer not initialized. Call _load_onnx_model first.")

367-367: Annotate mutable class attribute with ClassVar.

Per static analysis, mutable class attributes should use ClassVar to clarify they are not instance attributes.

Suggested fix
+from typing import ClassVar
+
 class ProteinEmbedding(ProteinEmbeddingBase):
-    EMBEDDINGS_REGISTRY: list[Type[ProteinEmbeddingBase]] = [OnnxProteinEmbedding]
+    EMBEDDINGS_REGISTRY: ClassVar[list[Type[ProteinEmbeddingBase]]] = [OnnxProteinEmbedding]
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 800f388 and ad715dc.

📒 Files selected for processing (3)
  • fastembed/__init__.py
  • fastembed/bio/__init__.py
  • fastembed/bio/protein_embedding.py
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
  • ProteinEmbedding (355-447)
fastembed/bio/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
  • ProteinEmbedding (355-447)
🪛 Ruff (0.14.10)
fastembed/bio/protein_embedding.py

148-148: Avoid specifying long messages outside the exception class

(TRY003)


192-192: Avoid specifying long messages outside the exception class

(TRY003)


195-195: Unused method argument: kwargs

(ARG002)


224-224: Unused method argument: kwargs

(ARG002)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Unused method argument: parallel

(ARG002)


367-367: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


425-428: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (8)
fastembed/bio/protein_embedding.py (6)

14-25: LGTM!

Model metadata is well-defined with appropriate fields for ESM-2: dimension (480), source, license (MIT), and required files.


101-158: LGTM!

Base class properly defines the interface with appropriate caching for embedding_size and case-insensitive model lookup.


223-239: LGTM!

Mean pooling implementation is correct: expands attention mask for broadcasting, computes masked sum, clips to avoid division by zero, and normalizes the output.


301-327: LGTM!

The embed method correctly handles single/iterable inputs, lazy model loading, and batch processing. The unused parallel parameter is appropriately documented as "not yet supported" for this WIP.


333-352: LGTM!

Worker correctly initializes with threads=1 for parallel worker processes and properly yields batch results with indices.


385-428: LGTM!

The initialization properly iterates through the registry with case-insensitive matching and provides a helpful error message with guidance on listing supported models.

fastembed/bio/__init__.py (1)

1-3: LGTM!

Clean re-export aligns with the usage example in ProteinEmbedding's docstring.

fastembed/__init__.py (1)

3-3: LGTM!

Import and __all__ export follow the established pattern for other embedding types in this package.

Also applies to: 23-23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant