-
Notifications
You must be signed in to change notification settings - Fork 168
Add ESM2 model for protein embeddings (WIP) #598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughThis pull request introduces a new protein embedding feature to the fastembed library. A new module Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @fastembed/bio/protein_embedding.py:
- Around line 65-66: Truncation currently slices input_ids to self.max_length
and can drop the EOS token; change the logic so when len(input_ids) >
self.max_length you preserve the EOS: if you have a known EOS id (use
self.eos_token_id) truncate to self.max_length - 1 and set input_ids[-1] =
self.eos_token_id, or if the original input_ids contained an EOS token, move or
copy that EOS into the final position instead of simply slicing; update the
block around input_ids and self.max_length in protein_embedding.py to implement
this EOS-preserving truncation.
🧹 Nitpick comments (2)
fastembed/bio/protein_embedding.py (2)
204-204: Consider replacingassertwith explicit exception.
assertstatements are stripped in optimized mode (python -O). While this is internal code, a proper check prevents silent failures if the method is called before_load_onnx_model.Suggested fix
- assert self.tokenizer is not None + if self.tokenizer is None: + raise RuntimeError("Tokenizer not initialized. Call _load_onnx_model first.")
367-367: Annotate mutable class attribute withClassVar.Per static analysis, mutable class attributes should use
ClassVarto clarify they are not instance attributes.Suggested fix
+from typing import ClassVar + class ProteinEmbedding(ProteinEmbeddingBase): - EMBEDDINGS_REGISTRY: list[Type[ProteinEmbeddingBase]] = [OnnxProteinEmbedding] + EMBEDDINGS_REGISTRY: ClassVar[list[Type[ProteinEmbeddingBase]]] = [OnnxProteinEmbedding]
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
fastembed/__init__.pyfastembed/bio/__init__.pyfastembed/bio/protein_embedding.py
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
ProteinEmbedding(355-447)
fastembed/bio/__init__.py (1)
fastembed/bio/protein_embedding.py (1)
ProteinEmbedding(355-447)
🪛 Ruff (0.14.10)
fastembed/bio/protein_embedding.py
148-148: Avoid specifying long messages outside the exception class
(TRY003)
192-192: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Unused method argument: kwargs
(ARG002)
224-224: Unused method argument: kwargs
(ARG002)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
305-305: Unused method argument: parallel
(ARG002)
367-367: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
425-428: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (8)
fastembed/bio/protein_embedding.py (6)
14-25: LGTM!Model metadata is well-defined with appropriate fields for ESM-2: dimension (480), source, license (MIT), and required files.
101-158: LGTM!Base class properly defines the interface with appropriate caching for
embedding_sizeand case-insensitive model lookup.
223-239: LGTM!Mean pooling implementation is correct: expands attention mask for broadcasting, computes masked sum, clips to avoid division by zero, and normalizes the output.
301-327: LGTM!The
embedmethod correctly handles single/iterable inputs, lazy model loading, and batch processing. The unusedparallelparameter is appropriately documented as "not yet supported" for this WIP.
333-352: LGTM!Worker correctly initializes with
threads=1for parallel worker processes and properly yields batch results with indices.
385-428: LGTM!The initialization properly iterates through the registry with case-insensitive matching and provides a helpful error message with guidance on listing supported models.
fastembed/bio/__init__.py (1)
1-3: LGTM!Clean re-export aligns with the usage example in
ProteinEmbedding's docstring.fastembed/__init__.py (1)
3-3: LGTM!Import and
__all__export follow the established pattern for other embedding types in this package.Also applies to: 23-23
This directly addresses #597
All Submissions:
New Feature Submissions:
pre-commitwithpip3 install pre-commitand set up hooks withpre-commit install?New models submission: