@praateekmahajan praateekmahajan commented Nov 25, 2025

Description

This pull request refactors the workflow interfaces for the deduplication pipelines (exact, fuzzy, and semantic) to standardize their outputs and improve usability.

Core API and Interface Refactoring

  • Introduced a new WorkflowRunResult dataclass in nemo_curator/pipeline/workflow.py to encapsulate workflow outputs, pipeline task mappings, and metadata. Also added an abstract WorkflowBase class to standardize workflow interfaces.
  • Updated all deduplication workflow classes (ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow, SemanticDeduplicationWorkflow) to inherit from WorkflowBase and to return a WorkflowRunResult from their run methods, instead of returning None or a dictionary.
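
The shape of the interface described above can be sketched roughly as follows. This is a minimal illustration, not the code from nemo_curator/pipeline/workflow.py: only the class names and the workflow_name argument appear in this PR, while the pipeline_tasks and metadata field names, the run signature, and ToyDeduplicationWorkflow are assumptions made for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkflowRunResult:
    """Illustrative result container; only `workflow_name` is confirmed by the PR."""

    workflow_name: str
    # Assumed field name: maps a pipeline name to the tasks it produced.
    pipeline_tasks: dict[str, list[Any]] = field(default_factory=dict)
    # Assumed field name: free-form run metadata such as timings and counts.
    metadata: dict[str, Any] = field(default_factory=dict)


class WorkflowBase(ABC):
    """Illustrative abstract base: every workflow returns a WorkflowRunResult."""

    @abstractmethod
    def run(self, *args: Any, **kwargs: Any) -> WorkflowRunResult:
        ...


class ToyDeduplicationWorkflow(WorkflowBase):
    """Hypothetical workflow used only to exercise the interface."""

    def __init__(self, records: list[str]) -> None:
        self.records = records

    def run(self) -> WorkflowRunResult:
        result = WorkflowRunResult(workflow_name="toy_exact_deduplication")
        # dict.fromkeys keeps the first occurrence of each record, in order.
        unique = list(dict.fromkeys(self.records))
        result.pipeline_tasks["identification"] = unique
        result.metadata["num_duplicates"] = len(self.records) - len(unique)
        return result


toy_result = ToyDeduplicationWorkflow(["a", "b", "a", "c", "b"]).run()
```

Returning one typed object instead of None or a bare dictionary means callers can rely on the same attributes regardless of which deduplication workflow they ran.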

Workflow Output and Metadata Improvements

  • Refactored the run methods of all workflows to collect and record detailed timing and result metadata (such as per-stage execution times and duplicate counts) into the WorkflowRunResult object.
  • Each pipeline stage now adds its results and timing to the result object.
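
One way to picture the per-stage bookkeeping is the sketch below. It is not the implementation in this PR: the timed_stage helper, the run_toy_workflow function, and the metadata key names are all hypothetical, and a plain dictionary stands in for the metadata held by the result object.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def timed_stage(metadata: dict[str, Any], stage: str) -> Iterator[None]:
    """Record the wall-clock duration of a stage under '<stage>_time'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Written even if the stage raises, so partial runs still report timing.
        metadata[f"{stage}_time"] = time.perf_counter() - start


def run_toy_workflow(records: list[str]) -> dict[str, Any]:
    """Hypothetical two-stage workflow that fills in timing and count metadata."""
    metadata: dict[str, Any] = {}
    total_start = time.perf_counter()

    with timed_stage(metadata, "identification"):
        seen: set[str] = set()
        duplicates: list[str] = []
        for record in records:
            if record in seen:
                duplicates.append(record)
            else:
                seen.add(record)

    with timed_stage(metadata, "removal"):
        kept = [record for record in dict.fromkeys(records)]

    metadata["num_duplicates"] = len(duplicates)
    metadata["num_kept"] = len(kept)
    metadata["total_time"] = time.perf_counter() - total_start
    return metadata


toy_metadata = run_toy_workflow(["x", "y", "x"])
```

Because the timing is written in a finally clause, each stage's key exists as soon as the stage has been entered, whether or not it succeeded.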

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@praateekmahajan

/ok to test af0787c

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines 447 to +449
total_start_time = time.time()
workflow_result = WorkflowRunResult(workflow_name="text_semantic_deduplication")
num_duplicates_identified = 0

The timing variables (embedding_time, semantic_time, removal_time, total_time) are assigned inside the try block (lines 472, 495, 509, 526) but referenced after the except block (lines 550-554). If an exception is caught before one of them is assigned, the later reference raises a NameError. Initializing them before the try block avoids this.

Suggested change
- total_start_time = time.time()
- workflow_result = WorkflowRunResult(workflow_name="text_semantic_deduplication")
- num_duplicates_identified = 0
+ total_start_time = time.time()
+ workflow_result = WorkflowRunResult(workflow_name="text_semantic_deduplication")
+ num_duplicates_identified = 0
+ embedding_time = 0.0
+ semantic_time = 0.0
+ removal_time = 0.0
+ total_time = 0.0
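
The failure mode the reviewer describes can be reproduced in isolation with the sketch below. It assumes the except block swallows the exception rather than re-raising it. Both function names are hypothetical and the body is a stand-in for the real stages. For a local variable Python raises UnboundLocalError, which is a subclass of NameError.

```python
import time


def run_without_defaults(fail: bool) -> float:
    """Assigns the timing only inside the try block."""
    try:
        start = time.perf_counter()
        if fail:
            raise RuntimeError("stage failed")
        embedding_time = time.perf_counter() - start
    except RuntimeError:
        pass  # exception swallowed, execution continues below
    # If the stage failed, embedding_time was never bound.
    return embedding_time


def run_with_defaults(fail: bool) -> float:
    """Initializes the timing before the try block, as the suggestion does."""
    embedding_time = 0.0
    try:
        start = time.perf_counter()
        if fail:
            raise RuntimeError("stage failed")
        embedding_time = time.perf_counter() - start
    except RuntimeError:
        pass
    return embedding_time


try:
    run_without_defaults(fail=True)
    unbound_error_raised = False
except NameError:
    unbound_error_raised = True

safe_value = run_with_defaults(fail=True)
```

With the defaults in place, a failed run reports 0.0 for any stage that did not complete instead of failing a second time while building the summary.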

@praateekmahajan praateekmahajan merged commit a8e0040 into NVIDIA-NeMo:main Jan 13, 2026
47 checks passed

3 participants