feat: add support for line-by-line evaluation #1481
Merged
AAgnihotry merged 9 commits into main on Mar 25, 2026
This commit adds line-by-line evaluation capability to output evaluators, allowing them to evaluate multi-line outputs on a per-line basis and provide granular feedback with partial credit scoring.

Key changes:
- Added lineByLineEvaluator config flag to OutputEvaluatorConfig
- Added lineDelimiter config to customize split behavior (default: "\n")
- Implemented _evaluate_line_by_line() method in BaseOutputEvaluator
- Fixed runtime aggregation to handle line-by-line sub-results
- Fixed targetOutputKey wrapping for individual line evaluations
- Added sample agent demonstrating the feature (samples/line_by_line_test)

Benefits:
- Provides partial credit (e.g., 2/3 lines correct = 0.67 score)
- More granular feedback with per-line results
- Useful for evaluating structured multi-line outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add lineByLineEvaluation and lineDelimiter fields to BaseLegacyEvaluator
- Implement _evaluate_line_by_line() method for legacy evaluators
- Add helper methods: _split_into_lines() and _get_actual_output()
- Add 5 comprehensive tests for legacy line-by-line evaluation
- Add legacy evaluator to line_by_line_test sample for validation
- Update sample README to document both new and legacy evaluators

This feature enables legacy evaluators (category/type based) to support line-by-line evaluation with partial credit scoring, matching the functionality already available in new evaluators (version/evaluatorTypeId based).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Import LineByLineEvaluationDetails in test file
- Add isinstance() checks to help mypy understand result.details type
- Fixes mypy union-attr errors in legacy evaluator tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change pyproject.toml to use local editable install instead of TestPyPI
- Fix legacy evaluator JSON to use integer enum values (category: 0, type: 1)
- Verified: All 3 evaluators (new line-by-line, regular, legacy line-by-line) work correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Chibionos requested changes on Mar 24, 2026
Chibionos left a comment:
Need to rearrange the functions; they are piled up in the core evaluator files.
Chibionos approved these changes on Mar 24, 2026
    "id": "LegacyLineByLineExactMatch",
    "category": 0,
    "type": 1,
    "name": "LegacyLineByLineExactMatch",
this should not break medline when they turn on URT for them.
We should also make sure we track this on the sprint board.
packages/uipath/samples/line_by_line_test/evaluations/evaluators/line-by-line-exact-match.json
Added isinstance() checks to help mypy understand that result.details is specifically a LineByLineEvaluationDetails object, not just the generic str | BaseModel | None type. This fixes mypy errors when accessing total_lines_actual and total_lines_expected attributes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
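The narrowing pattern that commit describes can be illustrated in isolation. This is a minimal sketch, not the project's code: the three dataclasses below are hypothetical stand-ins (the real details type is described as str | BaseModel | None), and only the two attribute names mentioned in the commit message are modeled.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class GenericDetails:
    """Hypothetical stand-in for any other structured details model."""
    message: str


@dataclass
class LineByLineEvaluationDetails:
    """Hypothetical stand-in; models only the attributes named in the commit."""
    total_lines_actual: int
    total_lines_expected: int


@dataclass
class EvaluationResult:
    """Hypothetical result type whose details field is a union."""
    score: float
    details: Union[str, GenericDetails, LineByLineEvaluationDetails, None] = None


def line_counts(result: EvaluationResult) -> Optional[tuple[int, int]]:
    """Return (actual, expected) line counts, or None for other detail types."""
    details = result.details
    # Without this check a type checker sees the full union and reports
    # union-attr errors on the attribute accesses below.
    if isinstance(details, LineByLineEvaluationDetails):
        return details.total_lines_actual, details.total_lines_expected
    return None
```

Inside the isinstance() branch, a type checker treats details as the narrower class, so the attribute accesses type-check without casts.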


Summary
Adds line-by-line evaluation capability to both new evaluators (version-based) and legacy evaluators (category/type-based), enabling per-line evaluation of multi-line outputs with partial credit scoring.
Problem
Current evaluators provide binary pass/fail results for multi-line outputs. If any part of the output is incorrect, the entire evaluation fails with a score of 0.0, even if most lines are correct.
Solution
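This PR introduces line-by-line evaluation that:
- Splits multi-line outputs on a configurable delimiter (default: \n)
- Evaluates each line individually and reports per-line results
- Awards partial credit instead of a binary pass/fail score
Key Changes
New Evaluators (Version-based)
- Added lineByLineEvaluator and lineDelimiter config options to OutputEvaluatorConfig
- Implemented _evaluate_line_by_line() method in BaseOutputEvaluator
Legacy Evaluators (Category/Type-based)
- Added lineByLineEvaluation and lineDelimiter fields to BaseLegacyEvaluator
- Implemented _evaluate_line_by_line() method for legacy evaluators
- Added helper methods _split_into_lines() and _get_actual_output()
Runtime & Bug Fixes
- Fixed runtime aggregation to handle line-by-line sub-results
- Fixed an issue where targetOutputKey wasn't properly applied to individual lines
Sample Project
- Added a sample agent demonstrating the feature (samples/line_by_line_test)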
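Benefits
- Partial credit (e.g., 2/3 lines correct = 0.67 score)
- More granular feedback with per-line results
- Useful for evaluating structured multi-line outputs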
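To make the partial-credit idea concrete, here is a minimal standalone sketch of per-line exact matching. It does not use the uipath evaluator classes; LineResult, split_into_lines, and evaluate_line_by_line are hypothetical names. Two choices are assumptions rather than documented behavior: missing or extra lines count as mismatches (the score is divided by the larger of the two line counts), and two empty outputs score 1.0.

```python
from dataclasses import dataclass


@dataclass
class LineResult:
    """Per-line outcome (hypothetical structure for illustration)."""
    line_number: int
    expected: str
    actual: str
    score: float


def split_into_lines(text: str, delimiter: str = "\n") -> list[str]:
    """Split text on the delimiter; an empty string yields no lines."""
    return text.split(delimiter) if text else []


def evaluate_line_by_line(
    actual: str, expected: str, delimiter: str = "\n"
) -> tuple[float, list[LineResult]]:
    """Score each line with exact match and average the per-line scores."""
    actual_lines = split_into_lines(actual, delimiter)
    expected_lines = split_into_lines(expected, delimiter)
    # Assumption: missing or extra lines are mismatches, so divide by the
    # larger of the two line counts.
    total = max(len(actual_lines), len(expected_lines))
    results = []
    for i in range(total):
        has_both = i < len(actual_lines) and i < len(expected_lines)
        exp = expected_lines[i] if i < len(expected_lines) else ""
        act = actual_lines[i] if i < len(actual_lines) else ""
        matched = has_both and exp == act
        results.append(LineResult(i + 1, exp, act, 1.0 if matched else 0.0))
    # Assumption: two empty outputs count as a full match.
    score = sum(r.score for r in results) / total if total else 1.0
    return score, results


score, results = evaluate_line_by_line("a\nb\nx", "a\nb\nc")
print(round(score, 2))  # 0.67: two of three lines match
```

A whole-output exact match on the same inputs would score 0.0, which is the binary behavior described under Problem.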
Example Usage
New Evaluators (Low-Code JSON config):
    {
      "version": "1.0",
      "evaluatorTypeId": "uipath-exact-match",
      "evaluatorConfig": {
        "name": "LineByLineExactMatch",
        "lineByLineEvaluator": true,
        "lineDelimiter": "\n"
      }
    }
Legacy Evaluators (Low-Code JSON config):
    {
      "category": 0,
      "type": 1,
      "name": "LegacyLineByLineExactMatch",
      "targetOutputKey": "result",
      "lineByLineEvaluation": true,
      "lineDelimiter": "\n"
    }
Coded Evaluator:
Test Results
Sample agent output showing all three evaluators:
Testing
Bugs Fixed
- Line-by-line evaluation when targetOutputKey != "*"
- mypy errors when accessing LineByLineEvaluationDetails attributes
🤖 Generated with Claude Code