Escape pipes and newlines in CSV to Markdown table cells #2035
Open
LeSingh1 wants to merge 1 commit into
CsvConverter wrote cell values straight into the Markdown table without escaping. A cell containing a literal pipe added an extra column separator, and a quoted field with an embedded newline split the row in two. Either case produces a malformed table whose data rows no longer match the header column count. Add an _escape_cell helper that escapes pipe as \| and flattens embedded newlines to spaces, and apply it to header and data cells.
Summary
CsvConverter writes cell values straight into the Markdown table without escaping. Two cases produce malformed output:
- A cell containing a literal | adds an extra column separator.
- A quoted field with an embedded newline splits the row in two.
In both cases the data rows no longer match the header column count, so the table renders incorrectly.
Reproduction
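A minimal sketch of the failure mode, assuming a converter that joins cells verbatim. The function `naive_csv_to_markdown` and the sample data are hypothetical stand-ins for illustration, not the actual CsvConverter code or the PR's own reproduction input.

```python
import csv
import io


def naive_csv_to_markdown(text: str) -> str:
    """Hypothetical stand-in for the unpatched behaviour: no escaping."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    # Cell values are written as-is, so a "|" inside a cell becomes a separator.
    lines += ["| " + " | ".join(row) + " |" for row in data]
    return "\n".join(lines)


sample = "name,note\nAlice,a|b\nBob,plain\n"
table = naive_csv_to_markdown(sample)
print(table)
```
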
The Alice row renders as three columns instead of two. A quoted multi-line field like "line1\nline2" similarly breaks into two rows.
Fix
Added an _escape_cell helper that escapes | as \| and flattens \r\n, \n, and \r to spaces, applied to header and data cells.
Tests
Added packages/markitdown/tests/test_csv_converter.py with test_pipe_in_cell_is_escaped (asserts every row has an equal count of unescaped pipes) and test_newline_in_cell_does_not_break_row. Both fail on the unpatched converter and pass after the change. The existing module test vectors still pass.
Developed with AI assistance and verified locally as described above.
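The "equal unescaped pipe count" assertion mentioned under Tests can be sketched as follows. This is an illustration of the idea only: the helper names `unescaped_pipe_count` and `rows_are_consistent` and the sample tables are assumptions, not the contents of test_csv_converter.py.

```python
import re

# A pipe counts as a column separator unless a backslash directly precedes it.
# Simplification: an escaped backslash followed by a pipe is also treated as escaped.
_UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def unescaped_pipe_count(line: str) -> int:
    """Count the pipes in a table row that act as column separators."""
    return len(_UNESCAPED_PIPE.findall(line))


def rows_are_consistent(markdown_table: str) -> bool:
    """True when every non-empty row has the same number of separators."""
    counts = {
        unescaped_pipe_count(line)
        for line in markdown_table.splitlines()
        if line.strip()
    }
    return len(counts) == 1


broken = "| name | note |\n| --- | --- |\n| Alice | a|b |"
fixed = "| name | note |\n| --- | --- |\n| Alice | a\\|b |"
print(rows_are_consistent(broken), rows_are_consistent(fixed))
```

Counting separators per row, rather than comparing against a fixed expected string, keeps such a check independent of the exact padding the converter emits.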