Add validateTex API for fast side-effect-free TeX syntax checking #416
Draft
OlgaRedozubova wants to merge 9 commits into
- New MathpixMarkdownModel.validateTex(latex, { display? }) returning a discriminated TexValidationResult union ({ valid: true } | { valid: false; error: TexValidationError }).
- TexValidationError extends Error with code (TexError.id) and latex (the failed input). Prototype chain restored for ES5 instanceof.
- Runs MathJax's TexParser directly on a dedicated MTeX instance (tags: 'none'), bypassing MathItem/MathDocument, post-filters and output jax — no SVG, no DOM mutation, no side-effects on the render pipeline's counter/labels/ids.
- Never throws on bad input; batch callers always get a return value.
- pr-specs/2026-05-validate-tex-api.md documents contract, invariants and design.
- 13 unit tests in tests/_validateTex.js cover return value, no-side-effect guarantees, statelessness, and cross-render isolation. Full suite passes (3491 tests).
- Bump version to 2.0.41
- Add changelog entry describing the validateTex API, TexValidationResult union, TexValidationError class, isolation guarantees, and reference to pr-specs/2026-05-validate-tex-api.md
- Add README "Validating TeX formulas" section with usage example, return-value shape, options table, and the never-throws / no-side-effects / no-SVG / stateless guarantees
- Spec, changelog, README: removed wording that framed the API around a specific downstream consumer or LaTeX-compilation failures. Now describes only what the library provides (parse-only check, no SVG, side-effect-free).
- Spec Context: corrected inaccurate claim that forLatex is "used by forDocx/forMD/forPptx"; these are independent boolean flags, not a hierarchy.
- Spec: Status flipped from Active to Implemented; removed the now-redundant trailing "Status updated to Implemented after merge" checkbox.
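The prototype-chain repair mentioned in the list above can be sketched as follows. This is a minimal illustration rather than the library's source: the constructor signature and argument order are assumptions, and only the `code` and `latex` fields and the `Object.setPrototypeOf` call come from the description.

```typescript
// Illustrative sketch; the constructor signature is assumed, not taken from the PR.
class TexValidationError extends Error {
  code: string | undefined; // TexError.id when available
  latex: string;            // the input that failed to parse

  constructor(message: string, latex: string, code?: string) {
    super(message);
    // With an ES5 compile target, subclassing Error yields an object whose
    // prototype is Error.prototype, so the chain is re-pointed explicitly.
    Object.setPrototypeOf(this, TexValidationError.prototype);
    this.name = 'TexValidationError';
    this.code = code;
    this.latex = latex;
  }
}

const sample = new TexValidationError(
  'Undefined control sequence \\nosuchmacro',
  '\\nosuchmacro',
  'UndefinedControlSequence',
);
```

With the prototype restored, `sample instanceof TexValidationError` and `sample instanceof Error` both hold, which is what lets callers branch on the class rather than on message text.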
…tests
- Error message: TexError is not an Error subclass, so the previous
err instanceof Error ? err.message : String(err) branch returned
'[object Object]' for every TeX parse error. Replace with duck-typing
on (err as any).message so the real MathJax description is preserved.
- Public API: TexValidationError and TexValidationResult are now
exported from the package root (src/index.tsx). Consumers no longer
need to deep-import lib/mathjax/index to get instanceof support and
the union type. Tests switched to the root entry so they guard the
exported surface.
- Lazy initialization: the isolated MTeX is no longer allocated in
initTex. A public getter on MathJaxConfigure creates it on first
validateTex call and wires setMmlFactory from the rendering mTex at
that point. setHandler resets the lazy slot so accessibility/nonumbers
toggles pick up the fresh mmlFactory on next access. Consumers who
never call validateTex pay zero memory cost.
- Error code extraction is defensive: in addition to instanceof TexError,
the catch reads a string .id off the thrown value as a fallback, so a
future MathJax that wraps TexError before re-throwing does not silently
drop the code.
- Tests:
- Pin concrete TeX error ids (UndefinedControlSequence, MissingArgFor)
and assert the message carries the real MathJax description
(/undefined control sequence/i, /missing argument/i) — the prior
/TeX error/i regex would have passed even on '[object Object]'.
- Add render-parity cases: a broken formula yields no <svg> from
markdownToHTML, a valid one does. Both verdicts must agree.
- Add render-parity acceptance for package-driven constructs
(\color, \textcolor, \definecolor, \ce, \boldsymbol, \cancel).
- Add empty-string and whitespace-only acceptance cases.
- Docs:
- Spec, changelog, README: clarify lazy init and zero-cost-for-non-users
guarantee; tighten statelessness wording (per-equation tag state
resets per call; packageData persists across calls — same contract
as the render path within a single parse).
- README: note that error.code is undefined for non-TexError
exceptions; add a "Notes for batch validation" section about
threading the display flag based on source context (inline vs
block); document empty/whitespace input as valid.
- Reset(): comment notes that it touches render-path tags only —
the validateTex jax owns its state and resets per call.
- README: replace ?? with || in the example so the snippet runs on
the ES5 target the library compiles to.
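To make the `'[object Object]'` fix from this commit concrete, here is a small sketch of the two extraction strategies. `FakeTexError` is a hypothetical stand-in for MathJax's `TexError` (modelled only on the statement that it is not an `Error` subclass and carries `id` and `message`), and `describeThrown` is an illustrative name, not a function from the PR.

```typescript
// Hypothetical stand-in: an object with id/message that does not extend Error.
class FakeTexError {
  id: string;
  message: string;
  constructor(id: string, message: string) {
    this.id = id;
    this.message = message;
  }
}

// The previous approach: anything that is not an Error gets stringified.
function naiveMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Duck-typed approach: read string .message and string .id when present.
function describeThrown(err: unknown): { message: string; code: string | undefined } {
  const candidate = err as { message?: unknown; id?: unknown } | null | undefined;
  const message =
    candidate != null && typeof candidate.message === 'string'
      ? candidate.message
      : String(err);
  const code =
    candidate != null && typeof candidate.id === 'string' ? candidate.id : undefined;
  return { message, code };
}

const thrown = new FakeTexError('MissingArgFor', 'Missing argument for \\frac');
```

Here `naiveMessage(thrown)` yields `'[object Object]'`, while `describeThrown(thrown)` keeps both the description and the id. A plain `RangeError` still gets its message, with `code` left undefined.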
When state.md.options.forLatex is set, paragraph_open tokens for figure
and table environments now carry meta.placement and meta.type:
- meta.placement is the exact specifier captured by the existing regex
(RE_BEGIN_TABLE_OR_FIGURE_WITH_PLACEMENT in common/consts.ts):
'h' | 'H' | 't' | 'b' | 'p' | '!h' | 'h!' | '!H' | 'H!' | '!t' | '!b' | '!p',
or undefined if the source had no bracket.
- meta.type is 'figure' or 'table'.
Previously match[2] (the captured specifier) was discarded and token.latex
was unconditionally '\\begin{<type>}[h]', so a forLatex consumer could not
tell whether the user wrote [t], [!h], an explicit [h], or nothing at all.
The change is purely additive on meta — token.latex still emits
'\\begin{<type>}[h]' so existing forLatex serializers and snapshot tests
are byte-identical.
StatePushPatagraphOpenTable gains an optional placement parameter; both
InlineBlockBeginTable and BeginTable thread match[2] through to it.
The meta merge uses { ...(token.meta ?? {}), type, placement } — defensive
against any future code that may set meta between state.push and this
assignment.
Tests cover explicit [t]/[!h]/[H]/[b], no-bracket source for both figure
and table, whitespace between env name and bracket, and the back-compat
invariant that token.latex remains '\\begin{<type>}[h]'.
Spec: pr-specs/2026-05-figure-placement-bracket.md
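As an illustration of what the additive `meta` enables, the sketch below shows how a forLatex consumer might rebuild the opening line from `meta` instead of the fixed `token.latex`. The `ParagraphOpenLike` shape and the `beginLine` helper are hypothetical; only `meta.type`, `meta.placement`, and `token.latex` come from the commit message.

```typescript
// Hypothetical token shape, reduced to the fields this commit talks about.
interface ParagraphOpenLike {
  type: 'paragraph_open';
  latex: string; // still '\\begin{<type>}[h]' for back-compat
  meta: { type: 'figure' | 'table'; placement?: string } | null;
}

// Consumer-side choice (an assumption): emit the author's specifier when one
// was captured, and no bracket at all when the source had none.
function beginLine(token: ParagraphOpenLike): string {
  if (token.meta === null) {
    return token.latex;
  }
  const { type, placement } = token.meta;
  return placement === undefined
    ? '\\begin{' + type + '}'
    : '\\begin{' + type + '}[' + placement + ']';
}

const explicit: ParagraphOpenLike = {
  type: 'paragraph_open',
  latex: '\\begin{figure}[h]',
  meta: { type: 'figure', placement: '!h' },
};
const bare: ParagraphOpenLike = {
  type: 'paragraph_open',
  latex: '\\begin{table}[h]',
  meta: { type: 'table' },
};
```

Serializers that keep reading `token.latex` are unaffected; only consumers that opt into `meta` see the distinction between `[!h]` and no bracket.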
…ed tests, JSDoc
Public typing:
- Export FigureTableType ('figure' | 'table'), FigureTablePlacement (literal
union of the 15 captured specifiers: 'h' | 'H' | 't' | 'b' | 'p' | '!h' |
'h!' | '!H' | 'H!' | '!t' | 't!' | '!b' | 'b!' | '!p' | 'p!') and
FigureTableOpenMeta ({ type: FigureTableType; placement?: FigureTablePlacement })
from the package root. Type-only symbols use `export type` so consumers
with isolatedModules/verbatimModuleSyntax do not get spurious imports.
- StatePushPatagraphOpenTable now takes a typed placement?: FigureTablePlacement
and type: FigureTableType; both call sites cast match[1]/[2] accordingly.
- JSDoc on MathJax.ValidateTex, TexValidationError, FigureTablePlacement,
FigureTableOpenMeta — surfaced in consumer IntelliSense. JSDoc notes that
FigureTablePlacement captures single specifiers only; multi-char combinations
like [htbp] are not captured.
- TexValidationError.code JSDoc documents that 'InternalError' signals a parser
crash (RangeError, etc.), not "invalid formula" — caller may want a different
fallback strategy.
Regex coverage:
- RE_BEGIN_TABLE_OR_FIGURE_WITH_PLACEMENT extended to also capture the
symmetric post-bang variants `t!`, `b!`, `p!` (previously only the
`!`-prefixed forms `!t`, `!b`, `!p` were captured); now 15 specifiers.
Cleaner meta contract:
- The `placement` key is omitted from token.meta when the source carried no
bracket (or carried an invalid/empty bracket). Consumers iterating via
Object.entries(meta) see only { type } in the no-bracket case; the key is
written only when a recognized specifier was captured.
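A minimal sketch of the sparse-key merge described above, assuming a plain-object `meta`. `mergeOpenMeta` is an illustrative name; the spread over `existing ?? {}` follows the style quoted in the earlier commit message.

```typescript
type FigureTableType = 'figure' | 'table';

// Merge type (always) and placement (only when captured) into existing meta.
function mergeOpenMeta(
  existing: Record<string, unknown> | null | undefined,
  type: FigureTableType,
  placement?: string,
): Record<string, unknown> {
  const meta: Record<string, unknown> = { ...(existing ?? {}), type };
  if (placement !== undefined) {
    // The key is written only for a recognized specifier, so
    // `'placement' in meta` is false for no-bracket sources.
    meta.placement = placement;
  }
  return meta;
}
```

Writing `{ type, placement }` unconditionally would instead leave a `placement: undefined` entry that shows up when iterating the object's keys, which is the case the contract rules out.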
Error-message and code contract:
- 'TeX error: ' prefix dropped from TexValidationError.message in both
TexError and fallback branches. Consumers rely on `instanceof
TexValidationError` and `.code` for branching; the message is the raw
MathJax description.
- Non-TexError exceptions are wrapped with `code: 'InternalError'` so batch
callers can filter "parser crashed" from "formula is invalid".
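The point of the `'InternalError'` code is easiest to see from the caller's side. The sketch below partitions already-collected failures; the `ValidationFailure` shape and `triage` helper are hypothetical, and only the `'InternalError'` literal and the idea of separating crashes from invalid formulas come from the text.

```typescript
// Hypothetical record a batch caller might keep per failed formula.
interface ValidationFailure {
  latex: string;
  code: string | undefined;
  message: string;
}

// Split "formula is invalid" from "parser crashed".
function triage(failures: ValidationFailure[]): {
  invalid: ValidationFailure[];
  crashed: ValidationFailure[];
} {
  const invalid: ValidationFailure[] = [];
  const crashed: ValidationFailure[] = [];
  for (const failure of failures) {
    if (failure.code === 'InternalError') {
      crashed.push(failure);
    } else {
      invalid.push(failure);
    }
  }
  return { invalid, crashed };
}
```

A caller might report `invalid` entries back to authors and route `crashed` entries to logging or a retry path, since those say nothing about the formula itself.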
Expanded test coverage:
- _validateTex.js: pin concrete TexError ids that appear in the public docs —
UndefinedControlSequence (\nosuchmacro), MissingArgFor (\frac{1}),
UnknownEnv (\begin{nosuchenv}), ExtraLeftMissingRight (\left( x). Message
bodies are asserted against the real MathJax descriptions
(/undefined control sequence/i, /missing argument/i etc.) — pins the API
against the '[object Object]' regression that exists when instanceof Error
is used to gate String(err).
- _validateTex.js: 8 edge-MML parity cases (sums, stretchy delimiters, nested
over, stackrel, binom, sqrt[3], int, matrix) — pin the spec's
"post-filters never throw on parse-valid input" claim by asserting that
validateTex(s).valid === true AND markdownToHTML('$' + s + '$') contains <svg.
- _validateTex.js: cold-start smoke after texReset + contract assertion that
error.code on a real failure is one of the documented values.
- _validateTex.js: three tests pinning the packageData contract — \newcommand
registered in one validateTex call is visible to the next (with namespaced
macro names that cannot collide with other tests); validateTex does not see
macros registered by markdownToHTML; markdownToHTML does not see macros
registered by validateTex.
- _validateTex.js: mock-based test on the InternalError branch — patches
TexParser.prototype.mml to throw a non-TexError and asserts the wrapped
result carries code === 'InternalError' and the raw message.
- _figure-placement.js: 5 invalid-bracket cases ([x], [], [tt], [ht], [ ])
asserting 'placement' in meta === false; 15 parameterized cases over the
full specifier list asserting meta.placement equals the source value; one
case explicitly asserts placement key absence on no-bracket sources.
- _figure-placement.js: parseTokens and findFirstParagraphOpen factored to
file-level helpers (was duplicated across three describe blocks).
- Replaced should.exist property-access bug with chai.expect(...).to.exist.
Comment hygiene:
- Brief version-pinning note on the { inputData: {} } stub passed to
tags.startEquation ("verified against mathjax-full 3.2.2").
- One-liner on `isInner: false` clarifying it matches the render path's
top-level math context.
- tags.reset(0) comment describes the invariant it guards (a future MathJax
writing to all*Labels outside finishEquation) rather than referencing the
spec file.
- Parity tests carry a one-line comment explaining why <svg> presence is the
precise success signal in this project (MTeX.formatError throws instead of
producing a merror node, so failed renders emit an empty span with no merror).
Docs:
- README: new "Notes for batch validation" snippet
(MathpixMarkdownModel.validateTex(s, { display: srcIsInline ? false : true }));
Guarantees section now correctly states that validateTex owns a separate
parseOptions from the render input jax (packageData does not flow in either
direction), and that within validateTex itself packageData persists across
calls; error.code documented with real ids and the 'InternalError' fallback.
- Spec 2026-05-validate-tex-api.md: test-list expanded with edge-MML,
render-parity, package-driven, persistence/isolation, and concrete-code
categories; test counts updated; 'BadMath' (non-existent in MathJax) removed.
- Spec 2026-05-figure-placement-bracket.md: meta-init snippet synced to the
spread style used in code; "Sparse meta.placement" constraint added; new
exported symbols documented; test breakdown enumerated (8 + 15 + 5 + 1);
Non-Goals updated to clarify multi-char placements like [htbp] are not
captured.
- Changelog: meta.placement description corrected ("key absent" instead of
"=== undefined"); 'BadMath' replaced with real codes; statelessness wording
split to reflect that validate and render are isolated by separate
parseOptions; explicit note on the token.meta truthiness change (null →
object) for forLatex consumers iterating token.meta; new exported types
enumerated; regex extension documented.
… expose resetValidateTex
Regex and contract:
- RE_BEGIN_TABLE_OR_FIGURE_WITH_PLACEMENT extended to also capture t!, b!,
p! (the existing pattern captured only the !-prefixed forms). Behavior
change for sources containing [t!]/[b!]/[p!]: previously the bracket
failed to match, fell through to the no-bracket regex, and the literal
text leaked into the environment's content. Now the bracket is consumed
and meta.placement carries it. This is a parser-fidelity fix.
- Verified no test fixture in tests/ contains [t!], [b!], or [p!] (the
grep result is recorded in the spec as evidence of zero snapshot drift).
- Spec and changelog rewritten to describe the regex extension and the
resulting behavior change; the earlier "Existing regex unchanged"
wording was inconsistent with the diff and has been replaced.
- Specifier counts synced (12 → 15) across spec, tests, and changelog.
resetValidateTex API:
- New MathpixMarkdownModel.resetValidateTex() (and the underlying
MathJax.ResetValidateTex / MathJaxConfigure.resetValidateTex) drops the
isolated validator MTeX instance so the next call rebuilds it with
empty packageData. Useful for long-lived processes that want to bound
memory or clear user-defined macros accumulated across calls (the
validator's parseOptions.packageData is otherwise not cleared between
calls).
- setHandler now resets the lazy _validateTex slot at the end of the
method (after the new mathjax.document calls), not before, so there
is no transient window where mTex.mmlFactory is undefined.
- README Guarantees section references the new method; a Security note
warns batch consumers of untrusted TeX to call resetValidateTex()
between batches to avoid \newcommand redefinition leaking forward.
- Test pins the contract: a \newcommand registered in one validateTex
call is visible to the next; after resetValidateTex() it is not.
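The lazy slot and its reset can be pictured with a small generic holder. This is a sketch under assumptions: `LazySlot` and `ValidatorLike` are hypothetical names, and a `Map` of macros stands in for the validator's `packageData`; the real code lives on `MathJaxConfigure`.

```typescript
// Generic lazily-built slot: nothing is allocated until the first get().
class LazySlot<T> {
  private instance: T | undefined;
  private count = 0;
  private readonly build: () => T;

  constructor(build: () => T) {
    this.build = build;
  }

  get(): T {
    if (this.instance === undefined) {
      this.instance = this.build();
      this.count += 1;
    }
    return this.instance;
  }

  // Drop the instance; the next get() rebuilds it from scratch.
  reset(): void {
    this.instance = undefined;
  }

  buildCount(): number {
    return this.count;
  }
}

// Stand-in for the isolated validator: macros accumulate across calls.
interface ValidatorLike {
  macros: Map<string, string>;
}

const validatorSlot = new LazySlot<ValidatorLike>(() => ({
  macros: new Map<string, string>(),
}));
```

Callers that never touch `validatorSlot.get()` never trigger the build, and calling `reset()` between batches discards any accumulated macros, mirroring the contract the test pins.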
Test coverage and isolation:
- _validateTex.js: parameterized render-parity sweep over 11 inputs
(5 invalid + 6 valid). Each case asserts validateTex.valid agrees with
the presence/absence of <svg in markdownToHTML output; invalid cases
additionally pin the exact MathJax error code (MissingArgFor /
ExtraLeftMissingRight / EnvMissingEnd / UndefinedControlSequence) — if
MathJax renames any of these the test fails with a precise diff
instead of silently passing. Package-driven \ce{H2O} and \color{red}{x}
are part of the parity sweep so they exercise both validate and render
paths in lock-step.
- _validateTex.js: contract tests for the codes named in the public docs
(UnknownEnv via \begin{nosuchenv}, ExtraLeftMissingRight via \left( x).
- _validateTex.js: display:false behavior tests are kept in a
honestly-titled describe block — they document that in the current
MathJax config (tags:'none'), display-only constructs are still
accepted in inline mode; the option exists for forward-compat.
- All describe blocks now call MM.resetValidateTex() alongside
MM.texReset() in beforeEach, so validator packageData cannot leak
across tests via accidentally-overlapping macro names.
Code hygiene:
- StatePushPatagraphOpenTable: type-narrowing now uses `placement !==
undefined` (more precise intent than the previous truthy check, since
the regex guarantees a non-empty string but a future regex change with
an optional inner group could permit an empty match).
- begin-table.ts: extract `match[1].trim()` to a single local variable
before reusing in the FigureTableType narrowing — removes a double
trim call at each of the two parser entry points.
- mathjax/index.ts: tags.reset(0) comment now accurately describes that
it clears allLabels/allIds/allCounter and that this is a no-op today
because finishEquation is bypassed (the previous comment claimed
protection against future MathJax writing to all*Labels — replaced
with a precise statement of what the call actually does).
Docs and changelog:
- 'BadMath' (non-existent in MathJax) removed from README, changelog,
JSDoc, and spec. Replaced with real codes that MathJax actually emits.
- README error.code documentation updated with realistic ids and the
'InternalError' fallback for non-TexError exceptions.
- Changelog gets a dedicated "Breaking changes for forLatex consumers"
section at the top of the 2.0.41 entry covering (a) the
paragraph_open.meta truthiness change for figure/table and (b) the
[t!]/[b!]/[p!] consumption change. Subsections for figure-placement
and validateTex follow.
Bug fixes:
- ValidateTex(latex, null) no longer throws on destructuring (uses
`options ?? {}`). Honors the documented "never throws" guarantee for
JS callers that pass null from a config field.
- New 'InvalidInput' code for non-string `latex` arguments (null,
undefined, number, object, ...). The previous behavior surfaced these
as 'InternalError', conflating caller bugs with parser crashes;
batch consumers can now distinguish "I passed the wrong type" from
"MathJax itself misbehaved".
- TexError fallback: error.code defaults to 'TexError' if a TexError
instance ever lacks a string .id (today MathJax always sets one).
Keeps `code` non-undefined for the TexError branch.
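A sketch of the argument guard these fixes describe, separated from any parsing. `normalizeArgs` and `GuardResult` are illustrative names, and the message wording is invented for the example; the `'InvalidInput'` code and the `options ?? {}` fallback come from the commit message.

```typescript
type GuardResult =
  | { ok: true; latex: string; display: boolean | undefined }
  | { ok: false; code: 'InvalidInput'; message: string };

// Validate the call shape before any parsing happens, so nothing can throw.
function normalizeArgs(
  latex: unknown,
  options?: { display?: boolean } | null,
): GuardResult {
  if (typeof latex !== 'string') {
    const got = latex === null ? 'null' : typeof latex;
    return { ok: false, code: 'InvalidInput', message: 'Expected a string, got ' + got };
  }
  // `options ?? {}` tolerates both undefined and an explicit null.
  const opts: { display?: boolean } = options ?? {};
  return { ok: true, latex, display: opts.display };
}
```

Keeping this check ahead of the parser is what lets batch callers tell a wrong argument type apart from `'InternalError'`.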
New API surface:
- ValidateTex(latex, { isolated: true }) drops accumulated `packageData`
before this call. Useful for validating untrusted user-supplied TeX
without polluting subsequent calls. README documents the per-call
allocation cost (~100-300 KB) and recommends one resetValidateTex()
per batch over per-call isolated:true for high-volume use.
Behavior-change documentation:
- changelog "Breaking changes" → "Behavior changes for forLatex
consumers" with explicit notes: meta change is additive (only
consumers using `meta === null` need to migrate); [t!]/[b!]/[p!] is
a parser-fidelity fix. Version stays 2.0.41 (patch).
- Reset() jsdoc no longer claims the validator "resets per call" —
packageData persists; use ResetValidateTex to drop the instance.
- README example uses console.log instead of console.error (invalid
input is an expected return value, not a runtime error).
- README "Notes for batch validation" merged into one paragraph;
Options table gains the new `isolated` entry.
Code accuracy and typing:
- Narrowed type for the startEquation stub: a local interface captures
the only field MathJax actually reads (`inputData.recompile`), replacing
the raw `{ inputData: {} } as any` cast.
- Single source of truth for placement specifiers: consts.ts exports
FIGURE_TABLE_PLACEMENTS (15-literal const array), derives both
FigureTablePlacement (typeof array[number]) and
RE_BEGIN_TABLE_OR_FIGURE_WITH_PLACEMENT from it, plus a
toFigureTablePlacement() runtime guard. Removed redundant `\!`
escaping (`!` is not a regex metacharacter).
- StatePushPatagraphOpenTable uses `placement !== undefined` (precise
intent) and a typed `FigureTableType` for `type`.
- validateTex getter comment captures the mmlFactory invariant.
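The single-source-of-truth arrangement for placement specifiers might look roughly like this. The array contents match the 15 specifiers listed earlier, but the regex shape is an assumption (the real `RE_BEGIN_TABLE_OR_FIGURE_WITH_PLACEMENT` is not quoted in the PR), so it carries a different name here.

```typescript
const FIGURE_TABLE_PLACEMENTS = [
  'h', 'H', 't', 'b', 'p',
  '!h', 'h!', '!H', 'H!', '!t', 't!', '!b', 'b!', '!p', 'p!',
] as const;

// The literal union is derived from the array, so the two cannot drift.
type FigureTablePlacement = (typeof FIGURE_TABLE_PLACEMENTS)[number];

// Runtime guard: narrows an arbitrary captured string to the union.
function toFigureTablePlacement(value: string | undefined): FigureTablePlacement | undefined {
  if (value === undefined) {
    return undefined;
  }
  const known: readonly string[] = FIGURE_TABLE_PLACEMENTS;
  return known.indexOf(value) !== -1 ? (value as FigureTablePlacement) : undefined;
}

// Hypothetical regex built from the same list; `!` needs no escaping.
const RE_BEGIN_WITH_PLACEMENT_SKETCH = new RegExp(
  '^\\\\begin\\s*\\{(figure|table)\\}\\s*\\[(' + FIGURE_TABLE_PLACEMENTS.join('|') + ')\\]',
);
```

Adding a specifier then means editing one array: the type, the guard, and the pattern all follow. Multi-character combinations such as `[htbp]` stay unmatched because they are not in the list.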
Test isolation and coverage:
- Every describe block in tests/_validateTex.js calls
MM.resetValidateTex() alongside MM.texReset() in beforeEach so the
validator's packageData cannot leak across tests.
- The TexParser.prototype.mml mock test has both try/finally and an
after() guard — restoration survives mid-test crashes.
- Specifier sweep in tests/_figure-placement.js now covers both env
types (figure × table, 30 cases — was figure-only, 15).
- Drift detector: TexError carries a .id field; a fresh `new TeX({...})`
from mathjax-full directly exposes parseOptions.clear,
parseOptions.tags.reset, parseOptions.tags.startEquation. The earlier
attempt reached into a non-existent file-private singleton
(`MJ._mj.validateTex`) so the assertions never ran — now the test
exercises mathjax-full directly and will fail loudly on a renaming.
- Binary verdict parity with MJ.TexConvert(throwError=true): both
paths flag failure on three known-broken inputs. Strict `.id` parity
is stronger than the spec — MTeX.formatError wraps TexError in a
plain Error in the render path, losing .id; only the binary signal
is reliable cross-path.
- New tests: null/undefined options don't throw; null/undefined/number
latex → 'InvalidInput'; isolated:true forgets earlier macros;
without forLatex `paragraph_open.meta` stays null.
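The restore-on-crash pattern used for the prototype mock can be sketched generically. `FakeParser` is a hypothetical stand-in for the patched class and `withPatchedMethod` is an illustrative helper; the actual test patches `TexParser.prototype.mml` and adds an `after()` guard on top of the `try/finally`.

```typescript
// Hypothetical class standing in for the one whose prototype gets patched.
class FakeParser {
  mml(): string {
    return '<math/>';
  }
}

// Temporarily replace target[key], run body, and always restore the original.
function withPatchedMethod<T extends object, K extends keyof T>(
  target: T,
  key: K,
  replacement: T[K],
  body: () => void,
): void {
  const original = target[key];
  target[key] = replacement;
  try {
    body();
  } finally {
    // Runs even when body() throws, so later tests see the real method.
    target[key] = original;
  }
}
```

Because the restore sits in `finally`, an assertion failure inside `body` still propagates to the test runner while the prototype is put back first.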
Style:
- Unused `let should = chai.should()` collapsed to `chai.should()`.
The second-argument regex used `[^}]*`, which couldn't match a `}` and
so truncated specs like `p{11cm}`, `m{2cm}`, or `>{\centering}p{2cm}`
at the first inner `}`. The trailing `}` plus the actual `{content}`
braces then leaked into the cell as literal text — observable in
markdownToHTML output as e.g. `<td>}{TEXT}</td>` or
`<td>p{2cm}}{TEXT}</td>`. forLatex consumers reading
`multi.mc.alignSpec` saw a truncated value (`p{11cm` without the
closing brace), making round-trip to LaTeX impossible.
Replaced `[^}]*` for the second arg with `(?:[^{}]|\{[^{}]*\})*`,
which accepts one level of nested braces — sufficient for the common
LaTeX patterns. Tests cover `p{11cm}`, `m{2cm}`, plain `c`, and a
real-world non-ASCII regression case.
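The difference between the two patterns is easy to reproduce in isolation. In the sketch below the surrounding command is left out and only a single braced argument is matched from the start of a string; `readBracedArg` and the anchored wrappers around the two character classes are illustrative, not the library's actual expressions.

```typescript
// Old character class: stops at the first '}' it sees.
const OLD_ARG = /^\{([^}]*)\}/;
// New pattern: allows one level of nested {...} inside the argument.
const NEW_ARG = /^\{((?:[^{}]|\{[^{}]*\})*)\}/;

// Return the captured argument and whatever text follows it.
function readBracedArg(
  source: string,
  pattern: RegExp,
): { arg: string; rest: string } | null {
  const match = pattern.exec(source);
  if (match === null) {
    return null;
  }
  return { arg: match[1], rest: source.slice(match[0].length) };
}

const spec = '{p{11cm}}{TEXT}';
const before = readBracedArg(spec, OLD_ARG); // arg 'p{11cm', rest '}{TEXT}'
const after = readBracedArg(spec, NEW_ARG);  // arg 'p{11cm}', rest '{TEXT}'
```

The leftover `}{TEXT}` in the old result is the text that previously leaked into the cell. Two levels of nesting, such as `{a{b{c}}}`, still do not match the new pattern, consistent with the "one level" wording.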
Changelog: new subsection under [2.0.41] documenting the fix and the
before/after HTML shapes.
[v2.0.41] Add `validateTex` API for fast side-effect-free TeX syntax checking
Summary
Adds a new opt-in `MathpixMarkdownModel.validateTex(latex, { display? })` method that checks whether a TeX expression parses, without producing SVG output and without touching the rendering pipeline's equation counter, labels, or ids.
This PR provides the validation primitive; how to use it is up to the consumer.
Motivation
`forLatex: true` skips MathJax conversion entirely (mdPluginRaw.ts:174-191), so tokens carry raw markup. The trade-off: no syntactic check on formula contents in that mode, since MathJax never runs.
Reusing the existing `MathJax.TexConvert` path for validation is not acceptable because (a) it runs the full SVG output jax (wasted work for a validity check) and (b) it commits state to the shared `parseOptions.tags` (`allCounter`, `allLabels`, `allIds`), which would corrupt the next render.
What's in the PR
Public API
Usage:
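The original snippet did not survive on this page, so the following is only a hedged reconstruction of how the documented shape might be consumed. The type aliases are written locally from the description, `report` is a hypothetical helper, and `stubValidate` is a stand-in so the example runs on its own; in real use the first argument would be `MathpixMarkdownModel.validateTex`.

```typescript
// Local types mirroring the documented return shape (assumed, not imported).
interface TexValidationErrorLike extends Error {
  code?: string;
  latex: string;
}
type TexValidationResult =
  | { valid: true }
  | { valid: false; error: TexValidationErrorLike };
type ValidateTex = (latex: string, options?: { display?: boolean }) => TexValidationResult;

// Branch on the discriminant; `error` is only reachable when valid is false.
function report(validateTex: ValidateTex, latex: string): string {
  const result = validateTex(latex, { display: true });
  if (result.valid) {
    return 'ok: ' + latex;
  }
  // `||` rather than `??`, matching the README's ES5-friendly example.
  return 'invalid (' + (result.error.code || 'unknown') + '): ' + result.error.message;
}

// Stand-in validator so the sketch is self-contained; not the real parser.
const stubValidate: ValidateTex = (latex) =>
  latex.indexOf('\\nosuchmacro') === -1
    ? { valid: true }
    : {
        valid: false,
        error: Object.assign(new Error('Undefined control sequence \\nosuchmacro'), {
          code: 'UndefinedControlSequence',
          latex,
        }),
      };
```

Since invalid input is an expected return value rather than an exception, the caller needs no `try/catch` around the call.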
Implementation
- Dedicated `MTeX` instance (mathjax.ts) with `tags: 'none'`. Shares `MmlFactory` with the rendering input jax (stateless, safe to share). No `MathDocument` needed.
- Direct `TexParser` invocation (mathjax/index.ts) bypasses `MathItem`/`MathDocument`, output jax, the math-node wrapping step, `finishEquation`, and the six post-filter tree walks (`cleanSubSup`, `setInherited`, `moveLimits`, `cleanStretchy`, `cleanAttributes`, `combineRelations`). Only the parser runs.
- `parseOptions.clear()` + `tags.reset(0)` run at the start of each call. Repeated identical inputs always produce the same result; no "duplicate label" leakage.
- `TexError` is wrapped in `TexValidationError` with `code` set; unexpected non-`TexError` exceptions (rare; would indicate a MathJax internal bug) are also wrapped with `code` unset. Batch callers processing thousands of formulas always get a return value.
- The discriminated union lets callers read `error` in the `!result.valid` branch without an extra guard.
- `Object.setPrototypeOf` in the `TexValidationError` constructor restores the prototype chain (required because the TypeScript target is ES5, where subclassing `Error` breaks `instanceof`).
Guarantees (covered by unit tests)
- `getLastEquationNumber()` is unchanged after `validateTex` calls (valid auto-numbered equations, invalid formulas, batches of 10+ calls).
- `\label{eq:a}` + `\eqref{eq:a}` produces identical HTML whether or not the same formula was validated beforehand.
- `markdownToHTML` calls produce the same output whether or not `validateTex` is invoked between them.
- Two `validateTex` calls with the same `\label{...}` both succeed.
Performance characteristics
- Consumers who do not call `validateTex`: zero runtime cost. One-time init cost is one extra `MTeX` instance, ~100-300 KB (shares `MmlFactory` and configuration packages with the rendering input jax).
- Consumers who call `validateTex` per formula: adds TeX-parsing cost to a previously-cheap path. Per-formula cost is dominated by formula complexity:
  - Simple formulas (`\frac{1}{2}`): ~50-200 µs
  - Complex formulas (`align` blocks, nested matrices): ~1-5 ms
Files
Source
- `src/mathjax/mathjax.ts`: new `validateTex: TeX<any, any, any>` instance, `MmlFactory` wiring
- `src/mathjax/index.ts`: `TexValidationError` class, `TexValidationResult` type, `MathJax.ValidateTex` function
- `src/mathpix-markdown-model/index.ts`: public method exposure (`validateTex = MathJax.ValidateTex`)
Tests
- `tests/_validateTex.js`: 13 unit tests covering return value, side-effect invariants, statelessness, cross-render isolation
Docs
- `pr-specs/2026-05-validate-tex-api.md`: design spec (goal, non-goals, architecture, edge cases, risk/rollback)
- `doc/changelog.md`: new entry for v2.0.41
- `README.md`: new "Validating TeX formulas" section with example, return-value shape, options table, and guarantees
Release
- `package.json` / `package-lock.json`: version bump to 2.0.41
Test plan
- `npm test` reports 3491 passing (3478 existing + 13 new)
- Error details are readable on the failure branch (`result.error.code`, `result.error.latex`)
- `instanceof TexValidationError` works in both the test environment and in a downstream TypeScript consumer
Risk / rollback
Risk: Low
- Nothing in the library calls `validateTex` automatically; consumers who don't use it experience zero behavior change.
Risk areas to watch
- The MathJax internals used (`TexParser`, `ParseOptions.clear`, `Tags.reset`, `Tags.startEquation`) are not public API. A future MathJax upgrade may change signatures. Integration is documented in the spec so the breakage point is locatable.
- The `{ inputData: {} } as any` stub passed to `tags.startEquation` is sufficient today because only `math.inputData.recompile` is read (Tags.js:197-201). May need updating if MathJax tightens the `MathItem` shape used by `startEquation`.
Rollback: revert PR. No data migrations, no API contracts broken.