Fix LiteralUTF8Char lowering for non-ASCII UTF-8 chars #207

Open
tmdeveloper007 wants to merge 1 commit into arxlang:main from tmdeveloper007:ISSUE-205
Conversation

tmdeveloper007 (Contributor) commented Mar 7, 2026

Pull Request description

This PR fixes a Unicode lowering bug in LiteralUTF8Char.

When lowering astx.LiteralUTF8Char, the code correctly computed the UTF-8 byte length, but initialized the backing global constant using ASCII encoding. That caused translation to fail for valid non-ASCII characters such as é with a UnicodeEncodeError.
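The mismatch described above can be reproduced with plain Python strings. This is a minimal sketch that does not use the project's code; the variable names are chosen here for illustration only. It shows that the UTF-8 byte length of é can be computed without trouble, while encoding the same character as ASCII raises UnicodeEncodeError.

```python
# Illustrative only: plain Python, not the project's lowering code.
ch = "é"

# Computing the UTF-8 byte length works for any valid character.
utf8_bytes = ch.encode("utf-8")
byte_len = len(utf8_bytes)

# Encoding the same character as ASCII is what fails for non-ASCII input.
try:
    ch.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(utf8_bytes, byte_len, ascii_failed)
```

Initializing storage from `utf8_bytes` keeps the contents consistent with the length that was already being computed.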

Changes made:

  • use UTF-8 bytes when initializing LiteralUTF8Char storage
  • add a regression test covering a multibyte UTF-8 char literal

Addresses #208

How to test these changes

  • run pytest tests/test_string.py -q -k "utf8_char_non_ascii_translate"
  • confirm the test passes
  • optionally verify that lowering a module containing astx.LiteralUTF8Char("é") no longer raises UnicodeEncodeError

Pull Request checklists

This PR is a:

  • bug-fix
  • new feature
  • maintenance

About this PR:

  • it includes tests.
  • the tests are executed on CI.
  • the tests generate log file(s) (path).
  • pre-commit hooks were executed locally.
  • this PR requires a project documentation update.

Author's checklist:

  • I have reviewed the changes and they contain no misspellings.
  • The code is well commented, especially in the parts that contain more
    complexity.
  • New and old tests passed locally.

Additional information

  • Validation run locally:
    • pytest tests/test_string.py -q -k "utf8_char_non_ascii_translate"
    • ruff check src/irx/builders/llvmliteir.py tests/test_string.py
  • I kept the change minimal and limited it to the UTF-8 char literal lowering path and one regression test.

Reviewer's checklist

Copy and paste this template for your review's note:

## Reviewer's Checklist

- [ ] I managed to reproduce the problem locally from the `main` branch
- [ ] I managed to test the new changes locally
- [ ] I confirm that the issues mentioned were fixed/resolved.

yuvimittal (Member) commented

@tmdeveloper007, the tests currently only check that LLVM IR generation doesn't crash, not that the UTF-8 character lowering is correct.
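One possible way to strengthen such a test is to check that the generated IR actually contains the character's UTF-8 bytes. The sketch below assumes the test can obtain the generated IR as text; `escape_for_ir` and `ir_contains_utf8_char` are hypothetical helpers, not part of irx, astx, or llvmlite, and the sample IR string is made up for the example. Bytes outside the ASCII range appear in IR string constants as backslash-hex escapes, and the hex digit case can differ between printers, so the comparison is case-insensitive.

```python
def escape_for_ir(data: bytes) -> str:
    """Hex-escape every byte, e.g. b"\xc3\xa9" becomes the text \\c3\\a9.

    Hypothetical helper. Only meaningful for non-ASCII characters, since
    printable ASCII bytes are normally printed literally in IR text.
    """
    return "".join(f"\\{b:02x}" for b in data)


def ir_contains_utf8_char(ir_text: str, ch: str) -> bool:
    """Return True if the escaped UTF-8 bytes of ch appear in the IR text."""
    needle = escape_for_ir(ch.encode("utf-8"))
    # Lowercase the IR so that \C3\A9 and \c3\a9 both match.
    return needle in ir_text.lower()


# Made-up IR fragments for illustration, not output from the project.
good_ir = '@"chr" = internal constant [3 x i8] c"\\C3\\A9\\00"'
bad_ir = '@"chr" = internal constant [2 x i8] c"?\\00"'

print(ir_contains_utf8_char(good_ir, "é"))
print(ir_contains_utf8_char(bad_ir, "é"))
```

A check like this would fail if the lowering stored anything other than the UTF-8 bytes, which a crash-only test would not detect.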

char_literal = astx.LiteralUTF8Char(expected)

decl_tmp = astx.VariableDeclaration(
name="tmp", type_=astx.String(), value=char_literal
Member commented:

What is the use of tmp here?

2 participants