Fix unicode escaping in example generation #98

wade6716 · 2025-08-08T10:58:13Z

Description

Fixes Unicode characters being escaped in example generation due to the default behavior of json.dumps().

In the format_example_as_text function in prompting.py, json.dumps(data_dict, indent=2) is used without setting ensure_ascii=False. This causes non-ASCII characters (e.g., Chinese, Arabic, Cyrillic) to be escaped into \uXXXX sequences, which harms readability and usability in multilingual contexts.

This change adds ensure_ascii=False to preserve Unicode characters in their original form, improving clarity and supporting better internationalization for language model prompts.
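As a minimal sketch of the behavior described above, the snippet below serializes the same dictionary with and without ensure_ascii=False. The data_dict value here is a hypothetical stand-in, not the actual structure built inside format_example_as_text; only json.dumps and its indent and ensure_ascii parameters are taken from the standard library.

```python
import json

# Hypothetical stand-in for the dictionary serialized in format_example_as_text.
data_dict = {"text": "你好，今天气温如何？"}

# Default behavior: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(data_dict, indent=2)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(data_dict, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings are valid JSON and parse back to the same dictionary; the difference is only in how the text appears inside the prompt.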

Fixes #93

Example Comparison

Before (with escaping)

{
  "text": "\u4f60\u597d\uff0c\u4eca\u5929\u6c14\u6e29\u5982\u4f55\uff1f"
}

After (direct Unicode)

{
  "text": "你好，今天气温如何？"
}

✅ Characters are displayed naturally. The same improvement applies to Japanese, Korean, Arabic, Russian, and other non-ASCII scripts.

How Has This Been Tested?

Manually tested with multilingual inputs including Chinese, Japanese, Korean, Arabic, and Russian.
Confirmed that Unicode characters are no longer escaped in the output.
Ran all required checks locally:
```
./autoformat.sh
pylint --rcfile=.pylintrc langextract tests
pytest tests
```
All formatting, linting, and tests passed.

Checklist

I have signed the Google Contributor License Agreement (CLA)
I have read and followed the HAI-DEF Community Guidelines
My code follows the project's style guide (formatted with ./autoformat.sh)
All tests pass (pylint and pytest)
This PR contains a single, focused change (fixes one specific issue)
I have referenced the related issue (Closes #93)
No infrastructure, build, or documentation files were modified

Additional Notes

This is a small, non-breaking change that improves user experience for multilingual developers.
The change is backward-compatible: ASCII-only use cases remain unaffected.
Enhances prompt clarity for LLMs by preserving natural text representation.
No sensitive data or privacy concerns involved.
Contribution is free of malicious code and made in good faith.
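
The backward-compatibility note above can be checked with a short script. This is a sketch under assumptions: the sample dictionaries and the helper name serialize are illustrative and do not come from prompting.py.

```python
import json


def serialize(data, ensure_ascii):
    """Hypothetical helper mirroring the json.dumps call discussed in this PR."""
    return json.dumps(data, indent=2, ensure_ascii=ensure_ascii)


ascii_only = {"text": "What is the temperature today?"}
multilingual = {"text": "Какая сегодня температура?"}

# ASCII-only input: both settings produce exactly the same string.
ascii_unchanged = serialize(ascii_only, True) == serialize(ascii_only, False)

# Non-ASCII input: the strings differ, but they decode to the same data.
old_output = serialize(multilingual, True)
new_output = serialize(multilingual, False)
same_data = json.loads(old_output) == json.loads(new_output)

print(ascii_unchanged, old_output != new_output, same_data)
```

Any consumer that parses the output as JSON sees identical data either way; only code that compares the raw serialized text of non-ASCII examples would notice the change.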

google-cla · 2025-08-08T10:58:18Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

github-actions · 2025-08-08T11:16:01Z

Running live API tests... This will take a few minutes.

github-actions · 2025-08-08T11:16:44Z

✅ Live API tests passed! All endpoints are working correctly.

Commit ba034a0: Fix unicode escaping in example generation

github-actions bot added the size/XS label (pull request with less than 50 lines changed) on Aug 8, 2025

aksg87 added the ready-to-merge label (triggers live API tests for PRs from forks) on Aug 8, 2025

aksg87 approved these changes Aug 8, 2025


aksg87 merged commit 845258c into google:main Aug 8, 2025
19 of 22 checks passed

wade6716 mentioned this pull request on Aug 8, 2025 in How to ensure reliability #95 (Open)

aksg87 pushed a commit that referenced this pull request on Aug 21, 2025: Fix unicode escaping in example generation (#98), 9b549ab

sinnaj pushed a commit to sinnaj/langextract that referenced this pull request on Sep 3, 2025: Fix unicode escaping in example generation (google#98), a75d521

aksg87 added a commit that referenced this pull request on Sep 12, 2025: Fix unicode escaping in example generation (#98), a2cee30
