wade6716 (Contributor) commented Aug 8, 2025

Description

Fixes Unicode characters being escaped in example generation due to the default behavior of json.dumps().

In the format_example_as_text function in prompting.py, json.dumps(data_dict, indent=2) is called without ensure_ascii=False, so the default (ensure_ascii=True) applies. As a result, every non-ASCII character (e.g., Chinese, Arabic, Cyrillic) is escaped into a \uXXXX sequence, which makes the generated examples hard to read in multilingual contexts.

This change adds ensure_ascii=False to preserve Unicode characters in their original form, improving clarity and supporting better internationalization for language model prompts.

Fixes #93

Example Comparison

Before (with escaping)

{
  "text": "\u4f60\u597d\uff0c\u4eca\u5929\u6c14\u6e29\u5982\u4f55\uff1f"
}

After (direct Unicode)

{
  "text": "你好,今天气温如何?"
}

✅ Characters are displayed naturally. Same improvement applies to Japanese, Korean, Arabic, Russian, etc.
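
The comparison above can be reproduced with the standard library alone. This is a minimal sketch, not the code from prompting.py: the data_dict below is a hypothetical stand-in for the dictionary that format_example_as_text serializes.

```python
import json

# Hypothetical stand-in for the dictionary serialized in format_example_as_text.
data_dict = {"text": "你好,今天气温如何?"}

# Default behavior (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data_dict, indent=2)

# With ensure_ascii=False the characters are written as-is.
preserved = json.dumps(data_dict, indent=2, ensure_ascii=False)

print(escaped)
print(preserved)

# Both strings decode to the same data; only the textual form differs.
assert json.loads(escaped) == json.loads(preserved) == data_dict
```

Because both forms parse back to the same object, the flag changes only how the example reads inside a prompt, not the data it carries.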

How Has This Been Tested?

  • Manually tested with multilingual inputs including Chinese, Japanese, Korean, Arabic, and Russian.
  • Confirmed that Unicode characters are no longer escaped in the output.
  • Ran all required checks locally:
    ./autoformat.sh
    pylint --rcfile=.pylintrc langextract tests
    pytest tests
    All formatting, linting, and tests passed.
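
The manual check described above could also be captured as a small regression test. The sketch below is an assumption about how such a test might look, not a test from this PR: dump_example is a hypothetical helper that mirrors the changed json.dumps call, and the sample strings are illustrative greetings.

```python
import json


def dump_example(data_dict: dict) -> str:
    """Hypothetical stand-in mirroring the json.dumps call after this change."""
    return json.dumps(data_dict, indent=2, ensure_ascii=False)


# Illustrative samples for the languages named in the manual test.
SAMPLES = {
    "Chinese": "你好",
    "Japanese": "こんにちは",
    "Korean": "안녕하세요",
    "Arabic": "مرحبا",
    "Russian": "Привет",
}


def test_non_ascii_is_not_escaped():
    for language, text in SAMPLES.items():
        rendered = dump_example({"text": text})
        # The original characters appear verbatim in the output.
        assert text in rendered, language
        # No \uXXXX escape sequences remain.
        assert "\\u" not in rendered, language
        # The output is still valid JSON that round-trips to the input.
        assert json.loads(rendered) == {"text": text}, language


# Run directly so the sketch works without a test runner.
test_non_ascii_is_not_escaped()
```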

Checklist

  • I have signed the Google Contributor License Agreement (CLA)
  • I have read and followed the HAI-DEF Community Guidelines
  • My code follows the project's style guide (formatted with ./autoformat.sh)
  • All tests pass (pylint and pytest)
  • This PR contains a single, focused change (fixes one specific issue)
  • I have referenced the related issue (Closes #93)
  • No infrastructure, build, or documentation files were modified

Additional Notes

  • This is a small, non-breaking change that improves user experience for multilingual developers.
  • The change is backward-compatible: ASCII-only use cases remain unaffected.
  • Enhances prompt clarity for LLMs by preserving natural text representation.
  • No sensitive data or privacy concerns involved.
  • Contribution is free of malicious code and made in good faith.
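
The backward-compatibility note above can be checked directly. This is a small sketch with made-up sample data: for ASCII-only input the two settings produce byte-identical output, and characters that JSON requires to be escaped (quotes, newlines) are still escaped with ensure_ascii=False.

```python
import json

# Illustrative ASCII-only payload (not taken from the project).
ascii_only = {"text": "What is the temperature today?", "count": 3}

default_output = json.dumps(ascii_only, indent=2)
unicode_output = json.dumps(ascii_only, indent=2, ensure_ascii=False)

# ASCII-only input serializes identically with either setting.
assert default_output == unicode_output

# Structural escaping is unaffected: quotes and newlines are still escaped.
special = json.dumps({"text": 'line1\n"quoted"'}, ensure_ascii=False)
assert special == '{"text": "line1\\n\\"quoted\\""}'
```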

google-cla bot commented Aug 8, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

github-actions bot added the size/XS label (Pull request with less than 50 lines changed) Aug 8, 2025
aksg87 added the ready-to-merge label (Triggers live API tests for PRs from forks) Aug 8, 2025

github-actions bot commented Aug 8, 2025

Running live API tests... This will take a few minutes.

github-actions bot commented Aug 8, 2025

✅ Live API tests passed! All endpoints are working correctly.

aksg87 merged commit 845258c into google:main Aug 8, 2025
19 of 22 checks passed
aksg87 pushed a commit that referenced this pull request Aug 21, 2025
sinnaj pushed a commit to sinnaj/langextract that referenced this pull request Sep 3, 2025
aksg87 added a commit that referenced this pull request Sep 12, 2025
Development

Successfully merging this pull request may close these issues.

Title: Unicode characters being escaped in example generation due to json.dumps() default behavior
