Fix: Prevent splitting of numbers and English words in Chinese text segmentation #354
base: main
Conversation
Hey @JinheLin,
@ManyTheFish This PR is ready. You can review it anytime.
Walkthrough
Adds an ASCII-alphanumeric fast path to cut_for_search that returns a single token when the input is entirely ASCII alphanumeric, and includes a unit test validating mixed Chinese and ASCII-alphanumeric tokenization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ChineseSegmenter as cut_for_search
    participant Segmentation as segmenter (fallback)
    Client->>ChineseSegmenter: cut_for_search(text)
    alt text is ASCII alphanumeric only
        ChineseSegmenter-->>Client: [single token: text]
    else mixed or contains non-ASCII
        ChineseSegmenter->>Segmentation: segment(text)
        Segmentation-->>ChineseSegmenter: tokens
        ChineseSegmenter-->>Client: tokens
    end
```
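The control flow in the diagram can be sketched as a standalone simplification (the function name mirrors charabia's cut_for_search, but the fallback here is a per-character stub standing in for the real segmenter):

```rust
// Simplified sketch of the fast path: if the whole input is ASCII
// alphanumeric, return it as a single token; otherwise fall back to
// the regular segmenter (stubbed here as a per-character split).
fn cut_for_search_sketch(s: &str) -> Vec<String> {
    if !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric()) {
        // Fast path: inputs like "2025" or "Rust" stay intact.
        return vec![s.to_string()];
    }
    // Fallback stub standing in for the real CJK segmenter.
    s.chars().map(|c| c.to_string()).collect()
}

fn main() {
    assert_eq!(cut_for_search_sketch("2025"), vec!["2025"]);
    assert_eq!(cut_for_search_sketch("Rust"), vec!["Rust"]);
    println!("{:?}", cut_for_search_sketch("Rust"));
}
```

The real implementation hands mixed input to the underlying segmenter rather than splitting per character; the sketch only shows where the early return sits.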
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ❌ 1 warning, ✅ 2 passed
Review details: CodeRabbit UI, review profile CHILL, plan Pro
Files selected for processing (1): charabia/src/segmenter/chinese.rs
Actionable comments posted: 0
🧹 Nitpick comments (2)
charabia/src/segmenter/chinese.rs (2)
Lines 24-26: The fix correctly prevents splitting ASCII alphanumeric sequences. The early return for ASCII alphanumeric strings properly addresses the issue described in the PR, ensuring that numbers like "2025" and English words like "Rust" remain intact.
Consider a minor performance optimization by using bytes() instead of chars() for ASCII-only checks:

```diff
-    if s.chars().all(|c| c.is_ascii_alphanumeric()) {
+    if s.bytes().all(|b| b.is_ascii_alphanumeric()) {
         return Box::new(std::iter::once(s));
     }
```

This avoids Unicode character boundary checks when we know we're only checking ASCII characters.
Lines 341-346: Good test coverage for the fix. The test properly validates the core fix by ensuring "2025" and "Rust" remain as single tokens in mixed Chinese-ASCII text.
Consider adding test cases for additional edge cases to strengthen coverage:

```rust
#[test]
fn test_ascii_alphanumeric_edge_cases() {
    let seg = ChineseSegmenter;

    // Pure numbers
    let words: Vec<&str> = seg.segment_str("我有123个苹果").collect();
    assert_eq!(words, vec!["我", "有", "123", "个", "苹果"]);

    // Pure English
    let words: Vec<&str> = seg.segment_str("我学习Rust").collect();
    assert_eq!(words, vec!["我", "学习", "Rust"]);

    // Mixed alphanumeric
    let words: Vec<&str> = seg.segment_str("版本v2025").collect();
    assert_eq!(words, vec!["版本", "v2025"]);

    // Multiple ASCII sequences
    let words: Vec<&str> = seg.segment_str("Python和Rust都是2024年流行的语言").collect();
    assert_eq!(words, vec!["Python", "和", "Rust", "都是", "2024", "年", "流行", "的", "语言"]);
}
```
@ManyTheFish I have fixed the issues encountered in the unit tests. Please take a look.
Branch updated from 5df8800 to 45cdf76.
Pull Request
Related issue
It’s very common for Chinese, numbers, and English to appear together in the same sentence. For example:

我从2025年开始学习Rust语言。

In this sentence, charabia would segment it as:
["我", "从", "2", "0", "2", "5", "年", "开始", "学习", "R", "u", "s", "t", "语言", "。"]
Numbers like 2025 and words like Rust should normally not be split apart. If every number and English word is broken down into individual digits and letters, they lose their meaning in a search engine.
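To illustrate the desired behavior independently of charabia's internals, here is a minimal sketch (the function name is ours, not the crate's) that keeps runs of ASCII-alphanumeric characters together while leaving the other runs for the CJK segmenter to split further:

```rust
// Minimal sketch: split a mixed string into runs of ASCII-alphanumeric
// characters vs. everything else. A real CJK segmenter would further
// segment the non-ASCII runs; the point is that "2025" and "Rust"
// come out as whole units.
fn split_ascii_runs(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev_ascii: Option<bool> = None;
    for (i, c) in text.char_indices() {
        let ascii = c.is_ascii_alphanumeric();
        // Close the current run whenever the character class flips.
        if prev_ascii.is_some() && prev_ascii != Some(ascii) {
            out.push(&text[start..i]);
            start = i;
        }
        prev_ascii = Some(ascii);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

fn main() {
    let runs = split_ascii_runs("我从2025年开始学习Rust语言。");
    assert_eq!(runs, vec!["我从", "2025", "年开始学习", "Rust", "语言。"]);
    println!("{runs:?}");
}
```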
What does this PR do?
PR checklist
Please check if your PR fulfills the following requirements:
Thank you so much for contributing to Meilisearch!
Summary by CodeRabbit

New Features: inputs that are entirely ASCII alphanumeric are now kept as a single token during Chinese segmentation.
Tests: added a unit test validating mixed Chinese and ASCII-alphanumeric tokenization.