Conversation


@JinheLin JinheLin commented Sep 14, 2025

Pull Request

Related issue

It’s very common for Chinese, numbers, and English to appear together in the same sentence. For example:

我从2025年开始学习Rust语言。 ("I started learning the Rust language in 2025.")

In this sentence, charabia would segment it as:

["我", "从", "2", "0", "2", "5", "年", "开始", "学习", "R", "u", "s", "t", "语言", "。"]

Normally, numbers like 2025 and words like Rust should not be split apart.

If every number and English word is broken down into individual digits and letters, those tokens lose their meaning in a search engine.

What does this PR do?

  • This PR prevents numbers and English words from being split during Chinese text segmentation (a sketch of the approach follows).
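
For context, a minimal sketch of the idea. The guard mirrors the early return this PR adds to cut_for_search in charabia/src/segmenter/chinese.rs; the fallback body is a placeholder standing in for the existing jieba-based segmentation, not the real code.

// Sketch: early return for purely ASCII-alphanumeric input.
fn cut_for_search(s: &str) -> Box<dyn Iterator<Item = &str> + '_> {
    // Entirely ASCII letters/digits: keep "2025" or "Rust" whole.
    if s.chars().all(|c| c.is_ascii_alphanumeric()) {
        return Box::new(std::iter::once(s));
    }
    // Naive per-character fallback for the sketch only, mimicking the
    // failure mode described above; the real code delegates to the
    // jieba dictionary segmenter instead.
    Box::new(s.split_inclusive(|_: char| true))
}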

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Summary by CodeRabbit

  • New Features

    • Tokenization now treats strings composed solely of ASCII letters and digits as a single token, reducing unwanted splits during search-oriented segmentation.
    • Users may notice faster processing and more predictable results when searching with plain English or numeric queries.
  • Tests

    • Added coverage for sentences mixing Chinese characters with ASCII alphanumeric sequences to validate tokenization behavior.

@ManyTheFish
Member

Hey @JinheLin,
Is the PR ready? Can I review it, or is there still some work left on your side? :)

@JinheLin
Author

@ManyTheFish This PR is ready. You can review it anytime.


coderabbitai bot commented Oct 9, 2025

Walkthrough

Adds an ASCII-alphanumeric fast path to cut_for_search that returns the input as a single token when it consists entirely of ASCII alphanumeric characters, and adds a unit test validating tokenization of mixed Chinese and ASCII-alphanumeric text.

Changes

  • Chinese segmenter logic (charabia/src/segmenter/chinese.rs): Added an early-return fast path in cut_for_search for inputs that are entirely ASCII alphanumeric; appended the unit test test_mix_number_and_letter to verify tokenization with mixed Chinese and ASCII-alphanumeric sequences.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant ChineseSegmenter as cut_for_search
  participant Segmentation as segmenter (fallback)

  Client->>ChineseSegmenter: cut_for_search(text)
  alt text is ASCII alphanumeric only
    ChineseSegmenter-->>Client: [single token: text]
  else mixed or contains non-ASCII
    ChineseSegmenter->>Segmentation: segment(text)
    Segmentation-->>ChineseSegmenter: tokens
    ChineseSegmenter-->>Client: tokens
  end
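
As a usage sketch (not taken from the PR): with the fast path in place, segmenting the example sentence through charabia's public Segment trait should keep the ASCII runs intact. The expected token list shown in the comment is taken from the PR description, and the example assumes the crate's chinese segmentation feature is enabled.

use charabia::Segment;

fn main() {
    let segments: Vec<&str> = "我从2025年开始学习Rust语言。".segment_str().collect();
    // Expected: "2025" and "Rust" survive as whole tokens, e.g.
    // ["我", "从", "2025", "年", "开始", "学习", "Rust", "语言", "。"]
    println!("{segments:?}");
}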

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I nibble code with twitching nose,
A quick hop where ASCII flows—
One token wrapped in a tidy shell,
汉字 split where meanings dwell.
Hooray the test: a carrot feast,
Small changes, swift, from rabbit beast. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title succinctly and accurately summarizes the primary change of preserving numbers and English words during Chinese segmentation, directly reflecting the pull request's objective to fix the splitting issue. It is clear, concise, and specific, without extraneous details or vague phrasing. The "Fix:" prefix is conventional and indicates the action taken.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5df8800 and 45cdf76.

📒 Files selected for processing (1)
  • charabia/src/segmenter/chinese.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
charabia/src/segmenter/chinese.rs (1)
charabia/src/segmenter/mod.rs (2)
  • new (150-180)
  • new (221-223)
🔇 Additional comments (2)
charabia/src/segmenter/chinese.rs (2)

24-26: LGTM! Correctly preserves ASCII alphanumeric sequences.

The fast path correctly prevents splitting of numbers and English words by returning the entire string when it consists only of ASCII alphanumeric characters. This aligns with the PR objective and integrates well with the existing segmentation flow.


341-346: LGTM! Test effectively validates the fix.

The test correctly verifies that numbers like "2025" and English words like "Rust" remain intact within Chinese text segmentation, directly addressing the issue described in the PR.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
charabia/src/segmenter/chinese.rs (2)

24-26: The fix correctly prevents splitting ASCII alphanumeric sequences.

The early return for ASCII alphanumeric strings properly addresses the issue described in the PR, ensuring that numbers like "2025" and English words like "Rust" remain intact.

Consider a minor performance optimization by using bytes() instead of chars() for ASCII-only checks:

-    if s.chars().all(|c| c.is_ascii_alphanumeric()) {
+    if s.bytes().all(|b| b.is_ascii_alphanumeric()) {
         return Box::new(std::iter::once(s));
     }

This avoids Unicode character boundary checks when we know we're only checking ASCII characters.
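
A self-contained check (not part of the PR) illustrating why the swap is behavior-preserving: a non-ASCII char is never ASCII alphanumeric, and its UTF-8 encoding uses only bytes >= 0x80, so the byte-wise and char-wise guards agree on every input.

fn main() {
    for s in ["2025", "Rust", "v2025", "我从2025年", "", "Rust!"] {
        let by_bytes = s.bytes().all(|b| b.is_ascii_alphanumeric());
        let by_chars = s.chars().all(|c| c.is_ascii_alphanumeric());
        assert_eq!(by_bytes, by_chars, "guards disagree on {s:?}");
    }
    println!("byte-wise and char-wise guards agree");
}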


341-346: Good test coverage for the fix.

The test properly validates the core fix by ensuring "2025" and "Rust" remain as single tokens in mixed Chinese-ASCII text.

Consider adding test cases for additional edge cases to strengthen coverage:

#[test]
fn test_ascii_alphanumeric_edge_cases() {
    let seg = ChineseSegmenter;
    
    // Pure numbers
    let words: Vec<&str> = seg.segment_str("我有123个苹果").collect();
    assert_eq!(words, vec!["我", "有", "123", "个", "苹果"]);
    
    // Pure English
    let words: Vec<&str> = seg.segment_str("我学习Rust").collect();
    assert_eq!(words, vec!["我", "学习", "Rust"]);
    
    // Mixed alphanumeric
    let words: Vec<&str> = seg.segment_str("版本v2025").collect();
    assert_eq!(words, vec!["版本", "v2025"]);
    
    // Multiple ASCII sequences
    let words: Vec<&str> = seg.segment_str("Python和Rust都是2024年流行的语言").collect();
    assert_eq!(words, vec!["Python", "和", "Rust", "都是", "2024", "年", "流行", "的", "语言"]);
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 170398d and 5df8800.

📒 Files selected for processing (1)
  • charabia/src/segmenter/chinese.rs (2 hunks)

@JinheLin
Author

JinheLin commented Oct 9, 2025

@ManyTheFish I have fixed the issues encountered in the unit tests. Please take a look.
