Fix: Prevent splitting of numbers and English words in Chinese text segmentation #354
base: main
Conversation
Hey @JinheLin,
@ManyTheFish This PR is ready. You can review it anytime.
Walkthrough
Adds an ASCII-alphanumeric fast path to cut_for_search that returns a single token when the input is entirely ASCII alphanumeric, and includes a unit test validating mixed Chinese and ASCII-alphanumeric tokenization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ChineseSegmenter as cut_for_search
    participant Segmentation as segmenter (fallback)
    Client->>ChineseSegmenter: cut_for_search(text)
    alt text is ASCII alphanumeric only
        ChineseSegmenter-->>Client: [single token: text]
    else mixed or contains non-ASCII
        ChineseSegmenter->>Segmentation: segment(text)
        Segmentation-->>ChineseSegmenter: tokens
        ChineseSegmenter-->>Client: tokens
    end
```
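The control flow in the diagram can be sketched as a standalone simplification (the function name mirrors charabia's cut_for_search, but the fallback here is a per-character stub standing in for the real segmenter):

```rust
// Simplified sketch of the fast path: if the whole input is ASCII
// alphanumeric, return it as a single token; otherwise fall back to
// the regular segmenter (stubbed here as a per-character split).
fn cut_for_search_sketch(s: &str) -> Vec<String> {
    if !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric()) {
        // Fast path: inputs like "2025" or "Rust" stay intact.
        return vec![s.to_string()];
    }
    // Fallback stub standing in for the real CJK segmenter.
    s.chars().map(|c| c.to_string()).collect()
}

fn main() {
    assert_eq!(cut_for_search_sketch("2025"), vec!["2025"]);
    assert_eq!(cut_for_search_sketch("Rust"), vec!["Rust"]);
    println!("{:?}", cut_for_search_sketch("Rust"));
}
```

The real implementation hands mixed input to the underlying segmenter rather than splitting per character; the sketch only shows where the early return sits.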
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ❌ 1 warning, ✅ 2 passed
Review details: CodeRabbit UI, review profile CHILL, plan Pro
Files selected for processing (1): charabia/src/segmenter/chinese.rs
Actionable comments posted: 0
🧹 Nitpick comments (2)
charabia/src/segmenter/chinese.rs (2)
Lines 24-26: The fix correctly prevents splitting ASCII alphanumeric sequences. The early return for ASCII alphanumeric strings properly addresses the issue described in the PR, ensuring that numbers like "2025" and English words like "Rust" remain intact.
Consider a minor performance optimization by using bytes() instead of chars() for ASCII-only checks:

```diff
-    if s.chars().all(|c| c.is_ascii_alphanumeric()) {
+    if s.bytes().all(|b| b.is_ascii_alphanumeric()) {
         return Box::new(std::iter::once(s));
     }
```

This avoids Unicode character boundary checks when we know we're only checking ASCII characters.
Lines 341-346: Good test coverage for the fix. The test properly validates the core fix by ensuring "2025" and "Rust" remain as single tokens in mixed Chinese-ASCII text.
Consider adding test cases for additional edge cases to strengthen coverage:

```rust
#[test]
fn test_ascii_alphanumeric_edge_cases() {
    let seg = ChineseSegmenter;

    // Pure numbers
    let words: Vec<&str> = seg.segment_str("我有123个苹果").collect();
    assert_eq!(words, vec!["我", "有", "123", "个", "苹果"]);

    // Pure English
    let words: Vec<&str> = seg.segment_str("我学习Rust").collect();
    assert_eq!(words, vec!["我", "学习", "Rust"]);

    // Mixed alphanumeric
    let words: Vec<&str> = seg.segment_str("版本v2025").collect();
    assert_eq!(words, vec!["版本", "v2025"]);

    // Multiple ASCII sequences
    let words: Vec<&str> = seg.segment_str("Python和Rust都是2024年流行的语言").collect();
    assert_eq!(words, vec!["Python", "和", "Rust", "都是", "2024", "年", "流行", "的", "语言"]);
}
```
@ManyTheFish I have fixed the issues encountered in the unit tests. Please take a look.
Branch updated from 5df8800 to 45cdf76.
Pull Request
Related issue
It’s very common for Chinese, numbers, and English to appear together in the same sentence. For example:

我从2025年开始学习Rust语言。

In this sentence, charabia would segment it as:
["我", "从", "2", "0", "2", "5", "年", "开始", "学习", "R", "u", "s", "t", "语言", "。"]
Numbers like 2025 and words like Rust should normally not be split apart. If every number and English word is broken down into individual digits and letters, they lose their meaning in a search engine.
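To illustrate the desired behavior independently of charabia's internals, here is a minimal sketch (the function name is ours, not the crate's) that keeps runs of ASCII-alphanumeric characters together while leaving the other runs for the CJK segmenter to split further:

```rust
// Minimal sketch: split a mixed string into runs of ASCII-alphanumeric
// characters vs. everything else. A real CJK segmenter would further
// segment the non-ASCII runs; the point is that "2025" and "Rust"
// come out as whole units.
fn split_ascii_runs(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut prev_ascii: Option<bool> = None;
    for (i, c) in text.char_indices() {
        let ascii = c.is_ascii_alphanumeric();
        // Close the current run whenever the character class flips.
        if prev_ascii.is_some() && prev_ascii != Some(ascii) {
            out.push(&text[start..i]);
            start = i;
        }
        prev_ascii = Some(ascii);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

fn main() {
    let runs = split_ascii_runs("我从2025年开始学习Rust语言。");
    assert_eq!(runs, vec!["我从", "2025", "年开始学习", "Rust", "语言。"]);
    println!("{runs:?}");
}
```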
What does this PR do?
PR checklist
Please check if your PR fulfills the following requirements:
Thank you so much for contributing to Meilisearch!
Summary by CodeRabbit

New Features: inputs that are entirely ASCII alphanumeric are now kept as a single token during Chinese segmentation.
Tests: added a unit test validating mixed Chinese and ASCII-alphanumeric tokenization.