Firstly, thank you very much for your team's dedication and generous open-source efforts. I found that Haystack's performance on Chinese data is very poor, but I couldn't integrate a Chinese word segmentation step into Haystack because I couldn't find the code that handles tokenization. Or maybe I've picked the wrong starting point for solving the problem. I tried Haystack's "Utilizing Existing FAQs for Question Answering" tutorial, and replacing the pretrained model with a Chinese text similarity model gave good results. I also tried "Preprocessing Your Documents" by adding Chinese word segmentation functionality to preprocessor.py. Or do I need to replace the NLTK tokenization with a Chinese tokenizer? Thank you for providing ideas to solve the problem.
Replies: 1 comment 2 replies
Hey @mc112611, unfortunately Haystack is not well equipped for Chinese text processing right now. The best idea would be to implement your own PreProcessor to properly handle Chinese text. As long as you subclass BaseComponent, you will be able to use it in a Pipeline like any other Haystack node. Here is some information on how to do it: https://docs.haystack.deepset.ai/docs/custom_nodes. If later you find out that your custom PreProcessor works well and you want to contribute, we would be really happy to accept a PR for this! |
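Below is a minimal sketch of what such a custom node could look like, assuming Haystack 1.x (`farm-haystack`) and the third-party `jieba` segmentation library. The class name `ChinesePreProcessor`, the `split_length` parameter, and the chunking logic are illustrative assumptions, not part of Haystack itself.

```python
from typing import List

import jieba  # third-party Chinese word segmentation library
from haystack.nodes.base import BaseComponent
from haystack.schema import Document


class ChinesePreProcessor(BaseComponent):
    """Splits Chinese documents into word-segmented chunks."""

    outgoing_edges = 1  # required by BaseComponent

    def __init__(self, split_length: int = 200):
        super().__init__()
        self.split_length = split_length  # max tokens per resulting chunk

    def process(self, documents: List[Document]) -> List[Document]:
        processed = []
        for doc in documents:
            tokens = list(jieba.cut(doc.content))  # segment into Chinese words
            # Re-join tokens with spaces so downstream whitespace-based
            # retrievers (e.g. BM25) see word boundaries.
            for i in range(0, len(tokens), self.split_length):
                chunk = " ".join(tokens[i : i + self.split_length])
                processed.append(Document(content=chunk, meta=dict(doc.meta)))
        return processed

    def run(self, documents: List[Document]):  # called by the Pipeline
        return {"documents": self.process(documents)}, "output_1"

    def run_batch(self, documents: List[Document]):
        return self.run(documents=documents)
```

You could then plug it into an indexing pipeline like any other node, e.g. `indexing_pipeline.add_node(component=ChinesePreProcessor(), name="ChinesePreProcessor", inputs=["TextConverter"])` (the input node name depends on your pipeline).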