Firstly, thank you very much for your team's dedication and generous open-source efforts. I found that Haystack's performance on Chinese data is very poor, but I couldn't integrate a Chinese word segmentation step into Haystack because I couldn't find the code that handles tokenization. Or maybe I've picked the wrong starting point for solving the problem. I tried Haystack's "Utilizing Existing FAQs for Question Answering" tutorial, and replacing the pretrained model with a Chinese text similarity model gave good results. I also tried "Preprocessing Your Documents" by adding Chinese word segmentation functionality to preprocessor.py. Or do I need to replace the NLTK tokenization with a Chinese tokenizer? Thank you for providing ideas to solve the problem.
Replies: 1 comment 2 replies
Hey @mc112611, unfortunately Haystack is not well equipped for Chinese text processing right now. The best idea would be to implement your own PreProcessor to properly handle Chinese text. As long as you subclass BaseComponent, you will be able to use it in a Pipeline like any other Haystack node. Here is some information on how to do it: https://docs.haystack.deepset.ai/docs/custom_nodes. If later you find out that your custom PreProcessor works well and you want to contribute, we would be really happy to accept a PR for this! |
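Below is a minimal sketch of what such a custom node could look like, assuming Haystack 1.x (`farm-haystack`) and the third-party `jieba` segmentation library. The class name `ChinesePreProcessor`, the `split_length` parameter, and the chunking logic are illustrative assumptions, not part of Haystack itself.

```python
from typing import List

import jieba  # third-party Chinese word segmentation library
from haystack.nodes.base import BaseComponent
from haystack.schema import Document


class ChinesePreProcessor(BaseComponent):
    """Splits Chinese documents into word-segmented chunks."""

    outgoing_edges = 1  # required by BaseComponent

    def __init__(self, split_length: int = 200):
        super().__init__()
        self.split_length = split_length  # max tokens per resulting chunk

    def process(self, documents: List[Document]) -> List[Document]:
        processed = []
        for doc in documents:
            tokens = list(jieba.cut(doc.content))  # segment into Chinese words
            # Re-join tokens with spaces so downstream whitespace-based
            # retrievers (e.g. BM25) see word boundaries.
            for i in range(0, len(tokens), self.split_length):
                chunk = " ".join(tokens[i : i + self.split_length])
                processed.append(Document(content=chunk, meta=dict(doc.meta)))
        return processed

    def run(self, documents: List[Document]):  # called by the Pipeline
        return {"documents": self.process(documents)}, "output_1"

    def run_batch(self, documents: List[Document]):
        return self.run(documents=documents)
```

You could then plug it into an indexing pipeline like any other node, e.g. `indexing_pipeline.add_node(component=ChinesePreProcessor(), name="ChinesePreProcessor", inputs=["TextConverter"])` (the input node name depends on your pipeline).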