A BERT-based Chinese Word Segmentation Model for Traditional Chinese (zh_TW)

A Transformer-based Chinese word segmentation model trained on Traditional Chinese (zh_TW) data

This is NOT a sophisticated NLP study of Chinese word segmentation. Instead, this project aims to provide a handy, easy-to-use, yet powerful deep-learning-based Chinese word segmentation model. Many GitHub projects for Chinese word segmentation have been published at top CL/NLP conferences; although they perform at the state-of-the-art level, none of them are as easy to use as jieba, which can be installed with a simple pip command.

This project is for the case when you just need a handy but powerful, state-of-the-art Chinese word segmentation model. It can be installed with a single pip command; the large pretrained Transformer model is downloaded automatically, and everything is ready within a couple of seconds.

Installation

!pip install git+https://github.com/hhhuang/ChineseWordSegmenter.git

The large model will be automatically downloaded the first time the segmenter is used.

Usage

from chinese_word_segmenter import ChineseWordSegmenter
cws = ChineseWordSegmenter()
cws.tokenize("法國總統馬克宏已到現場勘災,初步傳出火警可能與目前聖母院的維修工程有關。")

Sample output:

['法國', '總統', '馬克宏', '已', '到', '現場', '勘災', ',', '初步', '傳出', '火警', '可能', '與', '目前', '聖母院', '的', '維修', '工程', '有關', '。']

Note that the max_seq_length of this model is 128. To handle longer inputs, the input is automatically split into clauses using the punctuation marks 。,!?:; as delimiters.
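
The splitting behavior can be approximated externally as in the sketch below; the regular expression and the helper function are illustrative assumptions, not the library's internal implementation.

import re

def tokenize_long(cws, text):
    # Split the input into clauses after each of the delimiters 。,!?:;
    # (the delimiter stays attached to its clause), then tokenize each
    # clause and concatenate the results.
    # Splitting on a zero-width lookbehind requires Python 3.7+.
    clauses = [c for c in re.split(r"(?<=[。,!?:;])", text) if c]
    tokens = []
    for clause in clauses:
        tokens.extend(cws.tokenize(clause))
    return tokens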

Technical Information

This model is built on the Transformer text encoder BERT and fine-tuned on a Traditional Chinese word segmentation corpus. The training corpus is the CKIP corpus from Academia Sinica, Taiwan, released for the Second International Chinese Word Segmentation Bakeoff at the 4th SIGHAN Workshop (2005). The implementation is based on simpletransformers.NERModel with the LMRS scheme, in which every Chinese character is labeled as Leftmost, Middle, Rightmost, or Single. The batch size is 32, and the number of epochs is 5.
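
As a rough illustration, a fine-tuning setup along these lines could be written with simpletransformers as below; the base checkpoint name and the exact label strings are assumptions for the sketch, not the project's actual training script.

from simpletransformers.ner import NERModel

# LMRS tagging: each character is Leftmost, Middle, Rightmost, or Single.
labels = ["L", "M", "R", "S"]

model = NERModel(
    "bert",
    "bert-base-chinese",   # assumed Traditional/Simplified-capable base checkpoint
    labels=labels,
    args={
        "max_seq_length": 128,
        "train_batch_size": 32,
        "num_train_epochs": 5,
    },
)

# train_df would hold one row per character with the columns
# sentence_id / words / labels expected by NERModel.train_model:
# model.train_model(train_df)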

Evaluation

Test data: the gold test set of the CKIP Academia Sinica corpus (zh_TW) used in the Second International Chinese Word Segmentation Bakeoff at the 4th SIGHAN Workshop (2005).

Model       Precision   Recall   F-score
jieba-tw    0.8779      0.8953   0.8865
Our Model   0.9624      0.9705   0.9664
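
For reference, word segmentation precision, recall, and F-score are conventionally computed over word spans; a minimal sketch of that computation (not the bakeoff's official scorer) might look like this.

def to_spans(tokens):
    # Convert a token list into a set of (start, end) character spans.
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return set(spans)

def prf(gold_tokens, pred_tokens):
    # Precision/recall over exactly matching spans, F-score as their harmonic mean.
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score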
