Chinese word segmentation is a notoriously hard problem, for humans and machines alike.
Peking University has released PKUSeg, an open-source, Python-based Chinese word segmentation toolkit.
According to its authors, the toolkit's segmentation accuracy substantially exceeds that of two major competitors, THULAC and jieba.
In addition, PKUSeg supports domain-specific segmentation and can be trained on new annotated data.
Accuracy Comparison
In this comparison, PKUSeg faces two opponents:
One is THULAC from Tsinghua University; the other is jieba, which bills itself as the "best Chinese word segmentation component". Both are mainstream segmentation tools today.
The test environment is Linux, and the test datasets are MSRA (news data) and CTB8 (mixed news and web text).
The results are as follows:
The evaluation uses the official scoring script from the Second International Chinese Word Segmentation Bakeoff (SIGHAN 2005).
In terms of both F-score and error rate, PKUSeg is significantly better than the other two.
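For readers unfamiliar with the metric: segmentation F-score can be illustrated in a few lines of Python by converting each segmentation into character-offset spans and counting the spans that match the gold standard. This is a simplified sketch of the metric only, not the official Bakeoff scoring script, and the word lists below are made-up examples:

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_words, pred_words):
    """Precision, recall, and F-score over matching word spans."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f = 2 * p * r / (p + r)
    return p, r, f

# Hypothetical example: the prediction wrongly splits 天安门 into 天安 + 门
gold = ['我', '爱', '北京', '天安门']
pred = ['我', '爱', '北京', '天安', '门']
p, r, f = prf(gold, pred)
print(p, r, f)
```

One wrong split costs both precision (an extra wrong word) and recall (a missed gold word), which is why F-score is the standard headline number for segmenters.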
Usage
Pre-trained Models
PKUSeg provides three pre-trained models trained on different types of datasets.
The first one is a model trained on MSRA (news corpus):
https://pan.baidu.com/s/1twci0QVBeWXUg06dK47tiA
The second one is a model trained on CTB8 (a mixed corpus of news and web text):
https://pan.baidu.com/s/1DCjDOxB0HD2NmP9w1jm8MA
The third one is a model trained on Weibo (web text corpus):
https://pan.baidu.com/s/1QHoK2ahpZnNmX6X7Y9iCgQ
You can load whichever model fits your needs.
You can also train a new model on your own annotated data.
Code Examples:
# Code Example 1: Using the default model and default dictionary for word segmentation
import pkuseg
seg = pkuseg.pkuseg() # Load the model with default configuration
text = seg.cut('我爱北京天安门') # Perform word segmentation
print(text)
# Code Example 2: Setting a user-defined dictionary
import pkuseg
lexicon = ['北京大学', '北京天安门'] # Words in the user dictionary that should not be segmented
seg = pkuseg.pkuseg(user_dict=lexicon) # Load the model and provide the user dictionary
text = seg.cut('我爱北京天安门') # Perform word segmentation
print(text)
# Code Example 3: Loading a downloaded pre-trained model
import pkuseg
seg = pkuseg.pkuseg(model_name='./ctb8') # Load a downloaded model by passing its directory as model_name (here, the ctb8 model in './ctb8')
text = seg.cut('我爱北京天安门') # Perform word segmentation
print(text)
If you want to train a new model yourself:
# Code Example 4: Training a new model
import pkuseg
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20) # Train the model with 'msr_training.utf8' as the training file and 'msr_test_gold.utf8' as the testing file. Save the model in the './models' directory and use 20 threads for training.
For more detailed usage, please visit the link at the end of the text.
Go ahead and give it a try
PKUSeg was developed by three authors: Ruixuan Luo, Jingjing Xu, and Xu Sun.
The toolkit builds on an ACL paper co-authored by two of them.
With such high accuracy, why not give it a try?
GitHub link:
https://github.com/lancopku/PKUSeg-python