
Commit 88efbe6

init mrc project
0 parents  commit 88efbe6

24 files changed: +265354 -0 lines changed

README.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
---
language:
- vi
- vn
- en
tags:
- question-answering
- pytorch
datasets:
- squad
license: mit
pipeline_tag: question-answering
metrics:
- squad
widget:
- text: "Bình là chuyên gia về gì ?"
  context: "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
- text: "Bình được công nhận với danh hiệu gì ?"
  context: "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
---
## Model Description

- Language model: [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html)
- Fine-tune: [MRCQuestionAnswering](https://github.com/nguyenvulebinh/extractive-qa-mrc)
- Language: Vietnamese, English
- Downstream task: Extractive QA
- Dataset (combining English and Vietnamese):
  - [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/)
  - [mailong25](https://github.com/mailong25/bert-vietnamese-question-answering/tree/master/dataset)
  - [UIT-ViQuAD](https://www.aclweb.org/anthology/2020.coling-main.233/)
  - [MultiLingual Question Answering](https://github.com/facebookresearch/MLQA)

This model is intended for QA in Vietnamese, so the validation set is Vietnamese only (but English works fine). The evaluation results below use 10% of the Vietnamese dataset.

| Model | EM | F1 |
| ------------- | ------------- | ------------- |
| [base](https://huggingface.co/nguyenvulebinh/vi-mrc-base) | 76.43 | 84.16 |
| [large](https://huggingface.co/nguyenvulebinh/vi-mrc-large) | 77.32 | 85.46 |

[MRCQuestionAnswering](https://github.com/nguyenvulebinh/extractive-qa-mrc) uses [XLM-RoBERTa](https://huggingface.co/transformers/model_doc/xlmroberta.html) as the pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In my implementation, I re-combine the sub-word representations (after they are encoded by the BERT layers) into word representations using a sum strategy.
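
The snippet below is a minimal, self-contained sketch of that sum strategy, not the repository's actual code: the function name and the `words_lengths` argument are made up for illustration, assuming you already have encoder outputs and know how many sub-words each word was split into.

```python
import torch

def sum_subwords_to_words(token_embeddings, words_lengths):
    """Collapse sub-word vectors into word vectors by summing.

    token_embeddings: (seq_len, hidden_size) encoder output, special tokens removed.
    words_lengths:    number of sub-words each original word was split into
                      (the values must sum to seq_len).
    """
    word_vectors = []
    offset = 0
    for length in words_lengths:
        # sum the representations of all sub-words belonging to this word
        word_vectors.append(token_embeddings[offset:offset + length].sum(dim=0))
        offset += length
    return torch.stack(word_vectors)  # (num_words, hidden_size)

# toy usage: 3 words split into 5 sub-words, random 768-dim embeddings
emb = torch.randn(5, 768)
print(sum_subwords_to_words(emb, [1, 2, 2]).shape)  # torch.Size([3, 768])
```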
## Using pre-trained model
45+
46+
- Hugging Face pipeline style (**NOT using sum features strategy**).
47+
48+
```python
49+
from transformers import pipeline
50+
# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
51+
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
52+
nlp = pipeline('question-answering', model=model_checkpoint,
53+
tokenizer=model_checkpoint)
54+
QA_input = {
55+
'question': "Bình là chuyên gia về gì ?",
56+
'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
57+
}
58+
res = nlp(QA_input)
59+
print('pipeline: {}'.format(res))
60+
#{'score': 0.5782045125961304, 'start': 45, 'end': 68, 'answer': 'xử lý ngôn ngữ tự nhiên'}
61+
```
62+

- More accurate inference process ([**Using sum features strategy**](https://github.com/nguyenvulebinh/extractive-qa-mrc))

```python
from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
    'question': "Bình được công nhận với danh hiệu gì ?",
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

inputs = [tokenize_function(*QA_input)]
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)

print(answer)
# answer: Google Developer Expert. Score start: 0.9926977753639221, Score end: 0.9909810423851013
```
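
Going back to the pipeline variant above (the one without the sum-features strategy): the Hugging Face question-answering pipeline also accepts a list of input dicts, which is handy when asking several questions against the same context. This is a small usage sketch of standard `transformers` behaviour, not a feature specific to this model.

```python
from transformers import pipeline

model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint, tokenizer=model_checkpoint)

context = "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
questions = ["Bình là chuyên gia về gì ?", "Bình được công nhận với danh hiệu gì ?"]

# a list of {'question', 'context'} dicts yields one result dict per input
results = nlp([{'question': q, 'context': context} for q in questions])
for q, r in zip(questions, results):
    print(q, '->', r['answer'])
```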

## Model training

Some sample data files for the training process already exist in the data-bin/raw folder. Follow these steps:

- Create the environment using requirements.txt

- Clean the data

```shell
python mrc_anno_to_mrc.py
python train_valid_split.py
```

- Train the model

```shell
python main.py
```

- Test the model

```shell
python infer.py
```
16.5 MB
Binary file not shown.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
{
  "builder_name": null,
  "citation": "",
  "config_name": null,
  "dataset_size": null,
  "description": "",
  "download_checksums": null,
  "download_size": null,
  "features": {
    "context": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "question": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "answer_text": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "answer_start_idx": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    },
    "answer_word_start_idx": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    },
    "answer_word_end_idx": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": "",
  "post_processed": null,
  "post_processing_size": null,
  "size_in_bytes": null,
  "splits": null,
  "supervised_keys": null,
  "task_templates": null,
  "version": null
}
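
For illustration, this feature schema (a Hugging Face `datasets` info file) can be rebuilt in Python. The record below is hypothetical, reusing the widget example from README.md; the character offset matches the pipeline output shown there, while the word-level index convention (inclusive vs. exclusive end) is an assumption.

```python
from datasets import Dataset, Features, Value

features = Features({
    "context": Value("string"),
    "question": Value("string"),
    "answer_text": Value("string"),
    "answer_start_idx": Value("int64"),       # character offset of the answer in `context`
    "answer_word_start_idx": Value("int64"),  # word-level offsets over whitespace tokens (assumed)
    "answer_word_end_idx": Value("int64"),
})

# hypothetical record, values shown only to illustrate how the offsets relate
sample = {
    "context": ["Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên ."],
    "question": ["Bình là chuyên gia về gì ?"],
    "answer_text": ["xử lý ngôn ngữ tự nhiên"],
    "answer_start_idx": [45],
    "answer_word_start_idx": [10],
    "answer_word_end_idx": [15],
}

ds = Dataset.from_dict(sample, features=features)
print(ds[0])
```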
111 KB
Binary file not shown.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "c8c8f2fafa8655e5",
  "_format_columns": [
    "context",
    "question",
    "answer_text",
    "answer_start_idx",
    "answer_word_start_idx",
    "answer_word_end_idx"
  ],
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_indices_data_files": [
    {
      "filename": "indices.arrow"
    }
  ],
  "_output_all_columns": false,
  "_split": null
}
16.5 MB
Binary file not shown.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
{
  "builder_name": null,
  "citation": "",
  "config_name": null,
  "dataset_size": null,
  "description": "",
  "download_checksums": null,
  "download_size": null,
  "features": {
    "context": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "question": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "answer_text": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "answer_start_idx": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    },
    "answer_word_start_idx": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    },
    "answer_word_end_idx": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": "",
  "post_processed": null,
  "post_processing_size": null,
  "size_in_bytes": null,
  "splits": null,
  "supervised_keys": null,
  "task_templates": null,
  "version": null
}
12.6 KB
Binary file not shown.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
{
  "_data_files": [
    {
      "filename": "dataset.arrow"
    }
  ],
  "_fingerprint": "1e4c15e6dcc314e4",
  "_format_columns": [
    "context",
    "question",
    "answer_text",
    "answer_start_idx",
    "answer_word_start_idx",
    "answer_word_end_idx"
  ],
  "_format_kwargs": {},
  "_format_type": null,
  "_indexes": {},
  "_indices_data_files": [
    {
      "filename": "indices.arrow"
    }
  ],
  "_output_all_columns": false,
  "_split": null
}
