Description
Hello, I am trying to train a PURE model on a Korean entity-relation extraction dataset with the pre-trained KoBERT (Korean BERT; the model is available on Hugging Face). In my Korean dataset, the start and end positions of entities are given at the character level. (For example, in English, for the sentence “I am a student”, the starting index of the “student” entity is 8.)
Question 1) Can I feed the dataset to the model with character-level indexing like this? If that is not possible, can I use my dataset as training data if I split the Korean sentences on spaces (' ') and recalculate the indices at the token level?
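To make the second option concrete, here is a minimal sketch of the recalculation. It assumes 0-based character offsets with an exclusive end (under that convention "student" starts at 7; with 1-based offsets it starts at 8) and produces inclusive token indices. The function name `char_span_to_token_span` is hypothetical and not part of PURE.

```python
import re

def char_span_to_token_span(sentence, char_start, char_end):
    """Map a 0-based [char_start, char_end) character span to
    inclusive indices over whitespace-separated tokens."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", sentence):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    # Keep every token that overlaps the character span.
    covered = [i for i, (s, e) in enumerate(offsets)
               if s < char_end and e > char_start]
    if not covered:
        raise ValueError("span does not overlap any token")
    return tokens, covered[0], covered[-1]

tokens, start, end = char_span_to_token_span("I am a student", 7, 14)
print(tokens, start, end)  # ['I', 'am', 'a', 'student'] 3 3
```

Note that a span ending inside a whitespace token is widened to the whole token. This matters for Korean, where particles attach directly to nouns, so an entity boundary often falls in the middle of a space-separated token.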
Additionally, I assume that the PURE model splits the input tokens into smaller subword pieces using the tokenizer of the pre-trained model. As a result, the total number of subword tokens in a sentence will be greater than the number of input tokens, so I think the start/end token positions of each entity must be recalculated.
Question 2) Does the PURE model handle this itself, i.e., does it adjust the start/end token positions of the entities as the number of tokens increases?
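The recalculation I have in mind looks roughly like the sketch below. It only illustrates the idea and does not claim to show what PURE does internally. `build_subword_maps` and `toy_tokenize` are hypothetical names; `toy_tokenize` stands in for the pre-trained model's subword tokenizer, and special tokens such as `[CLS]` (which would shift every index by one) are left out.

```python
def build_subword_maps(words, tokenize):
    """Return the subword pieces plus, for each word, the index of
    its first and last subword piece."""
    pieces, start2idx, end2idx = [], [], []
    for word in words:
        sub = tokenize(word) or ["[UNK]"]  # never leave a word empty
        start2idx.append(len(pieces))
        pieces.extend(sub)
        end2idx.append(len(pieces) - 1)
    return pieces, start2idx, end2idx

def toy_tokenize(word, size=3):
    # Stand-in tokenizer: cut each word into chunks of `size` characters.
    return [word[i:i + size] for i in range(0, len(word), size)]

words = ["I", "am", "a", "student"]
pieces, start2idx, end2idx = build_subword_maps(words, toy_tokenize)

word_start, word_end = 3, 3  # inclusive word-level span of "student"
sub_start, sub_end = start2idx[word_start], end2idx[word_end]
print(pieces)              # ['I', 'am', 'a', 'stu', 'den', 't']
print(sub_start, sub_end)  # 3 5
```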
Question 3) Are there any additional considerations or modifications I need to make when using a custom dataset and a custom pre-trained model?
Thank you for your great work. It would be really helpful if you could reply.