On using my own data files for fine-tuning a BERT Q&A model? #3412

gavirapp · 2022-10-18T22:57:39Z

gavirapp
Oct 18, 2022

Sorry this might be a naive question but I can't get past this (even after looking at the relevant tutorials):

From my documents, I preprocessed them to create a haystack document store "docs" which is a list of haystack.schema.Documents. I am unable to use it to finetune a model (for example in teacher/student pair where the student is trained with my data. I could replace the files in "tutorial2" with mine, but I did many preprocessing steps on those files to get to "docs". Here is how I got the "docs"

all_docs = convert_files_to_docs(dir_path=folder)
preprocessor = PreProcessor(
clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=False,
split_by="word",
split_length=100,
split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)

My question is there a way/need to use:

!python augment_squad.py --squad_path file_path --output_path augmented_dataset.json --multiplication_factor 2 --glove_path glove.6B.300d.txt

and have "docs" to produce the augmented_dataset.json?

I am using Google Colab.

Thanks much,

-George

Answered by gavirapp

Oct 21, 2022

🙏Yes, I was trying to mix tutorials 2 and 3. Instead, I just did tutorial 3 after document_store.write_documents(docs) . Now I am able to use the TDIF Retriever and complete the Q&A. Haystack is great, I am still exploring how to use all aspects of it. Next I am going to do fine-tuning after labeling some question-answer pairs. Also do a bit more preprocessing as my documents come from different decades. Thank you so much Sara.

View full answer

ZanSara · 2022-10-19T11:27:23Z

ZanSara
Oct 19, 2022

augment_squad.py performs data augmentation on your dataset. If you believe you have already plenty of data for the training, I believe you can skip that step. However it's usually beneficial to perform it.

May I ask why you prefer to skip that step?

2 replies

gavirapp Oct 19, 2022
Author

Thank you for your prompt response. I have over 1000 documents in "docs". I was under the impression these are used for augmentation because I don't see the training step to fine-tune a model with these.

ZanSara Oct 20, 2022

Ok I think I understand it better now (please correct me if I'm wrong).

I think there's a misunderstanding here. The problem is that Documents are not the expected format for the augment_squad.py script, because Documents alone can't be used to train a model. What you need is a dataset in squad format, and to create such dataset you need Labels, which you can create using the annotation tool: https://docs.haystack.deepset.ai/docs/annotation

Let me know if this helps 🙂

gavirapp · 2022-10-20T14:15:19Z

gavirapp
Oct 20, 2022
Author

Hi Sara, thanks again for your help. Here is what I am trying to do: In the tutorial for fine-tuning with my own data, *I want to use a bunch of text files*, instead of the wiki_gameofthrone_txt1.zip file. I have the text files as "docs" obtained using https://haystack.deepset.ai/tutorials/preprocessing: all_docs = convert_files_to_docs(dir_path=doc_dir) preprocessor = PreProcessor( clean_empty_lines=True, clean_whitespace=True, clean_header_footer=False, split_by="word", split_length=100, split_respect_sentence_boundary=True, ) docs = preprocessor.process(all_docs ) *My question is there way to use this "docs"* (the same format the wiki_gameofthrone_txt1.zip is converted to?) to* "Build a QA System Without Elasticsearch?* as in https://haystack.deepset.ai/tutorials/without-elasticsearch. I can't seem to have the "Retriever" access "docs". Thanks

…

-George

On Thu, Oct 20, 2022 at 6:06 AM Sara Zan ***@***.***> wrote: Ok I think I understand it better now (please correct me if I'm wrong). I think there's a misunderstanding here. The problem is that Documents are not the expected format for the augment_squad.py script, because Documents alone can't be used to train a model. What you need is a dataset in squad format, and to create such dataset you need Labels <https://docs.haystack.deepset.ai/docs/documents_answers_labels#label>, which you can create using the annotation tool: https://docs.haystack.deepset.ai/docs/annotation Let me know if this helps 🙂 — Reply to this email directly, view it on GitHub <#3412 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACKTPMSTJQIACNLHNHAOBSTWEEKSLANCNFSM6AAAAAARIRRZBQ> . You are receiving this because you authored the thread.Message ID: ***@***.***>

1 reply

ZanSara Oct 20, 2022

You're mentioning two different tutorials performing two very different tasks. It's unclear to me, at this point, if you're trying to do fine-tuning (tutorial 2) or inference (tutorial 3).

So:

For fine-tuning (tutorial 2) it's not sufficient to have the docs. You need labels, as explained above.
For inference (tutorial 3), the docs need to be written into the document store to be read by the Retriever. This is covered in Tutorial 3 in the Preprocessing of Documents section: document_store.write_documents(docs)

If I'm still not answering to your question, can you share your non working code? That will clarify most of my doubts.

gavirapp · 2022-10-21T12:33:11Z

gavirapp
Oct 21, 2022
Author

🙏Yes, I was trying to mix tutorials 2 and 3. Instead, I just did tutorial 3 after document_store.write_documents(docs) . Now I am able to use the TDIF Retriever and complete the Q&A. Haystack is great, I am still exploring how to use all aspects of it. Next I am going to do fine-tuning after labeling some question-answer pairs. Also do a bit more preprocessing as my documents come from different decades. Thank you so much Sara.

…

-George

On Thu, Oct 20, 2022 at 11:54 AM Sara Zan ***@***.***> wrote: You're mentioning two different tutorials performing two very different tasks. It's unclear to me, at this point, if you're trying to do fine-tuning (tutorial 2) or inference (tutorial 3). So: - For fine-tuning (tutorial 2) it's not sufficient to have the docs. You need labels, as explained above. - For inference (tutorial 3), the docs need to be written into the document store to be read by the Retriever. This is covered in Tutorial 3 in the Preprocessing of Documents section: document_store.write_documents(docs) If I'm still not answering to your question, can you share your non working code? That will clarify most of my doubts. — Reply to this email directly, view it on GitHub <#3412 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACKTPMVPVVGPLF3JUFZWBILWEFTL5ANCNFSM6AAAAAARIRRZBQ> . You are receiving this because you authored the thread.Message ID: ***@***.***>

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

On using my own data files for fine-tuning a BERT Q&A model? #3412

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

On using my own data files for fine-tuning a BERT Q&A model? #3412

Uh oh!

gavirapp Oct 18, 2022

Replies: 3 comments · 3 replies

Uh oh!

ZanSara Oct 19, 2022

Uh oh!

gavirapp Oct 19, 2022 Author

Uh oh!

ZanSara Oct 20, 2022

Uh oh!

gavirapp Oct 20, 2022 Author

Uh oh!

ZanSara Oct 20, 2022

Uh oh!

gavirapp Oct 21, 2022 Author

gavirapp
Oct 18, 2022

Replies: 3 comments 3 replies

ZanSara
Oct 19, 2022

gavirapp Oct 19, 2022
Author

gavirapp
Oct 20, 2022
Author

gavirapp
Oct 21, 2022
Author