Replies: 3 comments 3 replies
-
May I ask why you prefer to skip that step?
2 replies
-
Hi Sara, thanks again for your help. Here is what I am trying to do: in the tutorial for fine-tuning with my own data, I want to use a bunch of text files instead of the wiki_gameofthrone_txt1.zip file. I have the text files as "docs", obtained using https://haystack.deepset.ai/tutorials/preprocessing:

from haystack.utils import convert_files_to_docs
from haystack.nodes import PreProcessor

all_docs = convert_files_to_docs(dir_path=doc_dir)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)

My question: is there a way to use this "docs" (the same format the wiki_gameofthrone_txt1.zip file is converted to) to "Build a QA System Without Elasticsearch", as in https://haystack.deepset.ai/tutorials/without-elasticsearch? I can't seem to get the Retriever to access "docs".
Thanks,
-George
On Thu, Oct 20, 2022 at 6:06 AM Sara Zan wrote:
Ok, I think I understand it better now (please correct me if I'm wrong).
I think there's a misunderstanding here. The problem is that Documents are not the expected format for the augment_squad.py script, because Documents alone can't be used to train a model. What you need is a dataset in SQuAD format, and to create such a dataset you need Labels
<https://docs.haystack.deepset.ai/docs/documents_answers_labels#label>,
which you can create using the annotation tool:
https://docs.haystack.deepset.ai/docs/annotation
Let me know if this helps 🙂
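For illustration, a SQuAD-format label file looks roughly like this (a minimal sketch: the field names follow the standard SQuAD schema, but the title, context, question, and answer content are made up):

import json

# Minimal SQuAD-format dataset (standard SQuAD v1 schema; the content is
# invented for illustration). Both augment_squad.py and reader training
# consume files shaped like this.
squad_data = {
    "version": "1.1",
    "data": [
        {
            "title": "my_document",
            "paragraphs": [
                {
                    "context": "Jon Snow was raised at Winterfell.",
                    "qas": [
                        {
                            "id": "0",
                            "question": "Where was Jon Snow raised?",
                            "answers": [{"text": "Winterfell", "answer_start": 23}],
                        }
                    ],
                }
            ],
        }
    ],
}

# Write the dataset to disk in the shape the training scripts expect.
with open("my_labels.json", "w") as f:
    json.dump(squad_data, f)

The annotation tool linked above can export labels in SQuAD format, so the JSON does not normally need to be written by hand.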
1 reply
-
🙏 Yes, I was trying to mix tutorials 2 and 3. Instead, I just did tutorial 3 after document_store.write_documents(docs). Now I am able to use the TfidfRetriever and complete the Q&A.
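For anyone who finds this later, the flow I ended up with is roughly the following (a minimal sketch assuming Haystack 1.x, as in tutorial 3; the reader model and query are just placeholders):

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TfidfRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# `docs` is the list produced by preprocessor.process(all_docs) earlier.
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

# TF-IDF retrieval needs no Elasticsearch; the reader extracts answer spans.
retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(
    query="An example question about my documents",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)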
Haystack is great; I am still exploring how to use all aspects of it. Next I am going to do fine-tuning after labeling some question-answer pairs, and also a bit more preprocessing, as my documents come from different decades.
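That fine-tuning step will presumably look something like the following (a minimal sketch assuming Haystack 1.x; "my_labels.json" is a hypothetical SQuAD-format file exported from the annotation tool):

from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
# data_dir points at the folder holding the (hypothetical) SQuAD-format
# label file exported from the annotation tool.
reader.train(
    data_dir="data",
    train_filename="my_labels.json",
    n_epochs=1,
    save_dir="my_finetuned_model",
)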
Thank you so much Sara.
-George
On Thu, Oct 20, 2022 at 11:54 AM Sara Zan wrote:
You're mentioning two different tutorials performing two very different
tasks. It's unclear to me, at this point, if you're trying to do
fine-tuning (tutorial 2) or inference (tutorial 3).
So:
- For fine-tuning (tutorial 2) it's not sufficient to have the docs.
You need labels, as explained above.
- For inference (tutorial 3), the docs need to be written into the
document store to be read by the Retriever. This is covered in Tutorial 3
in the Preprocessing of Documents section:
document_store.write_documents(docs)
If I'm still not answering your question, can you share your non-working code? That would clear up most of my doubts.
0 replies
Answer selected by
gavirapp
-
Sorry, this might be a naive question, but I can't get past this (even after looking at the relevant tutorials):
From my documents, I ran preprocessing to create "docs", a list of haystack.schema.Document objects. I am unable to use it to fine-tune a model (for example, in a teacher/student pair where the student is trained on my data). I could replace the files in tutorial 2 with mine, but I did many preprocessing steps on those files to get to "docs". Here is how I got "docs":
from haystack.utils import convert_files_to_docs
from haystack.nodes import PreProcessor

all_docs = convert_files_to_docs(dir_path=folder)
preprocessor = PreProcessor(
clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=False,
split_by="word",
split_length=100,
split_respect_sentence_boundary=True,
)
docs = preprocessor.process(all_docs)
My question: is there a way (or a need) to use
!python augment_squad.py --squad_path file_path --output_path augmented_dataset.json --multiplication_factor 2 --glove_path glove.6B.300d.txt
with "docs" to produce augmented_dataset.json?
I am using Google Colab.
Thanks much,
-George