Replies: 1 comment 3 replies
-
Hi @Tsar06! It's quite hard to judge what's going wrong without knowing what your data looks like. Several points in your description caught my attention.
Let me ask you a few questions so I can better help you:
-
Trying to find the best model on Hugging Face and use the Haystack framework to put together a meaningful application with company data, but it seems like going beyond the sample examples in the Haystack tutorials doesn't do any good, at least the way I use it. Can someone see what's wrong?
I have a CSV list of company inventory Risks that details problems and the expected resolutions. I'd like to fine-tune some good models with this, as well as some other policy documents I have, built into QA formats like DPR or SQuAD2. Fine-tuning as below, for instance, does not help much in getting meaningful responses through pipelines.
For instance, risk IDs, which are codes/labels such as 44444-INF-2022 or 12342-CC-2021, do not get added to the fine-tuned transformer's vocabulary or picked up in the retriever's embedding index. So asking a question down the road with a pipeline like the one below really does not bring any good results...
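For context, the DPR-style training file that the retriever fine-tuning below expects is a JSON list of records, each with a question, positive contexts, and optional hard negatives. A minimal sketch of one record (the risk text and answer here are invented for illustration; the field names follow the standard DPR training format):

```python
import json

# One DPR-style training record (content invented for illustration).
record = {
    "question": "What is the resolution for risk 44444-INF-2022?",
    "answers": ["Patch the affected servers"],
    "positive_ctxs": [{
        "title": "44444-INF-2022",
        "text": "Risk 44444-INF-2022: unpatched servers. Resolution: patch the affected servers.",
    }],
    "negative_ctxs": [],
    "hard_negative_ctxs": [{
        "title": "12342-CC-2021",
        "text": "Risk 12342-CC-2021: expired certificates. Resolution: rotate certificates.",
    }],
}

# The training file is a JSON list of such records.
training_file = json.dumps([record], indent=2)
print(training_file[:40])
```

I generate one such record per CSV row when converting the risk list.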
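Since a subword tokenizer splits an unseen code like 44444-INF-2022 into fragments the encoder never learned to embed, one workaround I have been considering is to short-circuit dense retrieval with an exact-match filter whenever the query contains such an ID. A stdlib-only sketch, assuming the IDs follow the digits-letters-year pattern from my data and are kept in a meta field:

```python
import re

# Pattern assumed from IDs like 44444-INF-2022 / 12342-CC-2021.
RISK_ID = re.compile(r"\b\d{5}-[A-Z]{2,3}-\d{4}\b")

# Stand-ins for indexed documents, with the risk ID preserved in meta.
docs = [
    {"content": "Unpatched servers in the INF estate.", "meta": {"risk_id": "44444-INF-2022"}},
    {"content": "Expired certificates on the CC gateway.", "meta": {"risk_id": "12342-CC-2021"}},
]

def retrieve(query, documents):
    ids = RISK_ID.findall(query)
    if ids:  # exact-match route for queries that name a risk ID
        return [d for d in documents if d["meta"]["risk_id"] in ids]
    return documents  # here the dense retriever would take over instead

hits = retrieve("What is the status of 12342-CC-2021?", docs)
print(hits[0]["meta"]["risk_id"])
```

But I would rather have the retriever itself handle these IDs, which is why I am trying to fine-tune it as below.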
I could do something like using FAISS as a DB:
```python
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import TableTextRetriever, RouteDocuments, JoinAnswers

...
print("Create DB", sql_path)
document_store = FAISSDocumentStore(sql_url=sql_path, embedding_dim=512,
                                    faiss_index_factory_str="Flat")
...
print("Load Retriever")
retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model=f"{frootModel}/deepset/bert-small-mm_retrieval-question_encoder",
    passage_embedding_model=f"{frootModel}/deepset/bert-small-mm_retrieval-passage_encoder",
    table_embedding_model=f"{frootModel}/deepset/bert-small-mm_retrieval-table_encoder",
    embed_meta_fields=["title", "section_title"],
    max_seq_len_query=64, max_seq_len_passage=256, max_seq_len_table=256,
    top_k=10, use_gpu=True, batch_size=16,
)

# Fine-tune Retriever
print("Fine-tune Retriever")
retriever.train(
    data_dir=dir_path, train_filename="Issues.csv.json",
    max_processes=1, dev_split=0, batch_size=16,
    embed_meta_fields=["title", "section_title"],
    num_hard_negatives=0, num_positives=1, n_epochs=3,
    evaluate_every=1000, n_gpu=1,
    learning_rate=1e-5, epsilon=1e-08, weight_decay=0.0,
    num_warmup_steps=100, grad_acc_steps=8, use_amp=None,
    optimizer_name="AdamW", optimizer_correct_bias=True,
    save_dir=f"{frootModel}/deepset",
    query_encoder_save_dir="question_encoder_fine-tuned",
    passage_encoder_save_dir="passage_encoder_fine-tuned",
    table_encoder_save_dir="table_encoder_fine-tuned",
)
...
route_documents = RouteDocuments()
join_answers = JoinAnswers()
```
In the example above, records with the fine-tuned ID 19718-CC-2021 are obviously not found by the retriever.
Same thing here: PL3 should be retrieved as a value from some tables, which does not work either.
Last but not least, same thing as above: Issue Rating is a table field name, but the retriever does not return the correct records at the start of the pipeline, as 19718-CC-2021 is skipped.
I have tried to add the labels in meta fields, in the questions, and in the answers, but nothing works.
Do I need to train a specific model for this to work, and which one should I use?
Is all this made for real-life work, or just demo toys?
In addition, I have tried to build a table from a CSV with 250 rows and about 50 columns and tested some queries using the TableReader in Haystack... each query returned a somehow correct COUNT, but the calculation was VERY SLOW! I have a Dell Precision 5560 with CUDA and GPU support fully installed, and from this small test it seems like we are far from production-ready with such technology. Probably OK for small tables...
I must be doing something wrong?
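For comparison, when the question is an exact aggregate like a COUNT, computing it directly over the CSV is effectively instant, which is what makes the TableReader latency stand out. A stdlib sketch on a toy stand-in for my table (column names invented for illustration):

```python
import csv
import io

# Toy stand-in for the 250-row x 50-column CSV.
raw = """risk_id,category,status
44444-INF-2022,INF,Open
12342-CC-2021,CC,Closed
19718-CC-2021,CC,Open
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# An exact COUNT over the table needs no neural model at all.
open_count = sum(1 for r in rows if r["status"] == "Open")
print(open_count)
```

So maybe the neural TableReader is only worth the latency for fuzzy questions the table schema cannot answer directly?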