Testing TableTextRetriever on some data for Q&A #2395

Tsar06 · 2022-04-07T12:01:30Z

Tsar06
Apr 7, 2022

Tying to find the best model on huggingface and use of Haystack framework to put some meaningful application together with company data but seems like going away the samples examples on Haystack tutorials doesn't do any good - at least as I use it. Can someone see what's wrong?

I have a list (CSV) of inventory company Risks which details problems and resolutions expected. I'd like to fine-tune some good models with this as well as some other policy documents I have build QA DPR or SQUAD2 like formats. Trying to finetune as below for instance does not help much providing meaningful responses using pipelines.

For instance adding risks IDs for instance which are codes/labels as 44444-INF-2022 or 12342-CC-2021 are not added to finetune transformer vocabulary in the embeddings indexing of the retriever for instance. So performing a Question down the road with some pipeline as below does really not bring any good results...

I could do something like using Faiss as a db:
...
print("Create DB", sql_path)
document_store = FAISSDocumentStore(sql_url=sql_path, embedding_dim=512, faiss_index_factory_str="Flat")
...
print("Load Retriever")
retriever = TableTextRetriever(
document_store=document_store,
query_embedding_model=f"{frootModel}/deepset/bert-small-mm_retrieval-question_encoder",
passage_embedding_model=f"{frootModel}/deepset/bert-small-mm_retrieval-passage_encoder",
table_embedding_model=f"{frootModel}/deepset/bert-small-mm_retrieval-table_encoder",
embed_meta_fields=["title", "section_title"],
max_seq_len_query= 64, max_seq_len_passage= 256, max_seq_len_table= 256, top_k= 10, use_gpu= True, batch_size= 16
)
# Fine-tune Retreiver
print("Fine-tune Retriever")
retriever.train(data_dir=dir_path, train_filename="Issues.csv.json",
max_processes=1, dev_split= 0, batch_size=16, embed_meta_fields= ["title", "section_title"],
num_hard_negatives= 0, num_positives= 1, n_epochs= 3, evaluate_every= 1000, n_gpu= 1,
learning_rate= 1e-5, epsilon= 1e-08, weight_decay= 0.0, num_warmup_steps= 100, grad_acc_steps= 8, use_amp= None,
optimizer_name= "AdamW", optimizer_correct_bias= True,
save_dir= f"{frootModel}/deepset",
query_encoder_save_dir= "question_encoder_fine-tuned",
passage_encoder_save_dir= "passage_encoder_fine-tuned",
table_encoder_save_dir= "table_encoder_fine-tuned")

print("Load Fine-tuned Retriever")
retriever = TableTextRetriever(
    document_store=document_store,
    query_embedding_model=f"{frootModel}/deepset/question_encoder_fine-tuned", 
    passage_embedding_model=f"{frootModel}/deepset/passage_encoder_fine-tuned", 
    table_embedding_model=f"{frootModel}/deepset/table_encoder_fine-tuned",
    embed_meta_fields=["title", "section_title"],
    max_seq_len_query= 64, max_seq_len_passage= 256, max_seq_len_table= 256, top_k= 10, use_gpu= True, batch_size= 16
)
# Add table embeddings to the tables in DocumentStore
document_store.update_embeddings(retriever=retriever)

if update_data_store >0:
    print("Update embeddings")
    # Add table embeddings to the tables in DocumentStore
    document_store.update_embeddings(retriever=retriever)
    print("Save Faiss db")
    document_store.save(f"{dbroot}/faiss_index.dat")

...

print("Load TableReader")
table_reader = TableReader(model_name_or_path=f"{frootModel}/deepset/tapas-large-nq-hn-reader",
                        use_gpu= True, top_k= 10, top_k_per_candidate= 3, max_seq_len=256) #google/tapas-base-finetuned-wtq

print("Load Text Reader")
save_dir= f"{frootModel}/deepset-farm/roberta-base-squad2-finedtuned"

text_reader_to_train = FARMReader(f"{frootModel}/deepset-farm/roberta-base-squad2",
                                   local_files_only=True,
                                   use_gpu=True,
                                   num_processes=1,
                                   context_window_size=150, 
                                   batch_size= 50, no_ans_boost= 0, top_k= 10, top_k_per_candidate= 3, top_k_per_sample= 1, 
                                   max_seq_len= 256, doc_stride= 128, duplicate_filtering= 0, use_confidence_scores= True
                                   ) # deepset/roberta-base-squad2

print("Fine-Tune Text Reader")
text_reader_to_train.train(data_dir=dir_path,
                            train_filename="Issues.csv.json",
                            use_gpu=False, num_processes=1, # GPU crash with memory alloc error - NOT SURE WHY! 
                            batch_size= 10, n_epochs= 2, learning_rate= 0.00001, max_seq_len= None, 
                            warmup_proportion= 0.2, dev_split= 0, evaluate_every= 300, 
                            use_amp= None, checkpoint_root_dir= Path(f"{frootModel}/checkpoints"), 
                            checkpoint_every= None, checkpoints_to_keep= 1, 
                            caching= True, cache_path= Path(f"{frootModel}/cache"),
                            save_dir=save_dir)

text_reader_to_train.save(directory=save_dir)

print("reLoad Text Reader")
text_reader = FARMReader(save_dir, use_gpu=True, num_processes=1)

route_documents = RouteDocuments()
join_answers = JoinAnswers()

print("Start Pipeline")
text_table_qa_pipeline = Pipeline()
text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])

print("Start Predictions")

predictions = text_table_qa_pipeline.run(query="What is the Country of 19718-CC-2021")
print_answers(predictions, details="minimum")

Above example obviously records with fine-tuned 19718-CC-2021 are not found by the retriever.

predictions = text_table_qa_pipeline.run(query="How many PL3")
print_answers(predictions, details="minimum")

same thing here PL3 should be retreived as values from some tables which does not work either.

predictions = text_table_qa_pipeline.run(query="What is the Issue Rating for issue ID 19718-CC-2021")
print_answers(predictions, details="minimum")

last but not least above same thing Issue Rating is a table field name but retriever does not provide correct records as a start in the pipeline as 19718-CC-2021 is skipped.

I have tried to add lables as such in meta fields, in the questions, in the answers but nothing works.

Do I need to Train a specific model for this to work and which one shall I use?
Is all this made for real life work or just demo toys?

In addition, besides this I have tried to build a Table from some CSV with 250 rows of about 50 columns and tested some queries using TableReader in Haystack... each query returned somehow correct COUNT but calculation was VERY SLOW! I have a Precision DELL 5560 with all CUDA GPU installed and as a small test, it seems like we are far from production here with such technology. probably ok for small tables...

I must do something wrong?

bogdankostic · 2022-04-12T20:13:18Z

bogdankostic
Apr 12, 2022

Hi @Tsar06! It's quite hard to judge what's going wrong without knowing how your data looks like. In your description there are several points that caught my attention:

Your data seem to contain a lot of domain-specific abbreviations/IDs. The models might have problems with this because they were not pre-trained on this domain. However, they might be able to grasp these if your fine-tune dataset is large enough.
You seem to use the same file to fine-tune the TableTextRetriever as well as the FARMReader ("Issues.csv.json"). How does your train file look like? In general, retriever and reader train files have a different format.
You're FARMReader is running on CPU as you set use_gpu=False in the reader's train method. That's probably the reason why executing you're pipeline is quite slow.

Please let me ask you a few question to better be able to help you:

Can you elaborate more on your use case, i.e., what kind of information do you need to extract from what kind of data?
How do the documents you want to query on look like? Do they consist of tables and texts or only one of these modalities?
Might metadata filtering work for your use case, i.e., adding the IDs as metadata to your documents and filter for them during querying (see for example here on how to define filter)?

3 replies

Tsar06 Apr 13, 2022
Author

Thanks @bogdankostic for your time answering as it seems mypost was a little messy with the various fonts as i failed embedding code.

The model I normally use in some other apps with sentence_transformers are 'all-mpnet-base-v2' or 'paraphrase-multilingual-mpnet-base-v2' which work fine with my data for embedding and similarity searching. I tried to use with Haystack but run into a bunch of problems with Faiss I couldn't understand and dropped the ball for now.

I am know trying to make it work with the same type of models used in the Haystack examples but then now face other types of issues as I am not sure deepset/bert-small-mm_retrieval-question_encoder + deepset/bert-small-mm_retrieval-passage_encoder are tokenizing some key words in the data such as "1234-CC-2021" for instance during fine-tuning.

Happy if you have an example that works with 'all-mpnet-base-v2' embedding data in document_store with Faiss (I'll move the Elastic later) for TableTextRetriever? Maybe i should rather use simple TextRetreiver and split Tables separately in a different data store using TableRetriever alone?

About the json source I use: I should have mentioned. I have created a special haystack.nodes.file_converter.csv CSVToTextConverter class, amended convert_files_to_dicts from hastack preprocessing.py code to use it. Now I guess Haystack can process (my) CSV data. It basically read a processor.json which holds the CSV columns in the datafile needed for training/embedding. It does a little more but here is the code for your convinience.

Additions in preprocessing.py

`
from typing import Callable, Dict, List, Optional

import re, io, json
import logging
from pathlib import Path

from haystack.nodes.file_converter import BaseConverter, DocxToTextConverter, PDFToTextConverter, TextConverter, HtmlHrefToTextConverter, CSVToTextConverter

def convert_files_to_dicts(
dir_path: str, clean_func: Optional[Callable] = None, split_paragraphs: bool = False, encoding: Optional[str] = None
) -> List[dict]:
"""
Convert all files(.txt, .pdf, .docx, ...) in the sub-directories of the given path to Python dicts that can be written to a
Document Store.

:param dir_path: path for the documents to be written to the DocumentStore
:param clean_func: a custom cleaning function that gets applied to each doc (input: str, output:str)
:param split_paragraphs: split text in paragraphs.
:param encoding: character encoding to use when converting pdf documents.
"""
file_paths = [p for p in Path(dir_path).glob("**/*")]
allowed_suffixes = [".pdf", ".txt", ".docx", ".html", ".csv"]
suffix2converter: Dict[str, BaseConverter] = {}

suffix2paths: Dict[str, List[Path]] = {}
for path in file_paths:
    file_suffix = path.suffix.lower()
    if file_suffix in allowed_suffixes:
        if file_suffix not in suffix2paths:
            suffix2paths[file_suffix] = []
        suffix2paths[file_suffix].append(path)
    elif not path.is_dir():
        logger.warning(
            "Skipped file {0} as type {1} is not supported here. "
            "See haystack.file_converter for support of more file types".format(path, file_suffix)
        )

# No need to initialize converter if file type not present
for file_suffix in suffix2paths.keys():
    if file_suffix == ".pdf":
        suffix2converter[file_suffix] = PDFToTextConverter()
    elif file_suffix == ".txt":
        suffix2converter[file_suffix] = TextConverter()
    elif file_suffix == ".docx":
        suffix2converter[file_suffix] = DocxToTextConverter()
    elif file_suffix == ".html":
        suffix2converter[file_suffix] = HtmlHrefToTextConverter()
    elif file_suffix == ".csv":
        suffix2converter[file_suffix] = CSVToTextConverter()

documents = []
for suffix, paths in suffix2paths.items():
    for path in paths:
        if encoding is None and suffix == ".pdf":
            encoding = "Latin1"
        logger.info("Converting {}".format(path))
        if suffix == ".html":
            document_dicts = suffix2converter[suffix].convert(file_path=path, meta=None, encoding=encoding,) # HtmlHrefToTextConverter returns a list of dicts
        elif suffix == ".csv":
            # load json cols
            # Use columns only if file exists
            trainingColumns= []
            try:
                print("Loading ", f'{dir_path}/Training.json')
                with io.open(f'{dir_path}/Training.json', "r", encoding='utf-8', errors='ignore') as fjson:
                        sjson= json.load(fjson)["Training"]
                        trainingColumns = sjson["TextColumnsToUse"]
                        if "HrefColumn" in sjson:
                                href = sjson["HrefColumn"]
                        if "YakeColumns" in sjson:
                                YakeColumns = sjson["YakeColumns"]
                        if "url" in sjson:
                                url = sjson["url"]
                                trainingColumns= [] # if url, all columns used as clustering processor did the work already
                        if "title" in sjson:
                                title = sjson["title"]
                        meta= {"title": title, "section_title": "TBD"}
            except:
                    print("Preprocessing CSV with no Training.json!")
                    meta= {"title": "Issues", "section_title": "Issues Pending", "name": file_path.name}

            document_dicts = suffix2converter[suffix].convert(file_path=path, meta=meta, trainingColumns=trainingColumns, encoding=encoding)
            documents = documents + document_dicts
            continue

        else:
            document_dicts = suffix2converter[suffix].convert(file_path=path, meta=None, encoding=encoding) # PDFToTextConverter, TextConverter, and DocxToTextConverter return a list containing a single dict
        
        for document in document_dicts:

            text = document["content"]

            if clean_func:
                text = clean_func(text)

            if split_paragraphs:
                for para in text.split("\n\n"):
                    if not para.strip():  # skip empty paragraphs
                        continue
                    documents.append({"content": para, "meta": {"name": path.name}})
                    #print("APPEND PARA=", {"content": para, "meta": {"name": path.name}})
            else:
                documents.append({"content": text, "meta": {"name": path.name}})
                #print("APPEND TEXT=", {"content": text, "meta": {"name": path.name}})

return documents

`

The code for csv.py follows:

`import logging
#import re
from pathlib import Path
from typing import Any, Dict, List, Optional

from pprint import pprint # if use pretty-printer
import json
import pandas as pd

try:
from bs4 import BeautifulSoup
from markdown import markdown

except (ImportError, ModuleNotFoundError) as ie:
from haystack.utils.import_utils import _optional_component_not_installed

_optional_component_not_installed(__name__, "preprocessing", ie)

from haystack.nodes.file_converter import BaseConverter

logger = logging.getLogger(name)

utils -> prepocessing -> csv

class CSVToTextConverter(BaseConverter):
def convert(
self,
file_path: Path,
meta: Optional[Dict[str, str]] = None,
remove_numeric_tables: Optional[bool] = None,
valid_languages: Optional[List[str]] = None,
encoding: Optional[str] = "utf-8",
trainingColumns: Optional[list] = []
) -> List[Dict[str, Any]]:
"""
Reads text from a csv file and executes optional preprocessing steps.

    :param file_path: path of the file to convert
    :param meta: dictionary of meta data key-value pairs to append in the returned document.
    :param encoding: Select the file encoding (default is `utf-8`)
    :param remove_numeric_tables: Not applicable
    :param valid_languages: Not applicable

    :return: Dict of format {"text": "The text from file", "meta": meta}}
    """
    if len(trainingColumns) >0:
        df = pd.read_csv(file_path, dtype=str, usecols=trainingColumns, encoding='utf-8')
    else:
        df = pd.read_csv(file_path, dtype=str, encoding='utf-8')
    # clean a bit
    df.fillna(value="", inplace=True)
    if ord(df.columns[0][0]) == 65279:
        df = df.rename(columns={df.columns[0]: df.columns[0][1:]}) # remove BOM if exists
    #df["question"] = df["question"].apply(lambda x: x.strip())
    print(df.head())

    documents = []
    #from haystack import Document
    
    for node, text in enumerate(df['Issue ID']):

        key= text

        documents.append( {
            "content": "== Current Status Comments for issue " + key + " ==\n" +
                      df["Current Status Comments"][node],
            "content_type": "text",
            "meta": {"title": "Issue", "section_title": key, "name": file_path.name},
            "id": node * 10000 + 1
        })

        documents.append( {
            "content": "== Issue Summary for issue " + key + " ==\n" +
                     df["Issue Summary"][node],
            "content_type": "text",
            "meta": {"title": "Issue", "section_title": key, "name": file_path.name},
            "id": node * 10000 + 2
        })

        documents.append( {
                "content": "== Remediation Summary for issue " + key + " ==\n" +
                     df["Remediation Summary"][node],
            "content_type": "text",
            "meta": {"title": "Issue", "section_title": key, "name": file_path.name},
            "id": node * 10000 + 3
        })

    pprint(documents[0:12])
    
    # Add tables
    documents.append( {
        "content": df[['Issue ID',"Country","Issue Rating","Issue Status"]],
        "content_type": "table",
        "meta": meta,  # TODO fix this
        "id": node * 10000 + 0
    })
    
    # Build json for Fine-Tuning Reader models
    data = []
    # .format("", *[arg.replace('"','""') for arg in record])
    for node, text in enumerate(df['Issue ID']):

        key= text

        qas= [("Current Status Comments", f"What are the Current Status Comments for issue {key}", df["Current Status Comments"][node], node * 10000 + 1),
              ("Issue Summary", f"What is the Issue Summary for issue {key}", df["Issue Summary"][node], node * 10000 + 2),
              ("Remediation Summary", f"What is the Remediation Summary for issue {key}", df["Remediation Summary"][node], node * 10000 + 3)
            ]

        for q in qas:
            para = {'context': q[2], 'qas': [{'question': q[1], 'answers': []}], 'title':q[0], 'section_title':key}
            data.append({'paragraphs': [para]})
            qa = para['qas'][0]
            qa['id'] = q[3]
            qa['answers'].append({'text': q[2], 'answer_start': 0})
            qa['is_impossible'] = False

    quac_as_squad = {'data': data, 'version': '2.0'}

    with open(str(file_path)+".json", 'w+', encoding='utf-8') as outfile:
        json.dump(quac_as_squad, outfile, indent=2, sort_keys=True, ensure_ascii=False)

    # Build json for Fine-Tuning Retriever models
    """
            :param file: filename of DPR data in json format
            Each sample is a dictionary of format:
            {"question": str,
            "answers": list of str
            "positive_ctxs": list of dictionaries of format
                {'title': str, 'text': str, 'passage_id': str, 'type': 'text', 'source': str}
                or
                {'page_title': str, 'section_title': str, 'caption': str, 'columns': list of str,
                 'rows': list of list of str, 'type': 'table', 'source': str}
            "hard_negative_ctxs": list of dictionaries of format
                {'title': str, 'text': str, 'passage_id': str, 'type': 'text', 'source': str}
                or
                {'page_title': str, 'section_title': str, 'caption': str, 'columns': list of str,
                 'rows': list of list of str, 'type': 'table', 'source': str}
            }
    """
    return documents

`

This is all test code guys, don't catch me on style or hardcoding stuffs for now... :) I can't show the data as it is highly confidential but happy to share the code.

So obviously what you'll see in there will answer several points you raised. The json generated data file comes from the second part of processing the CSV to be used for fine-tuning down the road. The first part returns the documents array as preprocessing.py expects. Documents are "text" and "table". I have basically split the full-text columns of the data in "text" doculents and created for the test a big "table" document with several other columns I would expect to be retrieved by TableTextRetriever and used in the Pipeline to calculate Counts or Sums etc. providing calculated answers when needed. I thought here to use meta data to replicate the key labels which I need. Obviously this is where my knowledge ends, the Ids are not correctly embedded in the document-store Faiss index. So when I query the model with something like "What is the Rating for 1234-CC-2021" it returns crap as it does not use "1234-CC-2021" obviously. I thend to think the tokenizers I used underneath didn't index these "words".

Example: the CSV data is made of columns containing labels, numerical information, or full texts. The idea I had was to split the full texts columns into separate passage documents (3 text docs for each CSV data record), and to insert other labels/numerics columns into a document table which could be also used for querying answers. Good or bad idea?

Hope it makes better sense. Any better strategy? I understand the filtering option you mention which looks good. Not sure if it'll be useful for my case. But probably yes to help the Retriever to select the correct records. Once I have identify some fixes for the above, i will look into optimizing with this idea. Thanks.

Thanks I undersand the potential CPU vs GPU issue creating slow processing for Training with CPU but obviously I am hitting memory allocation wall with my single 4Gb GPU. Such as RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 2.57 GiB already allocated; 0 bytes free; 2.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

However, once the training is done I do use one single GPU in the pipeline. As the document table is 250 rows long, I can tell query like below takes too long to be used in a chatbot for instance.

predictions = text_table_qa_pipeline.run(query="How many Past Due") #print(f"Predicted answer: {predictions['answers'][0].answer}") #print(f"Meta field: {predictions['answers'][0].meta}") print_answers(predictions, details="minimum")
Code finally once fine-tunings are done can be summarize like this:

`
retriever = TableTextRetriever(
document_store=document_store,
query_embedding_model=f"{frootModel}/deepset/question_encoder_fine-tuned", #/facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model=f"{frootModel}/deepset/passage_encoder_fine-tuned", #facebook/dpr-ctx_encoder-single-nq-base",
table_embedding_model=f"{frootModel}/deepset/table_encoder_fine-tuned",
embed_meta_fields=["title", "section_title"],
max_seq_len_query= 64, max_seq_len_passage= 256, max_seq_len_table= 256, top_k= 10, use_gpu= True, batch_size= 2
)

print("Load TableReader")
table_reader = TableReader(model_name_or_path=f"{frootModel}/deepset/tapas-large-nq-hn-reader",
                        use_gpu= True, top_k= 10, top_k_per_candidate= 3, max_seq_len=256) #google/tapas-base-finetuned-wtq

print("reLoad Text Reader")
text_reader = FARMReader(save_dir, use_gpu=True, num_processes=1)

route_documents = RouteDocuments()
join_answers = JoinAnswers()

print("Start Pipeline")
text_table_qa_pipeline = Pipeline()
text_table_qa_pipeline.add_node(component=retriever, name="TableTextRetriever", inputs=["Query"])
text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["TableTextRetriever"])
text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])

`

T.

Tsar06 Apr 15, 2022
Author

Searching more on the problem, I think what I need is to add the missing inventory tokens to vocabulary for the retriever/reader models. Obviously if these tokens are needed for searching but are missing from the vocab, i.e. replaced with UNK I guess, the transformers will not do any good findings with Dense searching.
Anyone can confirm I'm correct?
If so, anyone with experience in this? Something like huggingface/transformers#1413 ?

brandenchan Apr 25, 2022

Hi @Tsar06 , so we're more than happy to help but it is not really feasible for us to read through such a long piece. Could you please try to summarize what you are doing and what you need help with in about 2-3 paragraphs?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Testing TableTextRetriever on some data for Q&A #2395

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Testing TableTextRetriever on some data for Q&A #2395

Uh oh!

Tsar06 Apr 7, 2022

Above example obviously records with fine-tuned 19718-CC-2021 are not found by the retriever.

same thing here PL3 should be retreived as values from some tables which does not work either.

last but not least above same thing Issue Rating is a table field name but retriever does not provide correct records as a start in the pipeline as 19718-CC-2021 is skipped.

Replies: 1 comment · 3 replies

Uh oh!

bogdankostic Apr 12, 2022

Uh oh!

Tsar06 Apr 13, 2022 Author

utils -> prepocessing -> csv

Uh oh!

Tsar06 Apr 15, 2022 Author

Uh oh!

brandenchan Apr 25, 2022

Tsar06
Apr 7, 2022

Replies: 1 comment 3 replies

bogdankostic
Apr 12, 2022

Tsar06 Apr 13, 2022
Author

Tsar06 Apr 15, 2022
Author