This repository contains the code to extract financial and non-financial indicators from company reports (PDF files). Given a query regarding financial and non-financial indicators (like the ones shown in `en_gris.json`), the script returns the most relevant pages, as well as any numerical results we might be searching for.
Go to the root of the repository, create a conda virtual environment, and activate it:

```
conda create -n env
conda activate env
```
Install the Tesseract OCR engine, which is needed to use either Unstructured or Deepdoctection for table extraction.
Then, copy the file `sample_config/.env` to the root of the repository and fill in the missing values.
Build and run the docker-compose file. The container hosts the pgvector database containing the embeddings extracted from the PDF files:

```
sudo docker compose build
sudo docker compose up
```
To store the semantic embeddings in the database, run:

```
PYTHONHASHSEED=0 python3 main.py --pdf [PDF_PATH] --embed --use_dense --model_name [MODEL_NAME]
```
where

- `PYTHONHASHSEED=0` is an environment variable that makes the `hash` function deterministic. `hash` is used to parse document chunks, producing a unique id that serves as the primary key in the embedding database. This way, the system skips computing a document's embedding if it is already stored in the database;
- `PDF_PATH` can be either a single PDF file or a directory containing PDF files;
- `--embed` and `--use_dense` indicate that the system should embed the documents using the model `MODEL_NAME` (taken from Hugging Face). By default, `MODEL_NAME="intfloat/multilingual-e5-large-instruct"`.
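The same deduplication idea can be sketched without relying on `PYTHONHASHSEED` at all: a content-derived id built with `hashlib` is deterministic across runs regardless of the interpreter's hash seed. This is an illustrative sketch, not the repository's implementation; the helper name `chunk_id` is hypothetical.

```python
import hashlib

def chunk_id(pdf_path: str, page: int, text: str) -> str:
    # Derive a stable primary key from the chunk's provenance and content.
    # hashlib (unlike the built-in hash()) does not depend on PYTHONHASHSEED.
    payload = f"{pdf_path}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# The same chunk always maps to the same id, so re-embedding can be
# skipped whenever the id is already present in the database.
```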
To store the data for the sparse embeddings, run:

```
PYTHONHASHSEED=0 python3 main.py --pdf [PDF_PATH] --embed --use_sparse --syn_model_name [SYN_MODEL_NAME]
```
where

- `PYTHONHASHSEED=0` is the same as above;
- `PDF_PATH` can be either a single PDF file or a directory containing PDF files;
- `--embed` and `--use_sparse` indicate that the system should embed the documents using the model `SYN_MODEL_NAME`. By default, `SYN_MODEL_NAME="tf_idf"`.
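To give an intuition for what the default `tf_idf` sparse model does, here is a toy sketch (not the repository's implementation): each document becomes a sparse word-to-weight mapping, where words concentrated in few documents receive higher weights.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Toy TF-IDF: map each document to a sparse {word: weight} vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n / df[word])
            for word, count in tf.items()
        })
    return vectors

vectors = tf_idf(["net revenue increased", "revenue decreased", "board meeting"])
# A word unique to one document ("net") outweighs a word shared
# across documents ("revenue").
```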
To query the embeddings, run:

```
python3 main.py --pdf [PDF_PATH] --query [QUERY_STRING] --use_dense --model_name [MODEL_NAME] --k [TOP_K_RESULTS]
```
This command will return the top-k document chunks (i.e. document pages) obtained from the dense (semantic) query.
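Conceptually, the dense query embeds the query string and ranks the stored page embeddings by similarity; pgvector does this inside the database, but the pure-Python sketch below (with made-up two-dimensional embeddings) shows the idea.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, page_vecs, k):
    """Return the ids of the k pages most similar to the query embedding."""
    ranked = sorted(page_vecs, key=lambda p: cosine(query_vec, page_vecs[p]),
                    reverse=True)
    return ranked[:k]

pages = {"p1": [1.0, 0.0], "p2": [0.6, 0.8], "p3": [0.0, 1.0]}
top_k([1.0, 0.1], pages, k=2)  # → ["p1", "p2"]
```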
To do the same for the sparse (syntactic) embeddings, run:

```
python3 main.py --pdf [PDF_PATH] --query [QUERY_STRING] --use_sparse --syn_model_name [MODEL_NAME] --k [TOP_K_RESULTS]
```
The ensemble method leverages both the semantic and the syntactic retrieval modes to further improve the system. To use the ensemble, run:

```
python3 main.py --pdf [PDF_PATH] --query [QUERY_STRING] --use_ensemble --model_name [MODEL_NAME] --syn_model_name [SYN_MODEL_NAME] --k [TOP_K_RESULTS] --lambda [LAMBDA_VALUE]
```
The additional parameter `--lambda` is a scalar value that controls the importance of syntactic features over semantic ones: the higher the value, the more importance is given to `SYN_MODEL_NAME` (e.g. `tf_idf`).
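A common way to realise such an ensemble is a convex combination of the two per-page score lists. The sketch below is a hypothetical illustration of how `--lambda` could weight syntactic against semantic scores; the repository's actual fusion formula may differ.

```python
def ensemble_scores(dense, sparse, lam):
    """Blend per-page scores: lam weights the sparse (syntactic) side,
    (1 - lam) the dense (semantic) side."""
    pages = set(dense) | set(sparse)
    return {
        p: lam * sparse.get(p, 0.0) + (1.0 - lam) * dense.get(p, 0.0)
        for p in pages
    }

dense = {"p1": 0.9, "p2": 0.4}
sparse = {"p1": 0.2, "p2": 0.8}
# With lam = 1.0 the ranking follows the sparse model; with lam = 0.0
# it follows the dense model.
```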
In the final prompt that computes the answer, the model by default extracts the tables inside the document pages. This improves the accuracy of the model, but it is slower. To skip the table-extraction phase, use the `--fast` flag.
Run the file `test.py` with:

```
python3 test.py --pdf [PDF_PATH] --use_[dense|sparse|ensemble] --model_name [MODEL_NAME] --syn_model_name [SYN_MODEL_NAME] --checkpoint_rate [CHECKPOINT_RATE]
```

where `--checkpoint_rate` is the saving frequency. The result files will be stored in the `tests/` directory.