Main pipeline containing essential parts of the VADIS Project.
This project provides an in-depth look at the main pipeline of the VADIS Project. The process starts with the selection and crawling of datasets and publications. These documents are then preprocessed so that the data is clean and structured for the main tasks. The main tasks of VADIS, Variable Identification and Summarization, are then executed, and their outputs are used to construct VADIS Data. VADIS Data is then forwarded to VADIS Demo. Details of each process are as follows:
- Publication Selection (p0): This process uses predefined queries (config) to retrieve SSOAR publications with related research datasets from the GESIS Search Index (config). Query results are saved in corpus.
- Dataset Crawling (p1): For each publication in the query results, this process sends a POST request for each related research dataset and its survey variables, and crawls them if they are available in the GESIS Search Index.
- Publication Crawling (p1): This process crawls the SSOAR website and retrieves the PDF file of every publication that has at least one downloaded research dataset. Only publications whose file is available on the SSOAR server are downloaded, not those whose PDF is only available externally.
- PDF Parsing (p3): Using the GROBID server (see config file), the full text of every downloaded SSOAR publication PDF is extracted in JSON format.
- Text Processing (p4): The JSON full texts are processed and split into sentences.
- Summarization (p6_sum): https://github.com/vadis-project/vadis_summarization_api
- Variable Identification: This task is handled with two main methodologies. The first uses fuzzy search to match variables in the full text (p6_sm). The second features supervised and unsupervised methods for variable identification (p6_auto_pre, p6_auto): https://github.com/vadis-project/sv-ident.
- Merge and Format (p7): The summarization and variable identification outputs of each publication are merged, and the result is enriched with some metadata. This produces JSON files of VADIS Data for all publications, ready to be inserted into the VADIS Elastic Index.
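To make the text processing and fuzzy search steps more concrete, here is a minimal sketch of sentence splitting followed by fuzzy matching of variable labels against sentences. It is not the code used in p4 or p6_sm. The function names (`split_sentences`, `fuzzy_score`, `match_variables`), the regex-based splitter, the use of `difflib.SequenceMatcher` from the Python standard library, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def fuzzy_score(sentence, label):
    """Best similarity between the label and any word window of the same length."""
    words = sentence.lower().split()
    target = label.lower()
    n = len(target.split())
    # Slide a window of n words over the sentence and compare each one to the label.
    windows = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return max(SequenceMatcher(None, w, target).ratio() for w in windows)


def match_variables(sentences, variables, threshold=0.8):
    """Return one record per (sentence, variable) pair whose score reaches the threshold.

    `variables` maps a hypothetical variable id to its label text.
    """
    matches = []
    for idx, sentence in enumerate(sentences):
        for var_id, label in variables.items():
            score = fuzzy_score(sentence, label)
            if score >= threshold:
                matches.append(
                    {"sentence_index": idx, "variable_id": var_id, "score": score}
                )
    return matches


if __name__ == "__main__":
    text = "We analyse trust in parliment across countries. Age was recorded in years."
    sentences = split_sentences(text)
    for hit in match_variables(sentences, {"v1": "trust in parliament"}):
        print(hit["variable_id"], "->", sentences[hit["sentence_index"]])
```

Comparing word windows rather than whole sentences keeps a short label from being penalized by the length of the sentence around it, so a misspelled mention such as "parliment" can still match. A production pipeline would likely use a dedicated sentence segmenter and a faster fuzzy matching library.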
TODO: describe prerequisites
- python
npm install npm@latest -g
- Clone the repo
git clone https://github.com/vadis-project/vadis-pipeline.git
- Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate
- Install python packages
pip install -r requirements.txt
See LICENSE for more information.
Yavuz Selim Kartal - yavuzselim.kartal@gesis.org
@misc{kartal2023vadis,
title={VADIS -- a VAriable Detection, Interlinking and Summarization system},
author={Yavuz Selim Kartal and Muhammad Ahsan Shahid and Sotaro Takeshita and Tornike Tsereteli and Andrea Zielinski and Benjamin Zapilko and Philipp Mayr},
year={2023},
eprint={2312.13423},
archivePrefix={arXiv},
primaryClass={cs.DL}
}