
VADIS Pipeline

Main pipeline containing essential parts of the VADIS Project.

View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. License
  6. Contact
  7. Citation

About The Project

This project provides an in-depth look at the main pipeline of the VADIS Project. The process starts with the selection and crawling of datasets and publications. These documents are then preprocessed so that the data is clean and structured for the main tasks. The main tasks of VADIS, Variable Identification and Summarization, are then executed, and their outputs are used to construct VADIS Data. VADIS Data is then forwarded to the VADIS Demo. The details of each process are as follows:

(External) Data Retrieval

  • Publication Selection (p0): This process uses predefined queries (config) to retrieve SSOAR publications with related research datasets from the GESIS Search Index (config). Query results are saved in corpus.

  • Dataset Crawling (p1): For each publication in the query results, this process sends a POST request for each related research dataset and its survey variables, and crawls them if they are available on the GESIS Search Index.

  • Publication Crawling (p1): This process crawls the SSOAR website and retrieves the PDF file of every publication that has at least one downloaded research dataset. Only publications whose file is available on the SSOAR server are downloaded; publications whose PDF is only available externally are skipped.
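
The selection rule described under Publication Crawling can be sketched as a small filter. This is only an illustration, not the pipeline's actual code: the record fields (`id`, `datasets`, `downloaded`, `pdf_url`), the host name `www.ssoar.info`, and the sample URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Assumed host name used to decide whether a PDF is stored on the SSOAR server.
SSOAR_HOST = "www.ssoar.info"


def is_downloadable(publication: dict) -> bool:
    """Return True if the publication's PDF should be downloaded."""
    # Rule 1: at least one related research dataset was downloaded.
    has_dataset = any(d.get("downloaded") for d in publication.get("datasets", []))
    # Rule 2: the PDF is hosted on the SSOAR server, not on an external site.
    pdf_url = publication.get("pdf_url") or ""
    on_ssoar = urlparse(pdf_url).netloc == SSOAR_HOST
    return has_dataset and on_ssoar


def select_publications(corpus: list[dict]) -> list[str]:
    """Return the ids of all publications that pass both rules."""
    return [p["id"] for p in corpus if is_downloadable(p)]


# Hypothetical query results: only pub-1 satisfies both rules.
corpus = [
    {"id": "pub-1", "datasets": [{"downloaded": True}],
     "pdf_url": "https://www.ssoar.info/files/pub-1.pdf"},
    {"id": "pub-2", "datasets": [{"downloaded": False}],
     "pdf_url": "https://www.ssoar.info/files/pub-2.pdf"},
    {"id": "pub-3", "datasets": [{"downloaded": True}],
     "pdf_url": "https://example.org/pub-3.pdf"},
]
selected = select_publications(corpus)
print(selected)
```

Keeping the rule in a pure function, separate from any network code, makes it easy to check against a handful of hand-written records before a full crawl.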

Preprocessing

  • PDF Parsing (p3): Using a running GROBID server (see config file), the full text of every downloaded SSOAR publication PDF is extracted in JSON format.

  • Text Processing (p4): The JSON full texts are processed and then split into sentences.
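
The Text Processing step can be illustrated with a short sketch. The layout of the JSON full text (a `sections` list with `title` and `text` fields) is an assumption, and the regular-expression splitter is a naive stand-in; the splitter the pipeline actually uses is not stated here.

```python
import json
import re

# Naive rule: a sentence ends at '.', '!' or '?' when followed by
# whitespace and an uppercase letter. Abbreviations will be split wrongly.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    """Collapse whitespace and split the text into sentences."""
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return _SENTENCE_BOUNDARY.split(normalized)


def sentences_from_full_text(raw_json: str) -> list[str]:
    """Collect the sentences of every section in a JSON full text."""
    document = json.loads(raw_json)
    sentences: list[str] = []
    for section in document.get("sections", []):  # assumed layout
        sentences.extend(split_sentences(section.get("text", "")))
    return sentences


# Hypothetical full text with irregular whitespace left over from PDF parsing.
raw = json.dumps({"sections": [
    {"title": "Data", "text": "We use survey data.  Trust was measured\non a scale."},
    {"title": "Results", "text": "Trust increased!"},
]})
result = sentences_from_full_text(raw)
print(result)
```

A rule-based splitter like this is easy to read but brittle; a trained sentence tokenizer is usually the better choice for scientific text.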

VADIS Tasks

Output - VADIS Data

  • Merge and Format (p7): The summarization and variable identification outputs of each publication are merged, and the merged data is enriched with some metadata. This results in one JSON file of VADIS Data per publication, ready to be inserted into the VADIS Elastic Index.
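
A minimal sketch of the merge step follows. The field names (`summary`, `variable_sentences`, `metadata`) and the sample values are hypothetical; the actual VADIS Data schema is not described in this README.

```python
import json


def merge_outputs(publication_id: str, summary: str,
                  variable_matches: list[dict], metadata: dict) -> dict:
    """Combine both task outputs and the metadata into one record."""
    return {
        "id": publication_id,
        "summary": summary,                      # output of summarization
        "variable_sentences": variable_matches,  # output of variable identification
        "metadata": metadata,                    # enrichment, kept under its own key
    }


# Hypothetical outputs for a single publication.
record = merge_outputs(
    publication_id="pub-1",
    summary="The study examines trust in institutions.",
    variable_matches=[{"sentence": "Trust was measured on a scale.",
                       "variables": ["var-42"]}],
    metadata={"title": "Trust in Institutions", "year": 2020},
)

# Serialize the record as a JSON document, e.g. for writing one file per publication.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
print(serialized)
```

Nesting the metadata under its own key avoids silently overwriting task outputs when a metadata field happens to share a name with one of them.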

Prerequisites

TODO: describe prerequisites

  • Python 3 with the venv module and pip (both are used in the installation steps below)

Installation

  1. Clone the repo
    git clone https://github.com/vadis-project/vadis-pipeline.git
  2. Create and activate a virtual environment
    python3 -m venv venv
    source venv/bin/activate
  3. Install Python packages
    pip install -r requirements.txt
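
To confirm that the virtual environment is active before installing packages, a short check like the following can help. It is an optional convenience sketch, not part of the repository; the function name `in_virtualenv` is made up for the example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```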

License

See LICENSE for more information.

Contact

Yavuz Selim Kartal - yavuzselim.kartal@gesis.org

Citation

@misc{kartal2023vadis,
      title={VADIS -- a VAriable Detection, Interlinking and Summarization system}, 
      author={Yavuz Selim Kartal and Muhammad Ahsan Shahid and Sotaro Takeshita and Tornike Tsereteli and Andrea Zielinski and Benjamin Zapilko and Philipp Mayr},
      year={2023},
      eprint={2312.13423},
      archivePrefix={arXiv},
      primaryClass={cs.DL}
}
