VADIS Pipeline

Main pipeline containing essential parts of the VADIS Project.

View Demo · Report Bug · Request Feature

Table of Contents

About The Project
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
License
Contact
Citation

About The Project

This project provides an in-depth look at the main pipeline of the VADIS Project. The process begins with the selection and crawling of datasets and publications. These documents are then preprocessed so that the data is clean and structured for the main tasks. The main tasks of VADIS, Variable Identification and Summarization, are then executed, and their outputs are used to construct VADIS Data. VADIS Data is then forwarded to VADIS Demo. Details of each process are as follows:

(External) Data Retrieval

Publication Selection (p0): This process uses predefined queries (config) to retrieve SSOAR publications with related research datasets from the GESIS Search Index (config). Query results are saved in corpus.
Dataset Crawling (p1): For each publication in the query results, this process sends a POST request for each related research dataset and its survey variables, and crawls them if they are available on the GESIS Search Index.
Publication Crawling (p1): This process crawls the SSOAR website and retrieves the PDF file of every publication that has at least one downloaded research dataset. Only publications whose file is available on the SSOAR server are downloaded, not those whose PDF is only available externally.
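
The two download conditions described for Publication Crawling can be sketched as a simple filter. This is a minimal illustration, not the code in the repository: the record layout and the field names `related_datasets` and `pdf_on_ssoar` are assumptions made for the example.

```python
from __future__ import annotations


def select_publications_to_download(
    publications: list[dict], downloaded_dataset_ids: set[str]
) -> list[str]:
    """Return IDs of publications that pass both crawling conditions."""
    selected = []
    for pub in publications:
        # Condition 1: at least one related dataset was actually downloaded.
        has_dataset = any(
            ds_id in downloaded_dataset_ids
            for ds_id in pub.get("related_datasets", [])
        )
        # Condition 2: the PDF is hosted on the SSOAR server itself.
        # "pdf_on_ssoar" is a hypothetical flag, not a real SSOAR field.
        if has_dataset and pub.get("pdf_on_ssoar", False):
            selected.append(pub["id"])
    return selected


# Hypothetical query results: only pub-1 satisfies both conditions.
publications = [
    {"id": "pub-1", "related_datasets": ["ds-1"], "pdf_on_ssoar": True},
    {"id": "pub-2", "related_datasets": ["ds-9"], "pdf_on_ssoar": True},
    {"id": "pub-3", "related_datasets": ["ds-1", "ds-2"], "pdf_on_ssoar": False},
]
to_download = select_publications_to_download(publications, {"ds-1", "ds-2"})
print(to_download)
```

Here pub-2 is skipped because none of its datasets were downloaded, and pub-3 is skipped because its PDF is only available externally.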

Preprocessing

PDF Parsing (p3): Using the GROBID server specified in the config file, the full text of every downloaded SSOAR publication PDF is extracted in JSON format.
Text Processing (p4): The JSON full texts are processed and split into sentences.
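
As a rough illustration of the Text Processing step, the sketch below splits paragraphs of a parsed full text into sentences. It rests on two assumptions: the JSON layout (`sections` containing `paragraphs`) is invented for the example, and the regular-expression splitter is a naive stand-in for whatever sentence segmenter the pipeline actually uses.

```python
from __future__ import annotations

import re

# Naive boundary: sentence-ending punctuation, whitespace, then a capital letter.
# This will mis-split abbreviations such as "e.g. Germany"; it is only a sketch.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph: str) -> list[str]:
    """Split one paragraph into sentences using the naive boundary rule."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(paragraph) if s.strip()]


def sentences_from_full_text(full_text: dict) -> list[str]:
    """Collect sentences from a full text with a hypothetical JSON layout."""
    sentences = []
    for section in full_text.get("sections", []):
        for paragraph in section.get("paragraphs", []):
            sentences.extend(split_sentences(paragraph))
    return sentences


# Hypothetical parsed document with one section and one paragraph.
full_text = {
    "sections": [
        {"paragraphs": ["We use survey data. Trust was measured on a scale. Why?"]}
    ]
}
sentences = sentences_from_full_text(full_text)
print(sentences)
```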

VADIS Tasks

Summarization (p6_sum): https://github.com/vadis-project/vadis_summarization_api
Variable Identification: This task is handled with two main methodologies. The first uses fuzzy search to match variables in the full text (p6_sm). The second uses supervised and unsupervised methods for variable identification (p6_auto_pre, p6_auto): https://github.com/vadis-project/sv-ident.
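
The fuzzy-search idea behind the first methodology can be sketched with Python's standard `difflib` module. This is an illustration under assumptions, not the matching logic of p6_sm: the similarity measure, the threshold of 0.8, the function names, and the variable IDs and labels are all chosen for the example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores casing and spacing."""
    return " ".join(text.lower().split())


def fuzzy_match_variables(
    sentences: list[str], variables: dict[str, str], threshold: float = 0.8
) -> list[tuple[int, str, float]]:
    """Return (sentence index, variable ID, score) for pairs above the threshold."""
    matches = []
    for index, sentence in enumerate(sentences):
        for variable_id, label in variables.items():
            # ratio() is a similarity score between 0.0 and 1.0.
            score = SequenceMatcher(
                None, normalize(sentence), normalize(label)
            ).ratio()
            if score >= threshold:
                matches.append((index, variable_id, round(score, 3)))
    return matches


# Hypothetical sentences and survey variable labels.
sentences = [
    "The survey was conducted in 2018.",
    "Trust in the federal goverment",  # typo on purpose: still matches fuzzily
]
variables = {"v1": "Trust in the federal government"}
matches = fuzzy_match_variables(sentences, variables)
print(matches)
```

Comparing whole sentences against whole labels keeps the sketch short; matching a short label inside a long sentence would need a windowed or partial-match comparison instead.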

Output - VADIS Data

Merge and Format (p7): The summarization and variable identification outputs of each publication are merged, and the output data is enriched with some metadata. This results in JSON files of VADIS Data for all publications, ready to be inserted into the VADIS Elastic Index.
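
The merge step can be pictured as combining the per-publication outputs into one JSON record. The sketch below is hypothetical: the function name and every field name (`id`, `summary`, `variables`, and the metadata keys) are assumptions for illustration and do not reflect the actual VADIS Data schema.

```python
from __future__ import annotations

import json


def build_vadis_record(
    pub_id: str, metadata: dict, summary: str, variable_matches: list[dict]
) -> dict:
    """Merge task outputs and metadata into a single record for one publication."""
    record = dict(metadata)  # copy so the caller's metadata is not modified
    # Task outputs are written last so they cannot be overwritten by metadata.
    record.update(
        {"id": pub_id, "summary": summary, "variables": variable_matches}
    )
    return record


# Hypothetical outputs of the summarization and variable identification tasks.
record = build_vadis_record(
    "pub-1",
    {"title": "An example title", "year": 2020},
    "A short summary.",
    [{"sentence_index": 3, "variable_id": "v1", "score": 0.98}],
)
serialized = json.dumps(record, ensure_ascii=False, indent=2)
print(serialized)
```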

Prerequisites

TODO: describe prerequisites

python
```
npm install npm@latest -g
```

Installation

Clone the repo

git clone https://github.com/vadis-project/vadis-pipeline.git

Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate

Install Python packages
```
pip install -r requirements.txt
```

License

See LICENSE for more information.

Contact

Yavuz Selim Kartal - yavuzselim.kartal@gesis.org

Citation

@misc{kartal2023vadis,
      title={VADIS -- a VAriable Detection, Interlinking and Summarization system}, 
      author={Yavuz Selim Kartal and Muhammad Ahsan Shahid and Sotaro Takeshita and Tornike Tsereteli and Andrea Zielinski and Benjamin Zapilko and Philipp Mayr},
      year={2023},
      eprint={2312.13423},
      archivePrefix={arXiv},
      primaryClass={cs.DL}
}

