Workspace for the group Web-B-Gone of the 'Big Data and Language Technologies' course SoSe22 (Copy of the original repository)

tobiasschreieder/web-b-gone

Web-B-Gone

Workspace for the group Web-B-Gone of the 'Big Data and Language Technologies' course SoSe22.

Setup

You can use Docker to set up this project. If you are not familiar with Docker, please visit the linked tutorial.

Clone this repository and build a Docker image from the Dockerfile. The image uses startup.py as its entrypoint. The program needs three directories to work correctly:

  • an input directory where the data is located (default: ./data)
  • a working directory where the index and other intermediate files are saved for reuse (default: ./working)
  • an output directory where the results are saved (default: ./out)

The directories can also be set in a config.json. In that case, pass the path to the config.json as the argument after -cfg.
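As a minimal sketch, the three default directories listed above can be created up front so they are ready to be used by the program. The helper name `ensure_dirs` is hypothetical and not part of this project, and whether startup.py creates missing directories on its own is not stated here.

```python
from pathlib import Path

# Default directory names taken from the list above.
# Assumption: you are not overriding them through a config.json.
DEFAULT_DIRS = ("data", "working", "out")


def ensure_dirs(base: Path = Path(".")) -> list[Path]:
    """Create the input, working, and output directories under `base` if missing."""
    created = []
    for name in DEFAULT_DIRS:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Example: ensure_dirs() creates ./data, ./working and ./out
# relative to the current working directory.
```

Calling `ensure_dirs()` from the repository root before building or running the image gives you existing host directories to mount or point the program at.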

Dataset

The dataset can be downloaded here. Before it can be used in this project, it must be restructured. To do this, start the program with the parameters -swde path/to/SWDE.zip -reswde. To compress the restructured SWDE dataset, use the parameter -cswde. To extract the compressed restructured SWDE dataset, use -e path/to/restruc_SWDE.zip.
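The dataset parameters above can be assembled as argument lists, for example to script the preparation steps. This is only a sketch: the helper `startup_command` is hypothetical, calling startup.py directly with the Python interpreter instead of through the Docker image is an assumption, and whether -cswde may be combined with the other parameters in one call is not specified here.

```python
import sys


def startup_command(*flags: str, script: str = "startup.py") -> list[str]:
    """Build an argument list that runs `script` with the given flags."""
    # Assumption: startup.py is run with the current Python interpreter.
    return [sys.executable, script, *flags]


# Restructure the downloaded SWDE archive.
restructure = startup_command("-swde", "path/to/SWDE.zip", "-reswde")

# Compress the restructured dataset.
compress = startup_command("-cswde")

# Extract a previously compressed restructured dataset.
extract = startup_command("-e", "path/to/restruc_SWDE.zip")

# To execute a step, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(restructure, check=True)
```

Keeping the flags in lists rather than one shell string avoids quoting problems when the dataset path contains spaces.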

Usage

The main method of startup.py contains example calls for the main functionality of this project, such as model training and evaluation. They are meant as illustrations and can be adjusted as needed.

Models

All trained and evaluated models are stored in the Git repository as working.zip. To use them, extract the archive and move its contents to the /working directory.

Paper

The entire project is based on the paper "Information-Extraction from websites with a NER-Approach", which is also included in the Git repository.
