humlab/retriever

A data processing tool designed to extract and process articles from text files exported by a system called "Retriever". The main functionalities include:

- Extracting Table of Contents: The function `get_toc` extracts the table of contents from a given file, identifying the relevant section and parsing its lines to collect metadata such as titles, sources, and dates.
- Extracting Articles: The function `get_articles` reads the articles from the file starting from a specified offset and splits the content into individual articles.
- Processing Articles: Various functions are provided to clean and process the articles, such as removing captions, stop words, and copyright strings.
- Creating a Corpus: The function `create_corpus` combines the table of contents and articles into a structured format, creating a DataFrame that can be further analyzed or exported.
- Handling Duplicates: The main function identifies and logs duplicate articles, saves unique articles to text files, and generates a CSV file with metadata.
- Logging and Output: The project uses the loguru library for logging and saves the processed articles and metadata to specified output folders.
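To make the article-processing step more concrete, here is a minimal sketch of two cleaning helpers. It is not the project's actual code: the function names `remove_stop_words` and `remove_copyright`, the word-boundary matching, and the rule that a copyright line starts with `©` are all assumptions made for illustration. Only the `|`-separated stop-word format comes from the options described in this README.

```python
import re


def remove_stop_words(text: str, stop_words: str) -> str:
    """Remove every word listed in a '|'-separated string (hypothetical helper)."""
    words = [w.strip() for w in stop_words.split("|") if w.strip()]
    if not words:
        return text
    # Escape each word and match it only on word boundaries, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub("", text)
    # Collapse runs of spaces left behind by removed words, keeping newlines.
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()


def remove_copyright(text: str) -> str:
    """Drop lines that start with a copyright sign (assumed rule, not the project's)."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("©")]
    return "\n".join(kept)
```

Escaping each stop word keeps punctuation in the list from being interpreted as regular-expression syntax; whether the real script treats the string as a plain word list or as a pattern is not stated here.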

The project is structured to handle multiple text files, process them, and save the results in a systematic and organized manner.
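The duplicate handling mentioned in the list above could look roughly like the following sketch. The README does not say how the project decides that two articles are duplicates, so comparing a hash of the whitespace-normalized, lowercased text is an assumption, as are the function name `split_duplicates` and the dictionary keys `title` and `text`.

```python
import hashlib


def split_duplicates(articles):
    """Return (unique, duplicates) for a list of article dicts (hypothetical helper).

    Two articles count as duplicates when their normalized text is identical.
    Each entry in duplicates is (duplicate title, title of the first occurrence).
    """
    seen = {}
    unique, duplicates = [], []
    for article in articles:
        # Normalize whitespace and case so trivial formatting differences are ignored.
        normalized = " ".join(article["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((article["title"], seen[digest]))
        else:
            seen[digest] = article["title"]
            unique.append(article)
    return unique, duplicates
```

The returned `duplicates` list is what a caller would pass to a logger, while `unique` is what would be written to disk.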

Instructions

To use retriever.py, follow these steps:

Install Dependencies: Ensure you have all the required dependencies installed. You can do this using Poetry:
```
poetry install
```
Prepare Input Files: Place your text files exported by the Retriever system into the input folder.
Run the Script: Execute the script using the following command:
```
poetry run python retriever/retriever.py input
```
You can also pass additional options:
- --save-short-headers: Save files with short headers.
- --stop-words: Provide a string with stop words separated by '|'.
- --remove-captions: Remove captions from articles.
- --remove-copyright: Remove copyright strings from articles.
Example:
```
poetry run python retriever/retriever.py input --save-short-headers --remove-captions --remove-copyright
```
Output: The processed articles and metadata will be saved in the output folder within the input directory. The metadata will be saved as document_index.csv.
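As an illustration of the output layout described in the last step, the sketch below writes each article to a text file and collects the metadata in document_index.csv. The function name `save_corpus`, the folder name `output`, the `article_0001.txt` naming scheme, and the CSV columns are assumptions; the real script may name files and columns differently.

```python
import csv
from pathlib import Path


def save_corpus(input_folder, articles):
    """Write articles and a metadata index below input_folder (hypothetical helper)."""
    output = Path(input_folder) / "output"  # assumed folder name
    output.mkdir(parents=True, exist_ok=True)

    rows = []
    for i, article in enumerate(articles, start=1):
        filename = f"article_{i:04d}.txt"  # assumed naming scheme
        (output / filename).write_text(article["text"], encoding="utf-8")
        rows.append(
            {
                "filename": filename,
                "title": article["title"],
                "source": article["source"],
                "date": article["date"],
            }
        )

    index_path = output / "document_index.csv"
    with index_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["filename", "title", "source", "date"])
        writer.writeheader()
        writer.writerows(rows)
    return index_path
```

Keeping the file name in the index lets each metadata row be matched back to its text file during later analysis.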
