Skip to content

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.

Notifications You must be signed in to change notification settings

TTWJOE/dr-x-nlp-pipeline

Repository files navigation

Dr. X NLP Pipeline 🕵️‍♂️🔍

NLP Pipeline Releases

Welcome to the Dr. X NLP Pipeline repository! This project offers a fully offline natural language processing (NLP) pipeline designed for extracting, chunking, embedding, querying, summarizing, and translating research documents using local large language models (LLMs).

Inspired by the fictional mystery of Dr. X, this system supports multiple file formats and features local retrieval-augmented generation (RAG) based question and answer capabilities, Arabic translation, and ROUGE-based summarization. Best of all, it operates entirely without cloud dependencies.

Table of Contents

Features

  • Offline Processing: Work without the need for internet access.
  • Multi-Format Support: Handle various document types seamlessly.
  • Local RAG-Based Q&A: Efficiently query your documents.
  • Arabic Translation: Translate content accurately.
  • ROUGE-Based Summarization: Generate concise summaries.
  • Modular Architecture: Easily extend and modify components.

Installation

To set up the Dr. X NLP Pipeline on your local machine, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/TTWJOE/dr-x-nlp-pipeline.git
    cd dr-x-nlp-pipeline
  2. Install Dependencies: Make sure you have Python 3.8 or higher installed. Then, run:

    pip install -r requirements.txt
  3. Download the Latest Release: Visit our Releases section to download the latest version. Follow the instructions in the release notes to execute the pipeline.

Usage

After installation, you can start using the Dr. X NLP Pipeline with the following command:

python main.py --input <path_to_your_document> --output <desired_output_path>

Example

python main.py --input documents/research_paper.pdf --output results/summary.txt

This command will process the specified document and save the summary to the desired output path.

Components

The Dr. X NLP Pipeline consists of several key components:

  1. Extractor: This module extracts text from various document formats, including PDFs, DOCX, and plain text files.

  2. Chunker: It divides the extracted text into manageable chunks for further processing.

  3. Embedder: This component generates embeddings for each text chunk, allowing for semantic understanding.

  4. Query Engine: Users can query the processed documents using natural language.

  5. Summarizer: This module generates concise summaries using ROUGE metrics to ensure quality.

  6. Translator: It translates content into Arabic, ensuring accurate and context-aware translations.

Supported Formats

The Dr. X NLP Pipeline supports the following file formats:

  • PDF
  • DOCX
  • TXT
  • HTML

Feel free to add more formats by extending the Extractor module.

Contributing

We welcome contributions! If you would like to contribute to the Dr. X NLP Pipeline, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Make your changes and commit them.
  4. Push your branch and create a pull request.

Please ensure that your code follows the project's coding standards and includes relevant tests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or feedback, please reach out to us at contact@dr-x-nlp-pipeline.com.

Releases

To download the latest release, visit our Releases section and follow the instructions provided.

Conclusion

Thank you for your interest in the Dr. X NLP Pipeline! We hope you find it useful for your research and document processing needs. Your feedback and contributions are greatly appreciated.

Happy processing!

About

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages