Dr. X NLP Pipeline 🕵️‍♂️🔍

Welcome to the Dr. X NLP Pipeline repository! This project offers a fully offline natural language processing (NLP) pipeline designed for extracting, chunking, embedding, querying, summarizing, and translating research documents using local large language models (LLMs).

Inspired by the fictional mystery of Dr. X, this system supports multiple file formats and features local retrieval-augmented generation (RAG) based question and answer capabilities, Arabic translation, and ROUGE-based summarization. Best of all, it operates entirely without cloud dependencies.

Features

Offline Processing: Work without the need for internet access.
Multi-Format Support: Handle various document types seamlessly.
Local RAG-Based Q&A: Efficiently query your documents.
Arabic Translation: Translate content accurately.
ROUGE-Based Summarization: Generate concise summaries.
Modular Architecture: Easily extend and modify components.

Installation

To set up the Dr. X NLP Pipeline on your local machine, follow these steps:

Clone the Repository:

git clone https://github.com/TTWJOE/dr-x-nlp-pipeline.git
cd dr-x-nlp-pipeline

Install Dependencies: Make sure you have Python 3.8 or higher installed. Then, run:
```
pip install -r requirements.txt
```
Download the Latest Release: Visit our Releases section to download the latest version. Follow the instructions in the release notes to execute the pipeline.

Usage

After installation, you can start using the Dr. X NLP Pipeline with the following command:

python main.py --input <path_to_your_document> --output <desired_output_path>

Example

python main.py --input documents/research_paper.pdf --output results/summary.txt

This command will process the specified document and save the summary to the desired output path.

Components

The Dr. X NLP Pipeline consists of several key components:

Extractor: This module extracts text from various document formats, including PDFs, DOCX, and plain text files.
Chunker: It divides the extracted text into manageable chunks for further processing.
Embedder: This component generates embeddings for each text chunk, allowing for semantic understanding.
Query Engine: Users can query the processed documents using natural language.
Summarizer: This module generates concise summaries using ROUGE metrics to ensure quality.
Translator: It translates content into Arabic, ensuring accurate and context-aware translations.

Supported Formats

The Dr. X NLP Pipeline supports the following file formats:

PDF
DOCX
TXT
HTML

Feel free to add more formats by extending the Extractor module.

Contributing

We welcome contributions! If you would like to contribute to the Dr. X NLP Pipeline, please follow these steps:

Fork the repository.
Create a new branch for your feature or bug fix.
Make your changes and commit them.
Push your branch and create a pull request.

Please ensure that your code follows the project's coding standards and includes relevant tests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or feedback, please reach out to us at contact@dr-x-nlp-pipeline.com.

Releases

To download the latest release, visit our Releases section and follow the instructions provided.

Conclusion

Thank you for your interest in the Dr. X NLP Pipeline! We hope you find it useful for your research and document processing needs. Your feedback and contributions are greatly appreciated.

Happy processing!

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
files		files
README.md		README.md
chunker.py		chunker.py
embedding_pipeline.py		embedding_pipeline.py
file_reader.py		file_reader.py
rag_qa_system.py		rag_qa_system.py
report.pdf		report.pdf
requirements.txt		requirements.txt
summarizer.py		summarizer.py
translation_utils.py		translation_utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Dr. X NLP Pipeline 🕵️‍♂️🔍

Table of Contents

Features

Installation

Usage

Example

Components

Supported Formats

Contributing

License

Contact

Releases

Conclusion

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

TTWJOE/dr-x-nlp-pipeline

Folders and files

Latest commit

History

Repository files navigation

Dr. X NLP Pipeline 🕵️‍♂️🔍

Table of Contents

Features

Installation

Usage

Example

Components

Supported Formats

Contributing

License

Contact

Releases

Conclusion

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages