Welcome to the Dr. X NLP Pipeline repository! This project offers a fully offline natural language processing (NLP) pipeline designed for extracting, chunking, embedding, querying, summarizing, and translating research documents using local large language models (LLMs).
Inspired by the fictional mystery of Dr. X, this system supports multiple file formats and features local retrieval-augmented generation (RAG) based question and answer capabilities, Arabic translation, and ROUGE-based summarization. Best of all, it operates entirely without cloud dependencies.
- Offline Processing: Work without the need for internet access.
- Multi-Format Support: Handle various document types seamlessly.
- Local RAG-Based Q&A: Efficiently query your documents.
- Arabic Translation: Translate content accurately.
- ROUGE-Based Summarization: Generate concise summaries.
- Modular Architecture: Easily extend and modify components.
To set up the Dr. X NLP Pipeline on your local machine, follow these steps:
-
Clone the Repository:
git clone https://github.com/TTWJOE/dr-x-nlp-pipeline.git cd dr-x-nlp-pipeline
-
Install Dependencies: Make sure you have Python 3.8 or higher installed. Then, run:
pip install -r requirements.txt
-
Download the Latest Release: Visit our Releases section to download the latest version. Follow the instructions in the release notes to execute the pipeline.
After installation, you can start using the Dr. X NLP Pipeline with the following command:
python main.py --input <path_to_your_document> --output <desired_output_path>
python main.py --input documents/research_paper.pdf --output results/summary.txt
This command will process the specified document and save the summary to the desired output path.
The Dr. X NLP Pipeline consists of several key components:
-
Extractor: This module extracts text from various document formats, including PDFs, DOCX, and plain text files.
-
Chunker: It divides the extracted text into manageable chunks for further processing.
-
Embedder: This component generates embeddings for each text chunk, allowing for semantic understanding.
-
Query Engine: Users can query the processed documents using natural language.
-
Summarizer: This module generates concise summaries using ROUGE metrics to ensure quality.
-
Translator: It translates content into Arabic, ensuring accurate and context-aware translations.
The Dr. X NLP Pipeline supports the following file formats:
- DOCX
- TXT
- HTML
Feel free to add more formats by extending the Extractor module.
We welcome contributions! If you would like to contribute to the Dr. X NLP Pipeline, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them.
- Push your branch and create a pull request.
Please ensure that your code follows the project's coding standards and includes relevant tests.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or feedback, please reach out to us at contact@dr-x-nlp-pipeline.com.
To download the latest release, visit our Releases section and follow the instructions provided.
Thank you for your interest in the Dr. X NLP Pipeline! We hope you find it useful for your research and document processing needs. Your feedback and contributions are greatly appreciated.
Happy processing!