
Document Classification Pipeline

This project implements an optimized document classification pipeline that processes specification documents, classifies items based on these specifications, and provides a confidence score for each classification.

Features

  • Process and store specification documents
  • Create and manage embeddings for documents and items
  • Perform similarity search using a vector store
  • Classify items using various language models (Ollama, OpenAI, Claude)
  • Batch processing and caching for improved performance
  • Flexible configuration options
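The batch-processing and caching features can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation; the `classify` function and its labels are stand-ins for a real language-model call:

```python
from functools import lru_cache

def classify(item: str) -> str:
    # Stand-in for a real language-model call (hypothetical labels).
    return "category-A" if "steel" in item.lower() else "category-B"

@lru_cache(maxsize=1024)
def classify_cached(item: str) -> str:
    # Repeated items are answered from the cache instead of
    # re-invoking the expensive model call.
    return classify(item)

def classify_batch(items, batch_size=8):
    # Walk the input in fixed-size batches; each batch could be sent
    # to the model in one request to cut round trips.
    results = []
    for i in range(0, len(items), batch_size):
        results.extend(classify_cached(it) for it in items[i:i + batch_size])
    return results
```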

Setup

  1. Clone the repository:

    git clone https://github.com/Muhanad-husn/SpecClass.git
    cd SpecClass
  2. Set up a virtual Conda environment:

    conda create --name doc_classification python=3.12.4
    conda activate doc_classification
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. Set up the NLM-Ingestor Server:

    • Ensure Docker is installed and the Docker daemon is running.
    • Pull the Docker image:
      docker pull ghcr.io/nlmatics/nlm-ingestor:latest
    • Run the container:
      docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest

    Note: The Docker image is meant for development environments only. For production, users must set up their own server configuration.

    The nlm-ingestor server uses a modified version of Apache Tika for document parsing. It can be deployed locally and provides an easy way to parse and intelligently chunk various document types, including HTML, PDF, Markdown, and plain text. OCR can optionally be enabled; refer to the nlm-ingestor documentation for more details.

    The Docker image comes from the nlm-ingestor repository.
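Once the container is running, a client submits a document to it over HTTP as a multipart file upload. The sketch below hand-rolls the multipart body with the standard library only; the endpoint path follows the llmsherpa client convention and should be verified against the nlm-ingestor documentation for your image version:

```python
import uuid
from pathlib import Path
from urllib import request

# Port 5010 matches the `docker run -p 5010:5001` mapping above; the
# endpoint path is an assumption taken from the llmsherpa client.
INGESTOR_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"

def build_multipart(filename: str, payload: bytes) -> tuple[bytes, str]:
    # Hand-rolled multipart/form-data body with a single "file" part.
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def parse_document(path: str) -> bytes:
    # Requires the container to be running on localhost:5010.
    data = Path(path).read_bytes()
    body, content_type = build_multipart(Path(path).name, data)
    req = request.Request(
        INGESTOR_URL, data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.read()
```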

  5. Configure the application:

    • Copy the .env.example file to .env and fill in your API keys:
      cp .env.example .env
    • Edit the config/config.yaml file to adjust settings as needed.
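For illustration, a .env file of KEY=VALUE lines can be loaded with nothing but the standard library. The project itself may rely on a package such as python-dotenv instead, and the key names in the test are hypothetical:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    # Minimal .env loader: KEY=VALUE lines, blank lines and '#'
    # comments ignored; existing environment variables win.
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```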

Usage

Preparation

  1. Place your specification documents in the data/specifications directory.

    • Supported formats: PDF, DOCX, HTML, MD, TXT
  2. Place the items to be classified in the data/input directory.

    • Supported formats: CSV, XLSX

Configuration

  1. Open config/config.yaml and adjust settings as needed.

    • Note: The default number of documents retrieved from the vector store for each item is 5. Adjust this based on your specification book's content.
  2. Default models (can be overridden with --model-name):

    • OpenAI: gpt-4o-mini
    • Ollama: phi-3.5-8b
    • Claude: claude-3.5-sonnet
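The retrieval step behind the "documents retrieved from the vector store" setting can be illustrated with a plain cosine-similarity top-k search. This is a self-contained sketch; a real vector store uses an indexed library rather than a linear scan:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, doc_vecs, k=5):
    # k=5 mirrors the default retrieval count mentioned above.
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```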

Running the Pipeline

Execute the following command:

python pipeline.py [--reset] [--model-type {ollama|openai|claude}] [--model-name MODEL_NAME]

Options:

  • --reset: Reset the vector store before processing
  • --model-type: Specify the model type (ollama, openai, or claude)
  • --model-name: Override the default model name
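The options above map naturally onto an argparse parser. The following is a sketch of how pipeline.py might define them, not the project's actual code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the command-line options listed above.
    parser = argparse.ArgumentParser(
        description="Document classification pipeline")
    parser.add_argument("--reset", action="store_true",
                        help="Reset the vector store before processing")
    parser.add_argument("--model-type",
                        choices=["ollama", "openai", "claude"],
                        help="Language model backend to use")
    parser.add_argument("--model-name",
                        help="Override the default model name")
    return parser
```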

Interactive Prompts

The application will ask you to provide:

  1. Model type to use (if not specified in command line)
  2. Sheet name and column containing items to classify (for Excel input)
  3. Short description of the specification book
  4. Description of items to be classified
  5. Any specifications that should be weighted more heavily in case of equal similarity

Output

The pipeline will:

  1. Process the specification documents
  2. Classify the input items
  3. Output results to a CSV file in the specified output directory

For detailed logs and error messages, check the logs directory.
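The final CSV-writing step might look like the sketch below; the column names are illustrative, not the pipeline's actual output schema:

```python
import csv
import io

def write_results(rows, fh):
    # Each row pairs an input item with its classification and the
    # confidence score produced by the model (hypothetical columns).
    writer = csv.DictWriter(
        fh, fieldnames=["item", "classification", "confidence"])
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_results(
    [{"item": "steel beam", "classification": "Structural works",
      "confidence": 0.92}],
    buf,
)
```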

Project Structure

  • pipeline.py: Main entry point for running the classification pipeline.
  • src/: Contains the core components of the pipeline.
  • models/: Defines the base agent and language model interfaces.
  • utils/: Utility functions for configuration, logging, and file handling.
  • config/: Configuration files.
  • data/: Input and output data directories.
  • logs/: Log files.

Contributing

Contributions are welcome! Please fork the repository and submit a Pull Request.

License

Apache 2.0
