A proof-of-concept Retrieval-Augmented Generation (RAG) system that demonstrates:
- Document ingestion from multiple formats (PDF, HTML, TXT, MD, DOCX)
- Text chunking and embedding generation
- Local vector storage using ChromaDB
- Query API with FastAPI for document search and chat functionality
- Multi-format Document Support: Process PDF, HTML, TXT, Markdown, and DOCX files
- Flexible Embeddings: Support for both Sentence Transformers (local) and OpenAI embeddings
- Local Vector Database: ChromaDB for efficient similarity search
- RESTful API: FastAPI-based endpoints for document ingestion, search, and chat
- Docker Support: Fully containerized for easy deployment
- Test Coverage: Comprehensive test suite for core functionality
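The chunking step splits each document into fixed-size windows with overlap so that context is not lost at chunk boundaries. A minimal sketch of the idea (the `chunk_text` helper is illustrative, not the project's actual splitter; the defaults mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` settings in the configuration section):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

pieces = chunk_text("abcdefghij" * 300, chunk_size=1000, chunk_overlap=200)
# consecutive chunks share a 200-character overlap region
```

Real splitters (e.g. recursive character splitters) also try to break on sentence or paragraph boundaries, but the window-plus-overlap arithmetic is the same.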
- Clone this repository
- Replace the documents in `data/documents/` with your team's documents (PDF, DOCX, TXT, MD, HTML)
- Configure the environment by copying `.env.example` to `.env` and adding your OpenAI API key (optional)
- Run with Docker: `docker-compose up --build`
- Start querying your documents at `http://localhost:8000`

Your documents will be automatically processed and ready for search and chat!
- Docker and Docker Compose
- Python 3.12+ (for local development)
- OpenAI API key (optional, for chat functionality)
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/docker-rag-test.git
  cd docker-rag-test
  ```

- Create a `.env` file from the example:

  ```bash
  cp .env.example .env
  # Edit .env and add your OpenAI API key if you want chat functionality
  ```

- Add your documents to the `data/documents/` directory

- Build and run with Docker Compose:

  ```bash
  docker-compose up --build
  ```

The API will be available at `http://localhost:8000`. Documents in `data/documents/` will be automatically ingested on startup.
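A minimal `docker-compose.yml` for this kind of setup might look like the following sketch. The service name, volume names, and ChromaDB path are assumptions for illustration, not the project's actual configuration:

```yaml
services:
  rag-api:
    build: .
    ports:
      - "8000:8000"       # FastAPI served on localhost:8000
    env_file:
      - .env              # OPENAI_API_KEY, CHUNK_SIZE, etc.
    volumes:
      - ./data/documents:/app/data/documents   # auto-ingest directory
      - chroma-data:/app/chroma_db             # persist vectors across restarts

volumes:
  chroma-data:
```

Mounting the documents directory lets you swap in new files without rebuilding the image; the named volume keeps the ChromaDB index across container restarts.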
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  uvicorn src.api.main:app --reload
  ```

Check that the API is up:

```bash
curl http://localhost:8000/
```
Documents in `data/documents/` are automatically ingested on startup. To manually ingest additional documents:
```bash
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"directory_path": "/app/data/documents"}'
```

Upload individual files:

```bash
curl -X POST http://localhost:8000/upload \
  -F "files=@document1.pdf" \
  -F "files=@document2.txt"
```

Search the indexed documents:

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "k": 5
  }'
```

Chat using retrieved document context:

```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is machine learning?",
    "k": 3,
    "use_context": true
  }'
```

Count the stored document chunks:

```bash
curl http://localhost:8000/documents/count
```

Delete all stored documents:

```bash
curl -X DELETE http://localhost:8000/documents
```
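The same `/query` and `/chat` calls can be made from Python with only the standard library. The payload fields below mirror the curl examples above; the `build_request` helper is illustrative, not part of the project:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request matching the API's curl examples."""
    return urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

query_req = build_request("/query", {"query": "machine learning", "k": 5})
chat_req = build_request(
    "/chat",
    {"message": "What is machine learning?", "k": 3, "use_context": True},
)
# With the server running, send with: urllib.request.urlopen(query_req)
```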
```
docker-rag-test/
├── src/
│   ├── ingestion/         # Document loading and text splitting
│   ├── embedding/         # Embedding generation (OpenAI/Sentence Transformers)
│   ├── storage/           # Vector database interface (ChromaDB)
│   └── api/               # FastAPI application and endpoints
├── tests/                 # Test suite
├── data/
│   └── documents/         # Place your documents here for auto-ingestion
├── requirements.txt       # Python dependencies
├── docker-compose.yml     # Docker Compose configuration
└── .env.example           # Environment variables template
```
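The layers above compose into a simple pipeline: ingestion produces chunks, the embedder turns them into vectors, and the store indexes them for retrieval. A toy sketch of how the interfaces fit together (all class names here are illustrative stand-ins, not the real module APIs):

```python
from typing import Protocol

class Embedder(Protocol):
    """Interface both embedder backends satisfy."""
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class ToyEmbedder:
    """Stand-in for a real SentenceTransformer/OpenAI embedder."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        # Trivial 4-dimensional "embedding" for demonstration only
        return [[float(ord(c)) for c in t[:4]] for t in texts]

class InMemoryStore:
    """Stand-in for the ChromaDB-backed vector store."""
    def __init__(self, embedder: Embedder) -> None:
        self.embedder = embedder
        self.docs: list[tuple[str, list[float]]] = []

    def add(self, chunks: list[str]) -> None:
        for chunk, vec in zip(chunks, self.embedder.embed(chunks)):
            self.docs.append((chunk, vec))

    def count(self) -> int:
        return len(self.docs)

store = InMemoryStore(ToyEmbedder())
store.add(["chunk one", "chunk two"])
```

Because the store only depends on the `Embedder` protocol, swapping the local Sentence Transformers backend for OpenAI embeddings is a configuration change, not a code change.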
Environment variables can be set in the `.env` file:

- `OPENAI_API_KEY`: Your OpenAI API key (required for chat functionality)
- `CHUNK_SIZE`: Size of text chunks (default: 1000)
- `CHUNK_OVERLAP`: Overlap between chunks (default: 200)
- `EMBEDDER_TYPE`: "sentence-transformer" or "openai" (default: "sentence-transformer")
- `CHROMA_PERSIST_DIRECTORY`: Directory for ChromaDB persistence
- `AUTO_INGEST_ON_STARTUP`: Enable/disable auto-ingestion on startup (default: true)
- `AUTO_INGEST_DIRECTORY`: Directory to auto-ingest documents from (default: /app/data/documents)
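Inside the application, these variables are typically resolved once at startup with defaults matching the list above. A hedged sketch (the `Settings` class and `load_settings` function are hypothetical, and the ChromaDB directory default is an assumption since the README does not state one):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Settings:
    """Application settings resolved from environment variables."""
    openai_api_key: Optional[str]
    chunk_size: int
    chunk_overlap: int
    embedder_type: str
    chroma_persist_directory: str
    auto_ingest_on_startup: bool
    auto_ingest_directory: str

def load_settings() -> Settings:
    return Settings(
        openai_api_key=os.getenv("OPENAI_API_KEY"),  # None -> chat disabled
        chunk_size=int(os.getenv("CHUNK_SIZE", "1000")),
        chunk_overlap=int(os.getenv("CHUNK_OVERLAP", "200")),
        embedder_type=os.getenv("EMBEDDER_TYPE", "sentence-transformer"),
        chroma_persist_directory=os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db"),  # assumed default
        auto_ingest_on_startup=os.getenv("AUTO_INGEST_ON_STARTUP", "true").lower() == "true",
        auto_ingest_directory=os.getenv("AUTO_INGEST_DIRECTORY", "/app/data/documents"),
    )

settings = load_settings()
```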
Run the test suite:

```bash
pytest
```

Run with coverage:

```bash
pytest --cov=src tests/
```
The project follows these principles:
- Test-Driven Development: Tests are written for core functionality
- Modular Design: Clear separation between ingestion, embedding, storage, and API layers
- Docker-First: Fully containerized for consistent environments
- Type Safety: Uses Pydantic for data validation
- Async Support: FastAPI with async endpoints for better performance
- Fix SentenceTransformerEmbedder api_key parameter error in rag_service.py
- Test Streamlit frontend functionality at http://localhost:8501
- Verify that documents in data/documents/ are being ingested correctly
- Create .env.example file with proper template variables
While this is a proof-of-concept with local storage, the architecture supports easy migration to:
- Cloud vector databases (AWS S3 Vector Engine, Pinecone, Qdrant)
- Serverless deployment (AWS Lambda)
- Container orchestration (AWS ECS/Fargate)
- Managed API Gateway integration
This project was built following these principles:
- Use a widely supported, compatible tech stack
- Test-driven development
- Explicit folder structure separating resources from code
- Docker-first approach for local development
- Design for easy cloud migration
- Follow best practices
- Version control with regular commits