RAG API with ChromaDB and Web Scraping

A Retrieval-Augmented Generation (RAG) API built with FastAPI, ChromaDB, and asynchronous web scraping. It lets you build a knowledge base from web content and query it in natural language.

🚀 Features

FastAPI-based RESTful API with interactive Swagger documentation
ChromaDB vector database for efficient similarity search
Async web scraping with DuckDuckGo search integration
Direct URL crawling for specific content ingestion
Sentence Transformers for high-quality embeddings
Flan-T5 model for text generation
Apple Silicon GPU acceleration (MPS support)
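
To make the similarity search and generation features concrete, here is a toy, dependency-free sketch of the retrieve-then-prompt idea behind RAG. It is not the project's code: the hand-written vectors stand in for Sentence Transformers embeddings, the sorted list stands in for a ChromaDB query, and the prompt wording and the helper names (`cosine_similarity`, `top_k`, `build_prompt`) are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          docs: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(docs,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Combine retrieved passages and the question into one prompt string.

    The wording is illustrative only; the real prompt may differ.
    """
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

In the real system the vectors would come from the embedding model and the ranking from the vector database; the sketch only shows why ingesting relevant content first matters for answer quality.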

📋 Prerequisites

Python 3.8+
macOS with Apple Silicon (for MPS acceleration) or any system with CPU support
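
A minimal sketch of how a device could be chosen under these prerequisites: use MPS when PyTorch reports it as available, otherwise fall back to the CPU. The function name `pick_device` is hypothetical, and the project's own selection logic may differ.

```python
def pick_device() -> str:
    """Return "mps" on Apple Silicon with a working MPS backend, else "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed yet; nothing to accelerate.
        return "cpu"
    # Older torch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The returned string can be passed wherever PyTorch accepts a device name, for example `model.to(pick_device())`.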

🛠️ Installation

Clone the repository

git clone https://github.com/AnanyaBanerjee01/rag-web-api.git
cd rag-web-api

Create a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies

pip install chromadb sentence-transformers transformers torch tqdm duckduckgo-search beautifulsoup4 aiohttp fastapi uvicorn

🚀 Quick Start

Start the API server
```
python rag_chroma_api.py
```
Access the interactive documentation by opening http://127.0.0.1:8000/docs in your browser
Test the health check
```
curl http://127.0.0.1:8000/
```

📚 API Endpoints

Health Check

GET / - Check if the API is running
Response: {"status": "healthy", "message": "RAG API is running"}

Question Answering

GET /ask?query=your_question - Ask questions to the RAG system
Parameters:
- query (required): Your question as a string

Example:

curl "http://127.0.0.1:8000/ask?query=What%20is%20machine%20learning?"
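
The same request can be made from Python. The sketch below uses only the standard library and assumes the server is running locally as in Quick Start; the helper names `build_ask_url` and `ask` are illustrative and not part of the project.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the API is served at the address used throughout this README.
BASE_URL = "http://127.0.0.1:8000"


def build_ask_url(query: str, base_url: str = BASE_URL) -> str:
    """Return the /ask URL with the question percent-encoded."""
    return f"{base_url}/ask?{urlencode({'query': query}, quote_via=quote)}"


def ask(query: str, base_url: str = BASE_URL, timeout: float = 60.0) -> dict:
    """Send the question to /ask and decode the JSON response body."""
    with urlopen(build_ask_url(query, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("What is machine learning?")` against a running server should return a dictionary shaped like the responses documented under Response Formats. Encoding the query in code avoids hand-escaping spaces and punctuation.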

Web Content Ingestion

Search and Ingest

POST /refresh_web - Search the web and add content to knowledge base

Body:

{
  "query": "machine learning basics",
  "max_results": 5
}

Example:

curl -X POST "http://127.0.0.1:8000/refresh_web" \
  -H "Content-Type: application/json" \
  -d '{"query": "artificial intelligence", "max_results": 3}'
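
A Python equivalent of the curl call, sketched with the standard library. `build_refresh_request` is a hypothetical helper; it only constructs the request, so it can be inspected before anything is sent.

```python
import json
from urllib.request import Request

# Assumption: the API is served at the address used throughout this README.
BASE_URL = "http://127.0.0.1:8000"


def build_refresh_request(query: str, max_results: int,
                          base_url: str = BASE_URL) -> Request:
    """Build a POST /refresh_web request with a JSON body."""
    body = json.dumps({"query": query, "max_results": max_results})
    return Request(
        f"{base_url}/refresh_web",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` and decode the response as JSON. Searching and scraping can take a while, so a generous timeout is advisable.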

Direct URL Crawling

POST /crawl - Crawl specific URLs and add to knowledge base

Body:

{
  "urls": [
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://example.com/article"
  ],
  "descriptions": [
    "Machine learning overview",
    "Article description"
  ]
}

Example:

curl -X POST "http://127.0.0.1:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Data_science"],
    "descriptions": ["Data science overview"]
  }'
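
Before posting to /crawl it can help to validate the body on the client side. The sketch below assumes that each description is paired positionally with a URL, which the example bodies suggest but this README does not state explicitly; `build_crawl_payload` is a hypothetical helper.

```python
import json
from typing import List
from urllib.parse import urlparse


def build_crawl_payload(urls: List[str], descriptions: List[str]) -> str:
    """Return the JSON body for POST /crawl after basic sanity checks."""
    # Assumption: one description per URL, matched by position.
    if len(urls) != len(descriptions):
        raise ValueError("urls and descriptions must have the same length")
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    return json.dumps({"urls": list(urls), "descriptions": list(descriptions)})
```

The returned string can be sent with any HTTP client using the `Content-Type: application/json` header shown in the curl example.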

🔧 Configuration

Model Configuration

The system uses the following models by default:

Embedding Model: all-MiniLM-L6-v2
Generation Model: google/flan-t5-base

Database Configuration

ChromaDB stores vectors in the ./chromadb_api/ directory
Collection Name: rag_api

Customization

You can modify the RAGChroma class initialization in rag_chroma_api.py:

rag = RAGChroma(
    collection_name="your_collection",
    persist_directory="./your_db_path",
    embedding_model_name="your-embedding-model",
    generation_model="your-generation-model"
)
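
One way to avoid editing the source for each environment is to read these four arguments from environment variables. This is only a sketch: the variable names (`RAG_COLLECTION` and so on) and the helper `load_rag_settings` are hypothetical and not defined by the project, while the defaults are the values listed in the Configuration section.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults taken from the Configuration section of this README.
DEFAULTS: Dict[str, str] = {
    "collection_name": "rag_api",
    "persist_directory": "./chromadb_api",
    "embedding_model_name": "all-MiniLM-L6-v2",
    "generation_model": "google/flan-t5-base",
}

# Hypothetical environment variable names chosen for this example.
ENV_NAMES: Dict[str, str] = {
    "collection_name": "RAG_COLLECTION",
    "persist_directory": "RAG_PERSIST_DIR",
    "embedding_model_name": "RAG_EMBEDDING_MODEL",
    "generation_model": "RAG_GENERATION_MODEL",
}


def load_rag_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return keyword arguments for RAGChroma, overriding defaults from env."""
    source = os.environ if env is None else env
    return {key: source.get(ENV_NAMES[key], default)
            for key, default in DEFAULTS.items()}
```

In rag_chroma_api.py the result could then be unpacked as `rag = RAGChroma(**load_rag_settings())`.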

📁 Project Structure

RAG/
├── rag_chroma_api.py      # Main FastAPI application
├── web_ingestor.py        # Async web scraping utilities
├── chromadb_api/          # ChromaDB database files
│   └── chroma.sqlite3
├── __pycache__/           # Python cache files
├── venv/                  # Virtual environment
└── README.md              # This file

🔍 Usage Examples

1. Basic Question Answering

# Ask a simple question
curl "http://127.0.0.1:8000/ask?query=What%20is%20Python?"

2. Add Web Content and Query

# First, add some content about Python
curl -X POST "http://127.0.0.1:8000/refresh_web" \
  -H "Content-Type: application/json" \
  -d '{"query": "Python programming language", "max_results": 3}'

# Then ask a question about Python
curl "http://127.0.0.1:8000/ask?query=What%20are%20Python%27s%20main%20features?"

3. Crawl Specific URLs

# Add specific documentation
curl -X POST "http://127.0.0.1:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://docs.python.org/3/tutorial/introduction.html",
      "https://realpython.com/python-basics/"
    ],
    "descriptions": [
      "Python official tutorial",
      "Python basics guide"
    ]
  }'

🧪 Testing with Swagger UI

Navigate to http://127.0.0.1:8000/docs
Expand any endpoint (e.g., /ask)
Click "Try it out"
Enter your parameters
Click "Execute"
View the response

📊 Response Formats

Successful Query Response

{
  "query": "What is machine learning?",
  "answer": "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every task."
}

Crawl Success Response

{
  "message": "Successfully crawled and ingested 2 documents.",
  "urls_processed": 2,
  "docs_ingested": 2
}

Error Response

{
  "error": "Error message here",
  "message": "Failed to generate answer"
}
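
A client can tell these shapes apart by checking for the `error` key. The sketch below assumes the bodies look exactly like the examples above; `extract_answer` is a hypothetical helper name.

```python
def extract_answer(body: dict) -> str:
    """Return the answer from an /ask response, or raise if it is an error body."""
    if "error" in body:
        # Error bodies carry a short "message" plus the underlying "error" text.
        summary = body.get("message", "Request failed")
        raise RuntimeError(f"{summary}: {body['error']}")
    return body["answer"]
```

Raising an exception for error bodies keeps the calling code from treating an error message as a generated answer.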

🚨 Troubleshooting

Common Issues

Connection Refused Error
- Ensure the server is running: python rag_chroma_api.py
- Check if port 8000 is available

Model Loading Issues
- Ensure you have sufficient RAM (models require ~2-4GB)
- Check internet connection for initial model downloads

Web Scraping Failures
- Some websites block automated requests
- Check the server logs for detailed error messages

Empty Responses
- Add content to the knowledge base first using /refresh_web or /crawl
- Ensure your query matches the ingested content topics
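
For the connection refused case, a quick way to see whether anything is listening on port 8000 is a plain TCP connection attempt. This is a generic standard-library sketch, not part of the project; `port_in_use` is a hypothetical helper.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 8000,
                timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If this returns False while the server is supposed to be running, the server is not up. If it returns True before you start the server, another process already occupies the port.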

Performance Tips

Initial Setup: First run will download models (~1-2GB)
Memory Usage: System uses ~3-4GB RAM when fully loaded
Response Time: First query takes longer due to model initialization

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Dependencies

FastAPI: Web framework for building APIs
ChromaDB: Vector database for embeddings
Sentence Transformers: For generating text embeddings
Transformers: Hugging Face transformers library
BeautifulSoup4: HTML parsing for web scraping
aiohttp: Async HTTP client
DuckDuckGo Search: Web search functionality

📞 Support

If you encounter any issues or have questions:

Check the troubleshooting section above
Review the server logs for detailed error messages
Open an issue on GitHub with detailed information about your problem

🔮 Future Enhancements

Support for additional embedding models
PDF document ingestion
User authentication and rate limiting
Batch processing endpoints
Advanced search filters
Export/import knowledge base functionality
