RAG API with ChromaDB and Web Scraping

A powerful Retrieval-Augmented Generation (RAG) API built with FastAPI, ChromaDB, and async web scraping capabilities. This system allows you to build a knowledge base from web content and query it using natural language.

Features

  • FastAPI-based RESTful API with interactive Swagger documentation
  • ChromaDB vector database for efficient similarity search
  • Async web scraping with DuckDuckGo search integration
  • Direct URL crawling for specific content ingestion
  • Sentence Transformers for high-quality embeddings
  • Flan-T5 model for text generation
  • Apple Silicon GPU acceleration (MPS support; see the device-selection sketch after this list)
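
The MPS item refers to PyTorch's Metal backend on Apple Silicon. A minimal device-selection sketch using the standard PyTorch API (the exact logic in rag_chroma_api.py may differ):

import torch

# Prefer Apple's Metal (MPS) backend when available, otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")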

Prerequisites

  • Python 3.8+
  • macOS with Apple Silicon (for MPS acceleration) or any system with CPU support

Installation

  1. Clone the repository

    git clone https://github.com/AnanyaBanerjee01/rag-web-api.git
    cd rag-web-api
  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install chromadb sentence-transformers transformers torch tqdm duckduckgo-search beautifulsoup4 aiohttp fastapi uvicorn

Quick Start

  1. Start the API server

    python rag_chroma_api.py
  2. Access the interactive documentation

    Open your browser and navigate to http://127.0.0.1:8000/docs

  3. Test the health check

    curl http://127.0.0.1:8000/

API Endpoints

Health Check

  • GET / - Check if the API is running
  • Response: {"status": "healthy", "message": "RAG API is running"}
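
For reference, a health route returning that payload takes only a few lines of FastAPI. This is a hypothetical sketch, not the actual code in rag_chroma_api.py:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def health_check():
    # Shape matches the documented health-check response.
    return {"status": "healthy", "message": "RAG API is running"}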

Question Answering

  • GET /ask?query=your_question - Ask the RAG system a question
  • Parameters:
    • query (required): Your question as a string
  • Example:
    curl "http://127.0.0.1:8000/ask?query=What%20is%20machine%20learning?"

Web Content Ingestion

Search and Ingest

  • POST /refresh_web - Search the web and add the results to the knowledge base
  • Body:
    {
      "query": "machine learning basics",
      "max_results": 5
    }
  • Example:
    curl -X POST "http://127.0.0.1:8000/refresh_web" \
      -H "Content-Type: application/json" \
      -d '{"query": "artificial intelligence", "max_results": 3}'
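
Internally, /refresh_web runs a search-then-scrape pipeline built on duckduckgo-search, aiohttp, and BeautifulSoup. The sketch below illustrates that general flow; the function names are hypothetical and the actual implementation lives in web_ingestor.py:

import asyncio

import aiohttp
from bs4 import BeautifulSoup
from duckduckgo_search import DDGS


async def fetch_text(session, url):
    # Download a page and reduce it to visible text.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
        html = await resp.text()
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)


async def search_and_scrape(query, max_results=5):
    # DuckDuckGo results are dicts with "title", "href", and "body" keys.
    urls = [r["href"] for r in DDGS().text(query, max_results=max_results)]
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_text(session, u) for u in urls))


docs = asyncio.run(search_and_scrape("machine learning basics", max_results=3))
print(f"Scraped {len(docs)} pages")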

Direct URL Crawling

  • POST /crawl - Crawl specific URLs and add their content to the knowledge base
  • Body:
    {
      "urls": [
        "https://en.wikipedia.org/wiki/Machine_learning",
        "https://example.com/article"
      ],
      "descriptions": [
        "Machine learning overview",
        "Article description"
      ]
    }
  • Example:
    curl -X POST "http://127.0.0.1:8000/crawl" \
      -H "Content-Type: application/json" \
      -d '{
        "urls": ["https://en.wikipedia.org/wiki/Data_science"],
        "descriptions": ["Data science overview"]
      }'
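
The equivalent request from Python with the requests library (assumes the server is running locally):

import requests

payload = {
    "urls": ["https://en.wikipedia.org/wiki/Data_science"],
    "descriptions": ["Data science overview"],
}
resp = requests.post("http://127.0.0.1:8000/crawl", json=payload)
resp.raise_for_status()
print(resp.json())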

Configuration

Model Configuration

The system uses the following models by default:

  • Embedding Model: all-MiniLM-L6-v2
  • Generation Model: google/flan-t5-base
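
Both models come from the Hugging Face Hub and are downloaded on first use. A standalone sketch of loading them outside the API, using the standard sentence-transformers and transformers APIs:

from sentence_transformers import SentenceTransformer
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # 384-dimensional embeddings
generator = pipeline("text2text-generation", model="google/flan-t5-base")

vectors = embedder.encode(["What is machine learning?"])
print(vectors.shape)  # (1, 384)

result = generator("Answer briefly: what is machine learning?", max_new_tokens=64)
print(result[0]["generated_text"])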

Database Configuration

  • ChromaDB stores vectors in the ./chromadb_api/ directory
  • Collection Name: rag_api
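
The persisted collection can also be opened directly with ChromaDB's client, which is handy for inspecting what has been ingested. A minimal sketch, independent of the RAGChroma wrapper (which may supply its own precomputed embeddings):

import chromadb

client = chromadb.PersistentClient(path="./chromadb_api")
collection = client.get_or_create_collection("rag_api")

print(collection.count())  # number of stored documents

# Similarity search over the stored documents.
results = collection.query(query_texts=["What is machine learning?"], n_results=3)
print(results["documents"])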

Customization

You can modify the RAGChroma class initialization in rag_chroma_api.py:

rag = RAGChroma(
    collection_name="your_collection",
    persist_directory="./your_db_path",
    embedding_model_name="your-embedding-model",
    generation_model="your-generation-model"
)

๐Ÿ“ Project Structure

RAG/
├── rag_chroma_api.py      # Main FastAPI application
├── web_ingestor.py        # Async web scraping utilities
├── chromadb_api/          # ChromaDB database files
│   └── chroma.sqlite3
├── __pycache__/           # Python cache files
├── venv/                  # Virtual environment
└── README.md              # This file

๐Ÿ” Usage Examples

1. Basic Question Answering

# Ask a simple question
curl "http://127.0.0.1:8000/ask?query=What%20is%20Python?"

2. Add Web Content and Query

# First, add some content about Python
curl -X POST "http://127.0.0.1:8000/refresh_web" \
  -H "Content-Type: application/json" \
  -d '{"query": "Python programming language", "max_results": 3}'

# Then ask a question about Python
curl "http://127.0.0.1:8000/ask?query=What%20are%20Python%27s%20main%20features?"

3. Crawl Specific URLs

# Add specific documentation
curl -X POST "http://127.0.0.1:8000/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://docs.python.org/3/tutorial/introduction.html",
      "https://realpython.com/python-basics/"
    ],
    "descriptions": [
      "Python official tutorial",
      "Python basics guide"
    ]
  }'

Testing with Swagger UI

  1. Navigate to http://127.0.0.1:8000/docs
  2. Expand any endpoint (e.g., /ask)
  3. Click "Try it out"
  4. Enter your parameters
  5. Click "Execute"
  6. View the response

Response Formats

Successful Query Response

{
  "query": "What is machine learning?",
  "answer": "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every task."
}

Crawl Success Response

{
  "message": "Successfully crawled and ingested 2 documents.",
  "urls_processed": 2,
  "docs_ingested": 2
}

Error Response

{
  "error": "Error message here",
  "message": "Failed to generate answer"
}

Troubleshooting

Common Issues

  1. Connection Refused Error

    • Ensure the server is running: python rag_chroma_api.py
    • Check if port 8000 is available
  2. Model Loading Issues

    • Ensure you have sufficient RAM (models require ~2-4GB)
    • Check internet connection for initial model downloads
  3. Web Scraping Failures

    • Some websites block automated requests
    • Check the server logs for detailed error messages
  4. Empty Responses

    • Add content to the knowledge base first using /refresh_web or /crawl
    • Ensure your query matches the ingested content topics

Performance Tips

  • Initial Setup: First run will download models (~1-2GB)
  • Memory Usage: System uses ~3-4GB RAM when fully loaded
  • Response Time: First query takes longer due to model initialization

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

Dependencies

  • FastAPI: Web framework for building APIs
  • ChromaDB: Vector database for embeddings
  • Sentence Transformers: For generating text embeddings
  • Transformers: Hugging Face transformers library
  • BeautifulSoup4: HTML parsing for web scraping
  • aiohttp: Async HTTP client
  • DuckDuckGo Search: Web search functionality

Support

If you encounter any issues or have questions:

  1. Check the troubleshooting section above
  2. Review the server logs for detailed error messages
  3. Open an issue on GitHub with detailed information about your problem

Future Enhancements

  • Support for additional embedding models
  • PDF document ingestion
  • User authentication and rate limiting
  • Batch processing endpoints
  • Advanced search filters
  • Export/import knowledge base functionality
