This API, which could serve as the backend for an application, provides a Retrieval-Augmented Generation (RAG) system integrated with Dropbox. It lets users query documents stored in Dropbox in natural language and generates responses from the document content. The system processes each document by extracting text, chunking it, generating embeddings, and storing them in a Pinecone vector database for efficient retrieval.
RAG_Dropbox/
│
├── app/                          # Main application directory
│   ├── main.py                   # FastAPI application and routes
│   ├── rag.py                    # RAG query processing logic
│   ├── dropbox_utils.py          # Dropbox API interactions
│   ├── vector_db.py              # Pinecone vector database operations
│   └── text_extraction_utils.py  # Document text extraction utilities
│
├── .env                          # Environment variables (gitignored)
├── .gitignore                    # Git ignore rules
├── poetry.lock                   # Locked dependency versions for reproducible installs
├── pyproject.toml                # Poetry dependencies
└── README.md                     # This documentation
- API Layer (main.py):
  - FastAPI endpoints for document listing and querying
  - Request/response models and error handling
- RAG (rag.py):
  - Query processing pipeline
  - Context generation and OpenAI call
  - Response formatting
- Document Processing (text_extraction_utils.py):
  - Multi-format text extraction (PDF, DOCX, PPTX)
  - OCR fallback system
  - Text chunking
- Storage Integrations:
  - Dropbox (dropbox_utils.py)
  - Pinecone vector database (vector_db.py)
- Supporting Infrastructure:
  - Environment configuration (.env)
  - Dependency management (pyproject.toml)
- Python 3.9+
- Poetry (for package management)
- Dropbox account with API access
- Pinecone account
- OpenAI API key
- Tesseract OCR (for image-based document processing)
- Clone the repository:
    git clone https://github.com/yourusername/RAG_Dropbox.git
    cd RAG_Dropbox
- Install system dependencies:
    # Ubuntu/Debian
    sudo apt install tesseract-ocr poppler-utils
    # macOS
    brew install tesseract poppler
    # Windows (using Chocolatey)
    choco install tesseract
- Install Python dependencies using Poetry:
    poetry install
- Activate the virtual environment:
    poetry shell
Create a .env file in the root directory with the following variables:
DROPBOX_ACCESS_TOKEN=your_dropbox_access_token
OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
- Dropbox Access Token:
  - Go to the Dropbox Developers Console
  - Create a new app with "Full Dropbox" access
  - Generate an access token
- OpenAI API Key:
  - Sign up at OpenAI Platform
  - Create an API key in the "API Keys" section
- Pinecone API Key:
  - Sign up at Pinecone
  - Create an index and get your API key from the dashboard
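Before starting the server, it can help to confirm that the three variables listed above are actually visible to the process. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of this project, and how the application itself loads the .env file is not shown in this README.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_VARS = ("DROPBOX_ACCESS_TOKEN", "OPENAI_API_KEY", "PINECONE_API_KEY")


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty.

    Hypothetical helper: checks the given mapping, or the process
    environment when no mapping is passed.
    """
    source = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not source.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means all three keys are present.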
- Start the FastAPI server:
    uvicorn app.main:app --reload --app-dir .
- The app will be available at http://127.0.0.1:8000
- To try out the API, open the Swagger UI at http://127.0.0.1:8000/docs
List all PDF documents in Dropbox with their processing status.
Response Example:
[
  {
    "name": "document1.pdf",
    "processed": true
  },
  {
    "name": "document2.pdf",
    "processed": false
  }
]
Query a specific document.
Request Body:
{
  "document_name": "example.pdf",
  "query": "What is the main topic of this document?"
}
Response Example:
{
  "query": "What is the main topic of this document?",
  "answer": "The document discusses the operation and maintenance of generator sets.",
  "source_document": "example.pdf",
  "relevant_chunks": [
    {
      "text": "The Caterpillar 3500 generator sets are designed for...",
      "score": 0.92
    }
  ]
}
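A client consuming the query response shown above could map it onto typed objects. This is a sketch only: the class and function names (`RelevantChunk`, `QueryResponse`, `parse_query_response`) are hypothetical, and it uses standard-library dataclasses rather than whatever models the FastAPI app defines internally. The field names come from the response example.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RelevantChunk:
    text: str
    score: float


@dataclass
class QueryResponse:
    query: str
    answer: str
    source_document: str
    relevant_chunks: List[RelevantChunk]


def parse_query_response(payload: str) -> QueryResponse:
    """Parse the JSON body of a query response into typed objects."""
    data = json.loads(payload)
    chunks = [RelevantChunk(**chunk) for chunk in data["relevant_chunks"]]
    return QueryResponse(
        query=data["query"],
        answer=data["answer"],
        source_document=data["source_document"],
        relevant_chunks=chunks,
    )
```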
- Document Processing:
  - Files are retrieved from Dropbox when needed
  - Text is extracted using appropriate methods for each file type
  - OCR is used as a fallback when direct text extraction fails
  - Text is chunked into manageable pieces (≈1000 characters)
- Vector Storage:
  - Chunks are converted to embeddings using OpenAI's text-embedding-3-small
  - Embeddings are stored in Pinecone, with each document kept in its own namespace named after the document
  - Pinecone was chosen because it is a database optimized for retrieval, is simple to use, and integrates well with cloud service providers such as GCP
- Query Processing:
  - Queries are converted to embeddings using the same model
  - The three most relevant chunks are retrieved from the document's namespace
  - The retrieved context is used to generate answers via OpenAI's GPT-3.5-turbo
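The retrieval step described above can be illustrated without any external service. The sketch below stands in for Pinecone with an in-memory list of (text, vector) pairs per namespace and ranks chunks by cosine similarity; the similarity metric, the function names, and the prompt wording are all assumptions made for illustration, not the project's actual code.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(
    query_vec: Vector, namespace: List[Tuple[str, Vector]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (text, score) pairs from one namespace."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in namespace]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def build_prompt(query: str, chunks: List[Tuple[str, float]]) -> str:
    """Join retrieved chunk texts into a context block followed by the question."""
    context = "\n\n".join(text for text, _ in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In the real system the query vector would come from the embedding model and the ranked chunks from a Pinecone query against the document's namespace; the prompt would then be sent to the chat model.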
Since no example files were provided for development, we made the following assumptions about document structure:
- Files would be either:
- Text-based (with direct text extraction possible)
- Image-based (requiring OCR for text extraction)
This led to our dual-phase processing strategy:
- First attempt direct text extraction
- Only fall back to OCR if no text is found
Benefits:
- Improved efficiency by avoiding unnecessary OCR processing
- Faster processing for text-based documents
- Automatic handling of mixed-content documents
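The dual-phase strategy above amounts to a simple fallback. The sketch below takes the two extractors as parameters so it stays independent of any particular PDF or OCR library; the function name and the extractor signatures are hypothetical.

```python
from typing import Callable


def extract_text_dual_phase(
    path: str,
    extract_direct: Callable[[str], str],
    extract_ocr: Callable[[str], str],
) -> str:
    """Try direct text extraction first; run OCR only if no text was found."""
    text = extract_direct(path)
    if text and text.strip():
        return text
    # Direct extraction produced nothing usable, so fall back to OCR.
    return extract_ocr(path)
```

Because OCR runs only when the direct pass returns nothing but whitespace, text-based documents never pay the OCR cost.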
- PowerPoint Image Extraction:
  - The current implementation cannot extract text from images embedded in PowerPoint files
  - This limitation appears to be common: even advanced AI tools (ChatGPT, DeepSeek) struggle with this task
  - Text in native PowerPoint text elements is still extracted successfully
- OCR Dependencies:
  - The system requires Tesseract OCR to be installed for image-based processing
  - OCR quality depends on image clarity and resolution
- Mixed Image and Text Files:
  - If a file contains both regular text and images with text in them, the system extracts only the regular text
  - The naive solution would be to run OCR on every file, but we chose the dual-phase processing strategy described above instead
- Performance Considerations:
  - The first query for a document will be slower as it needs to process the document
  - Subsequent queries will be faster as they use the pre-processed embeddings
  - Document processing includes a 10-second delay to ensure Pinecone availability
- File Support:
  - Currently supports PDF, DOCX, and PPT/PPTX files
  - PDFs with scanned images will be processed using OCR
  - Complex PowerPoint files may have limited text extraction accuracy
- Temporary Files:
  - Downloaded files are stored in a temp_downloads directory
  - These files are automatically deleted after processing
- Additional Dependencies:
  - For successful PDF processing, ensure poppler-utils is installed
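The temporary-file handling noted above (download into temp_downloads, delete after processing) could look roughly like this. It is a sketch under assumptions: the function name and its parameters are hypothetical, and the file content is passed in as bytes rather than fetched from Dropbox.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def process_and_cleanup(
    content: bytes,
    filename: str,
    process: Callable[[Path], T],
    temp_dir: str = "temp_downloads",
) -> T:
    """Write content to a temporary file, process it, then always delete it."""
    directory = Path(temp_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Keep only the base name so the file cannot land outside temp_dir.
    local_path = directory / Path(filename).name
    local_path.write_bytes(content)
    try:
        return process(local_path)
    finally:
        # Runs whether processing succeeded or raised.
        local_path.unlink(missing_ok=True)
```

Putting the deletion in a `finally` block means a failed extraction does not leave downloaded files behind.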
- shape.type not implemented:
  - This method from the pptx package plays a crucial role in extracting images from PowerPoint files. On closer inspection, it appears not to be implemented in the original package, which may be why extracting images from slides was so difficult.
  - This might be a good alternative proposal for extracting images from PowerPoint files, but there was no time to try it out.
- Chunking:
  - Chunking concatenates paragraphs until the chunk would exceed 1000 characters; a single paragraph longer than 1000 characters becomes a chunk on its own. In a real-world scenario, I would recommend recursive chunking with overlap.
- No Tests:
  - Due to time constraints, unit and integration tests were not implemented.
- No Front End:
  - Due to time constraints, a front end was not implemented.
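The chunking behaviour described in the notes above can be sketched as follows. This is one reading of that description, not the code in text_extraction_utils.py: the function name, the blank-line paragraph separator, and the choice to close a chunk before it would pass the limit are all assumptions.

```python
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 1000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; it may exceed max_chars by itself.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Note that this version never splits inside a paragraph and adds no overlap between chunks, which is the gap the recursive-chunking recommendation is meant to address.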
You can test the API endpoints using FastAPI's built-in docs interface at http://127.0.0.1:8000/docs
- Add support for more file types (e.g., TXT, HTML)
- Implement document versioning
- Add user authentication
- Improve OCR for PowerPoint files
- Add batch processing for multiple documents
- Use document IDs as Pinecone namespace names
- Document not found errors:
  - Ensure the document exists in your Dropbox root folder
  - Check that the Dropbox access token has the proper permissions
  - Regenerate the Dropbox access token if it has expired (tokens generated in the App Console are short-lived)
- OCR issues:
  - Verify Tesseract OCR is properly installed
  - Check that the system PATH includes the Tesseract executable
- Pinecone connection problems:
  - Verify your Pinecone API key
  - Check that the index name matches your Pinecone configuration
  - Ensure your Pinecone index is in the correct region (us-east-1)
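For the OCR checks above, a quick way to see whether the required executables are on the PATH is the standard library's `shutil.which`. The helper name below is hypothetical; `tesseract` is the Tesseract command and `pdftoppm` is one of the tools shipped with Poppler.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_executables(
    names: Iterable[str] = ("tesseract", "pdftoppm"),
) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}
```

Any entry that maps to `None` points to a missing installation or a PATH that does not include it.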