A powerful Streamlit application that intelligently extracts and processes content from various file formats using advanced AI techniques. This tool combines structured text extraction with OCR capabilities and provides intelligent content reorganization and question-answering features. Try it live
Smart Content Extraction is designed to handle diverse file formats and extract meaningful content using a two-tier approach:
- Structured Extraction: Uses MarkItDown for direct text extraction from supported formats
- OCR Fallback: Employs LlamaParse for optical character recognition when structured extraction fails
- AI Enhancement: Leverages OpenAI's GPT models for content reorganization and intelligent Q&A
- 📂 Universal File Support: Works with multiple file formats including PDFs, images, documents, and more
- 🔍 Smart Extraction: Intelligent fallback from structured to OCR-based extraction
- 🧹 Content Reorganization: AI-powered content restructuring for better readability
- 💬 Interactive Q&A: Ask questions about your extracted content using RAG (Retrieval-Augmented Generation)
- ⬇️ Export Options: Download reorganized content as text files
- 📊 Token Counting: Monitor content size for API usage optimization
- Python 3.7+
- API keys for:
- OpenAI API
- LlamaParse API
-
Clone the repository:
git clone https://github.com/AhmedZeyadTareq/Smart-markdown-Extractor.git cd Smart-markdown-Extractor python -m venv venv venv\Scripts\activate
-
Install dependencies:
pip install -r requirements.txt
-
Set up .env file with API keys:
OPENAI_API_KEY=your-openai-api-key LLAMA_API_PARSE=your-llamaparse-api-key
-
Run the application:
streamlit run app.py
streamlit
openai
llama-parse
markitdown
pillow
tiktoken
python-dotenv
- Follow the installation steps above
- Run
streamlit run app.py
- Open your browser to
http://localhost:8501
Access the live application at: Try it live
from app import convert_file, reorganize_markdown, rag
md_content = convert_file("document.pdf")
organized_md = reorganize_markdown(md_content)
answer = rag(organized_md, "What is this document about?")
print(answer)
- Upload File: Click "📂 Choose File" and select your document
- Extract Content: Click "Start 🔁" to begin extraction
- Reorganize (Optional): Click "🧹 Reorganize Content" for AI-enhanced formatting
- Ask Questions: Use the text input to ask questions about your content
- Download: Save the reorganized content using the download button
- OpenAI Model: Currently set to
gpt-4.1-mini
(configurable inLLM_MODEL
) - LlamaParse: Uses markdown output format for better structure
- Modify
LLM_MODEL
variable to use different OpenAI models - Adjust the reorganization prompt in the
reorganize_markdown()
function - Customize the RAG system prompt in the
rag()
function
smart-content-extraction/
├── app.py # Main Streamlit application
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── formal image.jpg # Logo image (optional)
└── .env # Environment variables (not tracked)
- Document Analysis: Extract and analyze content from research papers, reports, and presentations
- Data Processing: Convert scanned documents and images to searchable text
- Content Creation: Reorganize and structure extracted content for better readability
- Research Assistant: Ask questions about document content using natural language
- Batch Processing: Handle multiple documents with consistent extraction quality
- Primary Method: MarkItDown attempts structured extraction
- Fallback Method: LlamaParse handles OCR when structured extraction fails
- Content Processing: OpenAI GPT models enhance and reorganize content
- Interactive Layer: RAG system enables intelligent question-answering
- Graceful fallback between extraction methods
- Comprehensive error messages for debugging
- Robust file handling with temporary file management
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
- Large files may take longer to process due to API rate limits
- Some complex layouts might require manual review after extraction
- OCR accuracy depends on image quality and text clarity
- Initial release with basic extraction and reorganization features
- Integrated MarkItDown and LlamaParse for robust content extraction
- Added interactive Q&A functionality using RAG
📌 Data Scientist & AI Developer | 🎓 Master of AI Engineering
MIT License © Ahmed Zeyad Tareq
⭐ If you find this project useful, please give it a star on GitHub!