💡 Smart Content Extraction

A powerful Streamlit application that intelligently extracts and processes content from various file formats using advanced AI techniques. This tool combines structured text extraction with OCR capabilities and provides intelligent content reorganization and question-answering features. Try it live

🎯 Overview

Smart Content Extraction is designed to handle diverse file formats and extract meaningful content using a two-tier approach:

Structured Extraction: Uses MarkItDown for direct text extraction from supported formats
OCR Fallback: Employs LlamaParse for optical character recognition when structured extraction fails
AI Enhancement: Leverages OpenAI's GPT models for content reorganization and intelligent Q&A

✨ Features

📂 Universal File Support: Works with multiple file formats including PDFs, images, documents, and more
🔍 Smart Extraction: Intelligent fallback from structured to OCR-based extraction
🧹 Content Reorganization: AI-powered content restructuring for better readability
💬 Interactive Q&A: Ask questions about your extracted content using RAG (Retrieval-Augmented Generation)
⬇️ Export Options: Download reorganized content as text files
📊 Token Counting: Monitor content size for API usage optimization

🛠️ Installation

Prerequisites

Python 3.7+
API keys for:
- OpenAI API
- LlamaParse API

Setup Steps

Clone the repository:

git clone https://github.com/AhmedZeyadTareq/Smart-markdown-Extractor.git
cd Smart-markdown-Extractor
python -m venv venv
venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Set up .env file with API keys:

OPENAI_API_KEY=your-openai-api-key
LLAMA_API_PARSE=your-llamaparse-api-key

Run the application:
```
streamlit run app.py
```

📋 Dependencies

streamlit
openai
llama-parse
markitdown
pillow
tiktoken
python-dotenv

🚀 Usage

Method 1: Local Development

Follow the installation steps above
Run streamlit run app.py
Open your browser to http://localhost:8501

Method 2: Deployed Version

Access the live application at: Try it live

Method 3: Import Functions in Your Code (Optional)

from app import convert_file, reorganize_markdown, rag

md_content = convert_file("document.pdf")
organized_md = reorganize_markdown(md_content)
answer = rag(organized_md, "What is this document about?")
print(answer)

📖 How to Use

Upload File: Click "📂 Choose File" and select your document
Extract Content: Click "Start 🔁" to begin extraction
Reorganize (Optional): Click "🧹 Reorganize Content" for AI-enhanced formatting
Ask Questions: Use the text input to ask questions about your content
Download: Save the reorganized content using the download button

🔧 Configuration

API Configuration

OpenAI Model: Currently set to gpt-4.1-mini (configurable in LLM_MODEL)
LlamaParse: Uses markdown output format for better structure

Customization Options

Modify LLM_MODEL variable to use different OpenAI models
Adjust the reorganization prompt in the reorganize_markdown() function
Customize the RAG system prompt in the rag() function

📁 Project Structure

smart-content-extraction/
├── app.py                 # Main Streamlit application
├── requirements.txt       # Python dependencies
├── README.md             # Project documentation
├── formal image.jpg      # Logo image (optional)
└── .env                  # Environment variables (not tracked)

🎯 Use Cases

Document Analysis: Extract and analyze content from research papers, reports, and presentations
Data Processing: Convert scanned documents and images to searchable text
Content Creation: Reorganize and structure extracted content for better readability
Research Assistant: Ask questions about document content using natural language
Batch Processing: Handle multiple documents with consistent extraction quality

🔍 Technical Details

Extraction Pipeline

Primary Method: MarkItDown attempts structured extraction
Fallback Method: LlamaParse handles OCR when structured extraction fails
Content Processing: OpenAI GPT models enhance and reorganize content
Interactive Layer: RAG system enables intelligent question-answering

Error Handling

Graceful fallback between extraction methods
Comprehensive error messages for debugging
Robust file handling with temporary file management

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

🐛 Known Issues

Large files may take longer to process due to API rate limits
Some complex layouts might require manual review after extraction
OCR accuracy depends on image quality and text clarity

📝 Changelog

v1.0.0

Initial release with basic extraction and reorganization features
Integrated MarkItDown and LlamaParse for robust content extraction
Added interactive Q&A functionality using RAG

👨‍💻 Developed By

Ahmed Zeyad Tareq

📌 Data Scientist & AI Developer | 🎓 Master of AI Engineering

📞 WhatsApp: +905533333587
GitHub | LinkedIn | Kaggle

📄 License

⭐ If you find this project useful, please give it a star on GitHub!

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.streamlit		.streamlit
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.py		app.py
formal image.jpg		formal image.jpg
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

💡 Smart Content Extraction

🎯 Overview

✨ Features

🛠️ Installation

Prerequisites

Setup Steps

📋 Dependencies

🚀 Usage

Method 1: Local Development

Method 2: Deployed Version

Method 3: Import Functions in Your Code (Optional)

📖 How to Use

🔧 Configuration

API Configuration

Customization Options

📁 Project Structure

🎯 Use Cases

🔍 Technical Details

Extraction Pipeline

Error Handling

🤝 Contributing

🐛 Known Issues

📝 Changelog

v1.0.0

👨‍💻 Developed By

Ahmed Zeyad Tareq

📄 License

About

Uh oh!

Releases

Packages

Languages

License

AhmedZeyadTareq/Smart-markdown-Extractor

Folders and files

Latest commit

History

Repository files navigation

💡 Smart Content Extraction

🎯 Overview

✨ Features

🛠️ Installation

Prerequisites

Setup Steps

📋 Dependencies

🚀 Usage

Method 1: Local Development

Method 2: Deployed Version

Method 3: Import Functions in Your Code (Optional)

📖 How to Use

🔧 Configuration

API Configuration

Customization Options

📁 Project Structure

🎯 Use Cases

🔍 Technical Details

Extraction Pipeline

Error Handling

🤝 Contributing

🐛 Known Issues

📝 Changelog

v1.0.0

👨‍💻 Developed By

Ahmed Zeyad Tareq

📄 License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages