A comprehensive solution for extracting, processing, and standardizing data from documents (PDF files) and web sources using both open-source and enterprise tools. The application uses FastAPI for backend processing and Streamlit for an intuitive user interface, and is deployed on Google Cloud Run.
Below is the workflow diagram for the AI Application:
- User: The end-user interacts with the application via the Streamlit frontend.
- Streamlit App: The frontend built using Streamlit.
- FastAPI Backend: The backend server that handles data processing.
- Data Extraction:
  - PyMuPDF / Camelot: For extracting data from PDF files with open-source tools (see the sketch after this list).
  - Azure Document Intelligence and Adobe PDF Extract API: For extracting data from PDF files with enterprise tools.
  - BeautifulSoup: For web scraping with open-source tools.
  - Apify: For web scraping with enterprise tools.
- Standardization Tools:
  - Docling: A tool for standardizing PDF-to-Markdown conversion.
  - MarkItDown: Another tool for further data standardization.
- AWS S3 Bucket: Used for storing processed data.
- Google Cloud Run: Used for deploying the FastAPI application.
- Streamlit In-built Deployment: Used for deploying the Streamlit application (UI/UX).
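To make the open-source extraction path concrete, here is a minimal sketch combining PyMuPDF (page text), Camelot (tables), and BeautifulSoup (web pages). It is illustrative only: the file name and URL are placeholders, and the project's real logic lives in `open_source_parsing.py` and `OSWebScrap.py`.

```python
# Minimal sketch of the open-source extraction path.
# "sample.pdf" and the URL are placeholders, not project inputs.
import camelot                 # pip install camelot-py[cv]
import fitz                    # pip install pymupdf
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_pdf(path):
    """Return the PDF's plain text plus any detected tables as DataFrames."""
    with fitz.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    tables = camelot.read_pdf(path, pages="all")  # needs Ghostscript installed
    return text, [t.df for t in tables]


def scrape_page(url):
    """Fetch a page and return its visible text."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    text, tables = extract_pdf("sample.pdf")
    print(f"Extracted {len(text)} characters and {len(tables)} tables")
    print(scrape_page("https://example.com")[:200])
```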
├── .devcontainer/ # Development container configuration
├── .streamlit/ # Streamlit configuration files
├── api/ # FastAPI backend services
├── frontend/ # Streamlit frontend application
├── notebooks/ # Development and testing notebooks
├── .dockerignore # Docker ignore rules
├── .gitignore # Git ignore rules
├── Azure_Document_Intelligence.py # Azure Document Intelligence for enterprise PDF extraction
├── Cloud_Run.md # Cloud deployment instructions
├── Dockerfile # Main application Dockerfile
├── EnterpriseWebScrap.py # Enterprise web scraping module
├── OSWebScrap.py # Open-source web scraping module
├── README.md # Project documentation
├── ai_application_workflow.png # Architecture diagram
├── docker-compose.yml # Multi-container Docker setup
├── docklingextraction.py # Docling-based Markdown generator for PDFs
├── open_source_parsing.py # PyMuPDF and Camelot for open-source PDF extraction
└── requirements.txt # Python dependencies
- The User uploads data via the Streamlit Frontend.
- The Frontend sends the data to the FastAPI Backend.
- The Backend processes the data using one or more of the following:
  - PyMuPDF / Camelot for open-source PDF extraction.
  - BeautifulSoup for open-source web scraping.
  - Azure Document Intelligence / Adobe PDF Extract API for enterprise document processing.
  - Apify for enterprise web scraping.
- The extracted data is standardized using Docling and MarkItDown (a sketch of this step follows the list).
- The processed data is stored in an AWS S3 Bucket.
- The Frontend retrieves the processed data from the S3 Bucket and displays it to the User.
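As a rough illustration of the standardization and storage steps above, the sketch below converts a PDF to Markdown with Docling and uploads the result with boto3. The bucket name and object key are placeholders; the project's actual conversion code is in `docklingextraction.py`.

```python
# Hedged sketch of the standardize-and-store step: PDF -> Markdown -> S3.
import os

import boto3                                               # pip install boto3
from docling.document_converter import DocumentConverter   # pip install docling


def standardize_and_store(pdf_path, bucket, key):
    # Convert the PDF into a standardized Markdown representation.
    result = DocumentConverter().convert(pdf_path)
    markdown = result.document.export_to_markdown()

    # Upload to S3; boto3 reads AWS credentials from the environment / .env.
    s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION"))
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"))
    return markdown


# Placeholder bucket and key, for illustration only.
standardize_and_store("sample.pdf", "my-processed-data-bucket", "processed/sample.md")
```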
- Python 3.7+
- Diagrams library for generating the workflow diagram.
- AWS account with S3 bucket access.
- Streamlit and FastAPI installed for frontend and backend development.
- Install Google Cloud SDK
- Install Docker
Clone the repository:

```bash
git clone https://github.com/your-username/BigData_InClass_Proj1.git
cd BigData_InClass_Proj1
```
Create a `.env` file and add the required credentials:

```env
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=your_aws_region
AZURE_DOCUMENT_INTELLIGENCE_KEY=your_azure_key
AZURE_FORM_RECOGNIZER_KEY=your_azure_form_recognizer_key
APIFY_TOKEN=your_apify_token
ADOBE_API_ID=your_adobe_api_id
ADOBE_API_SECRET=your_adobe_api_secret
```
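If you want to check how these values reach the code, a typical pattern (assumed here, since the exact handling depends on the modules in `api/`) is python-dotenv plus `os.getenv`:

```python
# Assumed credential-loading pattern; names must match what the backend expects.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

aws_key = os.getenv("AWS_ACCESS_KEY_ID")
apify_token = os.getenv("APIFY_TOKEN")
azure_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")
```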
Build and run the application using Docker Compose:

```bash
docker-compose build --no-cache
docker-compose up -d
```
- Ensure the custom icons (`microsoft.png`, `docling.png`, `markitdown.png`, `streamlit.png`) are in the `./icons/` directory.
- Generate the workflow diagram:

```bash
python generate_diagram.py
```
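For reference, `generate_diagram.py` could look roughly like the sketch below, built with the diagrams library (which requires Graphviz). The node wiring is a simplified guess at the real script; only the icon file names come from the step above.

```python
# Plausible generate_diagram.py sketch (pip install diagrams; requires Graphviz).
from diagrams import Cluster, Diagram
from diagrams.aws.storage import S3
from diagrams.custom import Custom
from diagrams.gcp.compute import Run
from diagrams.onprem.client import User

with Diagram("AI Application Workflow", filename="ai_application_workflow", show=False):
    user = User("User")
    frontend = Custom("Streamlit App", "./icons/streamlit.png")
    backend = Run("FastAPI Backend")

    with Cluster("Data Extraction"):
        azure = Custom("Azure Document Intelligence", "./icons/microsoft.png")

    with Cluster("Standardization"):
        docling = Custom("Docling", "./icons/docling.png")
        markitdown = Custom("MarkItDown", "./icons/markitdown.png")

    storage = S3("AWS S3 Bucket")

    user >> frontend >> backend >> azure >> docling >> markitdown >> storage
```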
1. Run the FastAPI backend:

```bash
cd api
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
uvicorn backend:app --reload --port 8080
```

2. Open your browser and navigate to http://localhost:8080 to interact with the backend.

3. Run the Streamlit frontend (start it only after the FastAPI backend is running):

```bash
cd frontend
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
streamlit run frontend.py
```

4. Open your browser and navigate to http://localhost:8501 to interact with the Streamlit app.
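To confirm the backend is up before using the UI, you can hit FastAPI's auto-generated Swagger docs, which every FastAPI app serves at `/docs`:

```bash
curl http://localhost:8080/docs   # should return the Swagger UI HTML
```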
Deployed Application:
- FastAPI Backend: https://fastapi-streamlit-974490277552.us-central1.run.app/
- Streamlit Frontend: https://streamlit-app-974490277552.us-central1.run.app/
More details on the cloud deployment process are provided in `Cloud_Run.md` in the root folder.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
- Replace `your-username` in the repository URL with your actual GitHub username.
- Ensure the `generate_diagram.py` script is created to generate the workflow diagram.
- Update the `LICENSE` file if you choose a different license.