A comprehensive solution for extracting, processing, and standardizing data from documents (PDF files) and web sources using both open-source and enterprise tools. The application uses FastAPI for backend processing and Streamlit for an intuitive user interface, and is deployed on Google Cloud Run.
Below is the workflow diagram for the AI Application:
- User: The end-user interacts with the application via the Streamlit frontend.
- Streamlit App: The frontend built using Streamlit.
- FastAPI Backend: The backend server that handles data processing.
- Data Extraction:
  - PyMuPDF / Camelot: For extracting data from PDF files with open-source tools (see the sketch after this list).
  - Azure Document Intelligence and Adobe PDF Extract API: For extracting data from PDF files with enterprise tools.
  - BeautifulSoup: For web scraping with open-source tools.
  - Apify: For web scraping with enterprise tools.
- Standardization Tools:
  - Docling: A tool for standardizing PDF-to-Markdown conversion.
  - MarkItDown: Another tool for further data standardization.
- AWS S3 Bucket: Used for storing processed data.
- Google Cloud Run: Used for deploying the FastAPI application.
- Streamlit In-built Deployment: Used for deploying the Streamlit application (UI/UX).
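To make the open-source extraction path concrete, here is a minimal sketch combining PyMuPDF (page text), Camelot (tables), and BeautifulSoup (web pages). It is illustrative only: the file name and URL are placeholders, and the project's real logic lives in `open_source_parsing.py` and `OSWebScrap.py`.

```python
# Minimal sketch of the open-source extraction path.
# "sample.pdf" and the URL are placeholders, not project inputs.
import camelot                 # pip install camelot-py[cv]
import fitz                    # pip install pymupdf
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_pdf(path):
    """Return the PDF's plain text plus any detected tables as DataFrames."""
    with fitz.open(path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    tables = camelot.read_pdf(path, pages="all")  # needs Ghostscript installed
    return text, [t.df for t in tables]


def scrape_page(url):
    """Fetch a page and return its visible text."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    text, tables = extract_pdf("sample.pdf")
    print(f"Extracted {len(text)} characters and {len(tables)} tables")
    print(scrape_page("https://example.com")[:200])
```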
├── .devcontainer/ # Development container configuration
├── .streamlit/ # Streamlit configuration files
├── api/ # FastAPI backend services
├── frontend/ # Streamlit frontend application
├── notebooks/ # Development and testing notebooks
├── .dockerignore # Docker ignore rules
├── .gitignore # Git ignore rules
├── Azure_Document_Intelligence.py # Azure Document Intelligence for enterprise PDF extraction
├── Cloud_Run.md # Cloud deployment instructions
├── Dockerfile # Main application Dockerfile
├── EnterpriseWebScrap.py # Enterprise web scraping module
├── OSWebScrap.py # Open-source web scraping module
├── README.md # Project documentation
├── ai_application_workflow.png # Architecture diagram
├── docker-compose.yml # Multi-container Docker setup
├── docklingextraction.py # Docling-based Markdown generator for PDFs
├── open_source_parsing.py # PyMuPDF and Camelot for open-source PDF extraction
└── requirements.txt # Python dependencies
- The User uploads data via the Streamlit Frontend.
- The Frontend sends the data to the FastAPI Backend.
- The Backend processes the data using one or more of the following:
  - PyMuPDF / Camelot for open-source PDF extraction.
  - BeautifulSoup for open-source web scraping.
  - Azure Document Intelligence / Adobe PDF Extract API for enterprise document processing.
  - Apify for enterprise web scraping.
- The extracted data is standardized using Docling and MarkItDown (a sketch of this step follows the list).
- The processed data is stored in an AWS S3 Bucket.
- The Frontend retrieves the processed data from the S3 Bucket and displays it to the User.
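As a rough illustration of the standardization and storage steps above, the sketch below converts a PDF to Markdown with Docling and uploads the result with boto3. The bucket name and object key are placeholders; the project's actual conversion code is in `docklingextraction.py`.

```python
# Hedged sketch of the standardize-and-store step: PDF -> Markdown -> S3.
import os

import boto3                                               # pip install boto3
from docling.document_converter import DocumentConverter   # pip install docling


def standardize_and_store(pdf_path, bucket, key):
    # Convert the PDF into a standardized Markdown representation.
    result = DocumentConverter().convert(pdf_path)
    markdown = result.document.export_to_markdown()

    # Upload to S3; boto3 reads AWS credentials from the environment / .env.
    s3 = boto3.client("s3", region_name=os.getenv("AWS_REGION"))
    s3.put_object(Bucket=bucket, Key=key, Body=markdown.encode("utf-8"))
    return markdown


# Placeholder bucket and key, for illustration only.
standardize_and_store("sample.pdf", "my-processed-data-bucket", "processed/sample.md")
```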
- Python 3.7+
- Diagrams library for generating the workflow diagram.
- AWS account with S3 bucket access.
- Streamlit and FastAPI installed for frontend and backend development.
- Install Google Cloud SDK
- Install Docker
Clone the repository:

```bash
git clone https://github.com/your-username/BigData_InClass_Proj1.git
cd BigData_InClass_Proj1
```
Create a `.env` file and add the required credentials:

```env
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=your_aws_region
AZURE_DOCUMENT_INTELLIGENCE_KEY=your_azure_key
AZURE_FORM_RECOGNIZER_KEY=your_azure_form_recognizer_key
APIFY_TOKEN=your_apify_token
ADOBE_API_ID=your_adobe_api_id
ADOBE_API_SECRET=your_adobe_api_secret
```
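If you want to check how these values reach the code, a typical pattern (assumed here, since the exact handling depends on the modules in `api/`) is python-dotenv plus `os.getenv`:

```python
# Assumed credential-loading pattern; names must match what the backend expects.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory

aws_key = os.getenv("AWS_ACCESS_KEY_ID")
apify_token = os.getenv("APIFY_TOKEN")
azure_key = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_KEY")
```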
Build and run the application using Docker Compose:

```bash
docker-compose build --no-cache
docker-compose up -d
```
- Ensure the custom icons (`microsoft.png`, `docling.png`, `markitdown.png`, `streamlit.png`) are in the `./icons/` directory.
- Generate the workflow diagram:

```bash
python generate_diagram.py
```
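For reference, `generate_diagram.py` could look roughly like the sketch below, built with the diagrams library (which requires Graphviz). The node wiring is a simplified guess at the real script; only the icon file names come from the step above.

```python
# Plausible generate_diagram.py sketch (pip install diagrams; requires Graphviz).
from diagrams import Cluster, Diagram
from diagrams.aws.storage import S3
from diagrams.custom import Custom
from diagrams.gcp.compute import Run
from diagrams.onprem.client import User

with Diagram("AI Application Workflow", filename="ai_application_workflow", show=False):
    user = User("User")
    frontend = Custom("Streamlit App", "./icons/streamlit.png")
    backend = Run("FastAPI Backend")

    with Cluster("Data Extraction"):
        azure = Custom("Azure Document Intelligence", "./icons/microsoft.png")

    with Cluster("Standardization"):
        docling = Custom("Docling", "./icons/docling.png")
        markitdown = Custom("MarkItDown", "./icons/markitdown.png")

    storage = S3("AWS S3 Bucket")

    user >> frontend >> backend >> azure >> docling >> markitdown >> storage
```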
1. Run the FastAPI backend:

```bash
cd api
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
uvicorn backend:app --reload --port 8080
```

2. Open your browser and navigate to http://localhost:8080 to interact with the backend.

3. Run the Streamlit frontend (start it only after the FastAPI backend is running):

```bash
cd frontend
python -m venv venv
venv/Scripts/activate
pip install -r requirements.txt
streamlit run frontend.py
```

4. Open your browser and navigate to http://localhost:8501 to interact with the Streamlit app.
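To confirm the backend is up before using the UI, you can hit FastAPI's auto-generated Swagger docs, which every FastAPI app serves at `/docs`:

```bash
curl http://localhost:8080/docs   # should return the Swagger UI HTML
```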
Deployed Application:
- FastAPI Backend: https://fastapi-streamlit-974490277552.us-central1.run.app/
- Streamlit Frontend: https://streamlit-app-974490277552.us-central1.run.app/
More details on the cloud deployment process are provided in `Cloud_Run.md` in the root folder.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
- Replace `your-username` in the repository URL with your actual GitHub username.
- Ensure the `generate_diagram.py` script is created to generate the workflow diagram.
- Update the `LICENSE` file if you choose a different license.