- Pakhi Singhal – 22BDS042
- Ravi Raj – 22BDS051
- Preethi Varshala S – 22BDS045
This project is a cloud-enabled multimodal language processing system that combines translation, text-to-speech (TTS), speech-to-text (STT), and image captioning in a single web platform. It uses Google Cloud APIs for language and speech processing and a BLIP model (via Hugging Face) for image captioning. The frontend is deployed on Netlify and the backend on Render, and the modular design targets applications in e-learning, assistive technology, and content generation.
The system adopts a decoupled architecture:
- Backend: Python Flask RESTful API
- Frontend: React.js SPA (Single Page Application)
This separation lets the two layers scale, be developed, and be deployed through CI/CD independently.
All services (text, speech, image) are modularized across backend and frontend layers to improve flexibility, maintainability, and testing. The AI functionalities are accessed via secure, optimized endpoints.
- Lightweight and extensible Python framework
- Integrates with Google Cloud APIs and local machine learning models
- Handles audio/image uploads, API routing, error handling, and inference processing
- Intuitive, responsive, and cross-device compatible
- Supports live feedback, drag-and-drop uploads, and dynamic content rendering
- Google Cloud APIs for Translation, TTS, and STT
- Hugging Face Salesforce BLIP model hosted locally for fast, rich image captioning
- Platform: Netlify (static hosting)
- CI/CD: Auto-deployment via GitHub integration
- Build Settings: `npm run build` with publish directory `build/`
- Env Config: `REACT_APP_API_URL` dynamically set for backend API
- Live Demo: https://multimodaal.netlify.app/
- Platform: Render (Python web service)
- Server: `gunicorn` for WSGI production-ready deployment
- Dependencies: Installed via `requirements.txt`
- Google Credentials: Managed using Render Secret Files
- System Requirements: Render instance with minimum 2GB RAM; `ffmpeg` configured for audio processing
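The backend deployment settings listed here could be captured in a Gunicorn configuration file. The following is a minimal sketch, not the project's actual configuration: `bind`, `workers`, and `timeout` are standard Gunicorn settings, but the fallback port, the worker count, and the timeout value are assumptions chosen for a small instance that also holds the BLIP model in memory.

```python
# gunicorn.conf.py (hypothetical file; the values below are assumptions,
# not the project's real settings)
import os

# Render passes the listening port through the PORT environment variable;
# 10000 is only a fallback for local runs.
bind = "0.0.0.0:" + os.environ.get("PORT", "10000")

# A single worker keeps only one copy of the BLIP model in memory.
workers = 1

# Model loading and captioning can take longer than Gunicorn's default
# 30-second worker timeout, so allow more time per request.
timeout = 120
```

It could be started with `gunicorn -c gunicorn.conf.py app:app`, where `app:app` assumes the Flask object inside `app.py` is named `app`.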
- User Input: User interacts via text, image, or audio on the React frontend
- API Call: Axios sends requests to the Flask backend
- Service Routing:
  - For Image: BLIP model
  - For Text/Speech: Google Cloud APIs
- Response Handling: JSON (text/captions), audio file (TTS)
- Display: Dynamic rendering of results in the UI
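The request flow above can be sketched independently of any web framework. This is an illustrative sketch only: `route_request`, the handler names, and the stub handlers are hypothetical, and the `audio/mpeg` content type assumes the TTS audio is returned as MP3. In the real backend, a Flask view would play this role and the handlers would call the BLIP model or the Google Cloud clients.

```python
import json
from typing import Callable, Dict, Tuple

# Each handler takes the raw request payload and returns text (str) or audio (bytes).
Handler = Callable[[bytes], object]


def route_request(kind: str, payload: bytes,
                  handlers: Dict[str, Handler]) -> Tuple[str, bytes]:
    """Dispatch a request to its service and return (content_type, body)."""
    if kind not in handlers:
        raise ValueError(f"unsupported request type: {kind}")
    result = handlers[kind](payload)
    if isinstance(result, bytes):
        # TTS produces an audio file rather than JSON (MP3 is an assumption).
        return "audio/mpeg", result
    # Translations, transcripts, and captions are returned as JSON.
    body = json.dumps({"result": result}).encode("utf-8")
    return "application/json", body


# Stub handlers stand in for the BLIP model and the Google Cloud clients.
handlers = {
    "caption": lambda data: "a placeholder caption",
    "translate": lambda data: data.decode("utf-8").upper(),
    "tts": lambda data: b"\x00fake-audio-bytes",
}

content_type, body = route_request("caption", b"<image bytes>", handlers)
print(content_type, body.decode("utf-8"))
```

Keeping the handlers in a dictionary means a new modality can be added without touching the dispatch logic, which matches the modular design described for the backend.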
- Multilingual Translation: Detects and translates input text across languages
- Text-to-Speech (TTS): Converts translated text into natural-sounding speech
- Speech-to-Text (STT): Transcribes audio files in multiple languages with optional translation
- Image Captioning: Generates rich, descriptive captions from user-uploaded images using the BLIP model
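As an illustration of the image captioning feature, the sketch below shows one way a BLIP model could be loaded and queried through Hugging Face Transformers. It is not the project's code: the checkpoint name `Salesforce/blip-image-captioning-base` is an assumption (the project only states that a Salesforce BLIP model is used), and `load_blip` and `caption_image` are hypothetical helper names.

```python
from functools import lru_cache

# Assumed checkpoint; the project only specifies "Salesforce BLIP model".
MODEL_NAME = "Salesforce/blip-image-captioning-base"


@lru_cache(maxsize=1)
def load_blip():
    """Load the processor and model once and reuse them across requests."""
    # Imported lazily so the heavy dependencies load only on first use.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(MODEL_NAME)
    model = BlipForConditionalGeneration.from_pretrained(MODEL_NAME)
    model.eval()  # inference only, no training
    return processor, model


def caption_image(image_path: str, max_new_tokens: int = 30) -> str:
    """Return a generated caption for the image stored at image_path."""
    import torch
    from PIL import Image

    processor, model = load_blip()
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():  # gradients are not needed for captioning
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Caching the loaded model avoids reloading the weights on every request, which matters on a memory-constrained instance.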
- Frontend: React.js, Axios, HTML5/CSS3
- Backend: Python 3.x, Flask, Flask-CORS, gunicorn
- AI & ML:
  - Google Cloud (Translate, TTS, STT)
  - Hugging Face Transformers (BLIP model)
  - PyTorch, pydub, Pillow
- Hosting:
  - Netlify (frontend)
  - Render (backend)
- Python: 3.8+
- Node.js & npm: Node.js v16+
- ffmpeg & ffprobe: Required for audio file preprocessing
- Google Cloud Project: With necessary APIs enabled:
  - Cloud Translation API
  - Cloud Text-to-Speech API
  - Cloud Speech-to-Text API
- Service Account Key: `.json` file for authenticated API access
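The prerequisites above can be verified before starting the backend. The following is a small sketch under stated assumptions: `check_prerequisites` is a hypothetical helper, and it assumes the service account key is located through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which is the variable Google Cloud client libraries read by default but is not confirmed as this project's mechanism.

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional


def check_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or newer is required")
    # Both tools are needed for audio file preprocessing.
    for tool in ("ffmpeg", "ffprobe"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    # Assumed location of the service account key (see lead-in).
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(key_path):
        problems.append(f"service account key not found: {key_path}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

Passing the environment as a parameter keeps the check easy to test without changing the real process environment.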
- Backend Setup: `cd backend`, then `python app.py`
- Frontend Access: Open browser at `http://127.0.0.1:5000` or the configured frontend port
View our presentation covering architecture, design rationale, challenges, and outcomes:
This project shows how cloud-native services and modern machine learning models can be combined into a cohesive solution for multimodal applications. With speech, text, and image capabilities and cloud deployment, it provides a foundation for applications in cross-lingual communication, accessibility, and content automation.