# Image Caption Generator with Text-to-Speech
Demo: https://madhushankara.github.io/ai-pixel-prompt/
Heroku deployment (ended): https://image-caption-generator-1dc63aee677b.herokuapp.com/
This project leverages AI models for image caption generation and text-to-speech (TTS) conversion. It provides a streamlined interface for generating image captions with a vision-language model and for converting text into natural, high-quality speech with a preloaded TTS model. Models are loaded once and guarded against overload, keeping resource usage and response times predictable.
For generating captions from images, the project uses the BLIP (Bootstrapping Language-Image Pre-training) model:
- Model: Salesforce/blip-image-captioning-base
- Architecture: BLIP combines image understanding with natural language generation, enabling it to produce descriptive captions for a wide range of images.
- Libraries:
  - `transformers` from Hugging Face: model and processor loading.
  - `Pillow` (PIL): image handling and processing.
  - `requests`: retrieving images from external URLs.
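
For reference, caption generation with these libraries typically looks like the following minimal sketch; the `caption_image` helper and the example URL are illustrative, not taken from this repository:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

def caption_image(image_url: str) -> str:
    """Download an image and return a generated caption."""
    image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Example (placeholder URL):
# print(caption_image("https://example.com/photo.jpg"))
```

Loading the processor and model once at import time mirrors the project's preloading approach, so each request only pays for inference.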
For converting generated captions (or any input text) into speech, the project uses a preloaded Text-to-Speech (TTS) model:
- Methodology: The TTS system is designed to handle speech generation asynchronously while ensuring the model is not overloaded during high-demand usage.
- Libraries and Tools:
  - A singleton instance (`models_instance`) manages the TTS model and tracks its busy state.
  - `Flask`: tracks progress in server contexts.
  - `tempfile`: handles temporary storage of generated audio files.
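
As an illustration of the singleton-with-busy-state idea, a minimal sketch follows. Only `models_instance`, the lock-based busy tracking, and the use of `tempfile` come from the description above; the class name, the `text_to_wav` helper, and the Coqui-style `tts_to_file` call are assumptions:

```python
import tempfile
import threading

class Models:
    """Wraps the preloaded TTS model and tracks its busy state with a lock."""

    def __init__(self, tts_model):
        self.tts_model = tts_model
        self._lock = threading.Lock()

    def text_to_wav(self, text: str) -> str:
        # Non-blocking acquire: reject new work while a synthesis is running,
        # so the model is never overloaded during high-demand usage.
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("TTS model is busy; try again shortly")
        try:
            wav = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
            # Assumed Coqui-TTS-style API; substitute your model's synthesis call.
            self.tts_model.tts_to_file(text=text, file_path=wav.name)
            return wav.name
        finally:
            self._lock.release()

# Created once at startup and shared across requests:
# models_instance = Models(tts_model=preloaded_tts)  # preloaded_tts is hypothetical
```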
- Image-to-Caption: Captions are generated using BLIP, either from image URLs or local files.
- Caption-to-Speech: The TTS system takes the generated captions or user-specified text and produces speech files in `.wav` format.
- Singleton Management: Both the captioning and TTS components use dedicated locks to prevent multiple simultaneous operations, ensuring consistent and reliable performance.
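
To show how these pieces could surface in the Flask layer, here is a hypothetical pair of routes reusing `caption_image` and `models_instance` from the sketches above; the paths and JSON field names are illustrative, not the project's actual API:

```python
from flask import Flask, jsonify, request, send_file

app = Flask(__name__)

@app.route("/caption", methods=["POST"])
def caption_route():
    # Expects JSON like {"image_url": "..."} (field name is an assumption).
    return jsonify({"caption": caption_image(request.json["image_url"])})

@app.route("/speech", methods=["POST"])
def speech_route():
    # Expects JSON like {"text": "..."}; returns the synthesized .wav file.
    try:
        wav_path = models_instance.text_to_wav(request.json["text"])
    except RuntimeError:
        return jsonify({"error": "TTS model busy"}), 503
    return send_file(wav_path, mimetype="audio/wav")

if __name__ == "__main__":
    app.run()
```

Returning 503 when the lock is held keeps the busy-state contract visible to clients instead of queueing work on an already-loaded model.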
BLIP is a state-of-the-art model for vision-language tasks. It couples strong image understanding with fluent, context-appropriate caption generation, which makes it well suited to applications such as image search, accessibility tools, and content generation.