Recording.2025-01-25.020918.1.mp4
Visual Symphony is an AI-powered pipeline that transforms visual inputs into immersive audio narratives. It achieves this through a three-step process:
- Image Captioning: Utilizes the BLIP model to generate a descriptive text caption for an input image.
- Contextual Story Generation: Employs the Mixtral-8x7B large language model (via Groq API) to craft a short, emotive story based on the image caption.
- Emotional Speech Synthesis: Leverages the Bark library to convert the generated story into natural-sounding speech with emotional cues.
The project includes a Streamlit-based web user interface (webui.py
) for easy interaction and a core logic script (app.py
) that can be used independently.
- Multimodal Processing Chain: Seamlessly converts images to text, text to story, and story to speech.
- Context-Aware Narratives: Generates stories that are relevant to the visual input.
- Emotive Speech: Produces audio output with natural intonation and emotional expression.
- Interactive Web UI: Provides an easy-to-use interface (via Streamlit) for uploading images and experiencing the generated narratives.
- Modular Core Logic: The backend functions in
app.py
can be integrated into other Python projects. - Optimized for Performance: Configured for efficient inference, considering TensorFlow/PyTorch interplay and memory usage (e.g., using smaller Bark models, offloading to CPU if needed).
# Clone the repository (if you haven't already)
# git clone <repository_url>
# cd <repository_directory>
# Create a virtual environment
python -m venv .venv
# Activate the virtual environment
# On Windows:
# .venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Create a .env
file in the project root directory by copying the .env.template
or creating a new one:
# .env
GROQ_API_KEY=your_groq_api_key_here
Replace your_groq_api_key_here
with your actual Groq API key.
- Ensure your
.env
file is configured with theGROQ_API_KEY
. - Run the Streamlit application:
streamlit run webui.py
- Open the provided URL in your web browser.
- Upload an image, and the application will guide you through generating the story and audio.
The core functions are available in app.py
:
from app import image2text, gen_story, gen_tts
# Example workflow:
try:
image_path = "path/to/your/image.jpg" # Replace with your image path
# 1. Generate caption from image
caption = image2text(image_path)
print(f"Generated Caption: {caption}")
# 2. Generate story from caption
story = gen_story(caption)
print(f"Generated Story: {story}")
# 3. Generate audio from story
# This will save 'bark_generation.wav' in the current directory
gen_tts(story)
print("Audio generated as bark_generation.wav")
except Exception as e:
print(f"An error occurred: {e}")
The pipeline follows a sequential flow:
graph TD
A[Image Input ] -->|via script| B[BLIP Image Captioning];
B -->|Generated Caption| C[Mixtral-8x7B Story Generation via Groq];
C -->|Generated Story Text| D[Bark Speech Synthesis];
D -->|Audio Data| E[Audio Output bark_generation.wav];
This project emphasizes readable and maintainable code through comprehensive internal documentation:
-
app.py
: This file contains the backend logic for the Visual Narrative Generator. It includes functions for:image2text()
: Analyzing an image and generating a text caption.gen_story()
: Taking a text input (like a caption) and generating a short story.gen_tts()
: Converting the generated story text into an audio file. The module itself, each function, and key implementation details are documented with docstrings and inline comments. These explain the purpose, parameters, return values, and models/libraries used.
-
webui.py
: This script builds the interactive web user interface using Streamlit.- The
main()
function orchestrates the UI layout, handles user inputs (like image uploads and API keys), manages the application flow through different stages (image analysis, story creation, audio playback/download), and calls the backend functions fromapp.py
. Detailed docstrings for the module and themain()
function, along with inline comments, explain the UI components, their configurations, and the logic flow.
- The
We encourage you to explore the code. The embedded documentation should provide a clear understanding of how each part of the Visual Symphony works.
Apache 2.0 - See the LICENSE
file for details.