[FirebaseAI] FR: Extend GenerativeAIMultimodalExample to support analysis of multiple media types #1729

@YoungHypo

Description
Overview

The current GenerativeAIMultimodalExample sample project supports only image analysis: its single input method uses PhotosPicker to select images, and it processes only image (UIImage) input.

However, the Swift Firebase AI API can actually analyze more media types, including video, audio, and PDF documents.

Firebase AI API Analysis

According to the official documentation (https://firebase.google.com/docs/ai-logic), all media types use the same generateContent call shape, which makes the extension much simpler.

// Image analysis (implemented)
let response = try await model.generateContent(image, prompt)

// Video analysis (to be extended)
let video = InlineDataPart(data: videoData, mimeType: "video/mp4")
let response = try await model.generateContent(video, prompt)

// Audio analysis (to be extended)
let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")
let response = try await model.generateContent(audio, prompt)

// PDF document analysis (to be extended)
let pdf = InlineDataPart(data: pdfData, mimeType: "application/pdf")
let response = try await model.generateContent(pdf, prompt)

Proposed Design Plan

1. UI Enhancement

Tab Navigation Design

Add 4 media type selection tabs at the top of the interface:

  • 📷 Image
  • 🎥 Video
  • 🎵 Audio
  • 📄 Document

Input Component Upgrade

Expand the MultimodalInputField component:

  1. Dynamic Button: Show only the file picker that corresponds to the selected tab, rather than pickers for every tab.
  2. Type Indicator: Clearly display the currently selected media type.
  3. Preview Optimization: Provide an appropriate preview for each media type.
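The tab-driven picker could be sketched roughly as below. Note that `MediaType` and `MediaTypePicker` are hypothetical names for this proposal, not types in the current sample:

```swift
import SwiftUI

// Hypothetical sketch: a single enum drives both the tab bar
// and which file picker the input field presents.
enum MediaType: String, CaseIterable, Identifiable {
  case image = "📷 Image"
  case video = "🎥 Video"
  case audio = "🎵 Audio"
  case document = "📄 Document"

  var id: String { rawValue }
}

struct MediaTypePicker: View {
  @Binding var selection: MediaType

  var body: some View {
    Picker("Media Type", selection: $selection) {
      ForEach(MediaType.allCases) { type in
        Text(type.rawValue).tag(type)
      }
    }
    .pickerStyle(.segmented)
  }
}
```

Binding the `MultimodalInputField` button to `selection` keeps the UI to one picker at a time instead of stacking four pickers in the view.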

2. Data Processing Extension

File Handler

Implement file-processing logic for each media type: support file selection via DocumentPicker, then convert the selected file into an InlineDataPart.
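A minimal sketch of such a handler, assuming the URL comes from SwiftUI's `fileImporter` or a `UIDocumentPickerViewController` (the helper name and the octet-stream fallback are assumptions for illustration):

```swift
import Foundation
import FirebaseAI
import UniformTypeIdentifiers

// Hypothetical helper: reads a file picked by the user and wraps it
// as an InlineDataPart with a MIME type inferred from the extension.
func inlineDataPart(for url: URL) throws -> InlineDataPart {
  // URLs returned by a document picker are security-scoped.
  guard url.startAccessingSecurityScopedResource() else {
    throw CocoaError(.fileReadNoPermission)
  }
  defer { url.stopAccessingSecurityScopedResource() }

  let data = try Data(contentsOf: url)
  let mimeType = UTType(filenameExtension: url.pathExtension)?
    .preferredMIMEType ?? "application/octet-stream"
  return InlineDataPart(data: data, mimeType: mimeType)
}
```

Using `UTType` keeps the MIME mapping in one place, so adding a new media type only requires a new tab, not new conversion code.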

ViewModel Refactor

Extend PhotoReasoningViewModel to support state management for multiple media types.
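One possible shape for the refactored view model, as a hedged sketch: the class name and model name are placeholders, and the `FirebaseAI.firebaseAI().generativeModel(modelName:)` entry point follows the Firebase AI Logic documentation linked above.

```swift
import SwiftUI
import FirebaseAI

// Hypothetical sketch: all media types share one attachment slot
// and one generateContent call path.
@MainActor
final class MultimodalReasoningViewModel: ObservableObject {
  @Published var attachment: InlineDataPart?  // image, video, audio, or PDF
  @Published var outputText = ""
  @Published var inProgress = false

  // Model name is an assumption; use whatever the sample currently configures.
  private let model = FirebaseAI.firebaseAI()
    .generativeModel(modelName: "gemini-2.0-flash")

  func reason(prompt: String) async {
    guard let attachment else { return }
    inProgress = true
    defer { inProgress = false }
    do {
      let response = try await model.generateContent(attachment, prompt)
      outputText = response.text ?? ""
    } catch {
      outputText = "Error: \(error.localizedDescription)"
    }
  }
}
```

Because every media type is carried as an `InlineDataPart`, the view model does not need per-type branches at the request site.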

3. User Experience Optimization

Smart Prompts: Pre-fill the MultimodalInputField with a prompt appropriate to the selected media type, for example:

  • Image: "Describe the content of this image"
  • Video: "Summarize the main content of this video"
  • Audio: "Transcribe and analyze this audio"
  • Document: "Extract key information from the document"
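The prompt mapping above could be expressed as a small, self-contained function (the `MediaKind` enum and function name are hypothetical):

```swift
// Hypothetical mapping from the selected tab to a default prompt.
// The user can still edit the text before sending.
enum MediaKind { case image, video, audio, document }

func defaultPrompt(for kind: MediaKind) -> String {
  switch kind {
  case .image: return "Describe the content of this image"
  case .video: return "Summarize the main content of this video"
  case .audio: return "Transcribe and analyze this audio"
  case .document: return "Extract key information from the document"
  }
}
```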

Conclusion

This feature extension will showcase the full multimedia analysis capabilities of Firebase AI, giving iOS developers a more comprehensive learning and reference resource. @peterfriese
