Description
Overview
The current `GenerativeAIMultimodalExample` sample project only supports image analysis, with a single input method (using `PhotosPicker` to select images), and only processes image (`UIImage`) formats.
However, the Swift Firebase AI API can also analyze other media types, including video, audio, and PDF documents.
Firebase AI API Analysis
According to the official documentation (https://firebase.google.com/docs/ai-logic), all media types use the same `generateContent` API, which makes the extension straightforward:
```swift
// Image analysis (implemented)
let response = try await model.generateContent(image, prompt)

// Video analysis (to be extended)
let video = InlineDataPart(data: videoData, mimeType: "video/mp4")
let response = try await model.generateContent(video, prompt)

// Audio analysis (to be extended)
let audio = InlineDataPart(data: audioData, mimeType: "audio/mpeg")
let response = try await model.generateContent(audio, prompt)

// PDF document analysis (to be extended)
let pdf = InlineDataPart(data: pdfData, mimeType: "application/pdf")
let response = try await model.generateContent(pdf, prompt)
```
Proposed Design Plan
1. UI Enhancement
Tab Navigation Design
Add four media-type selection tabs at the top of the interface:
- 📷 Image
- 🎥 Video
- 🎵 Audio
- 📄 Document
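The tab bar above could be sketched as a segmented `Picker` backed by a small enum. This is a minimal sketch; the `MediaType` and `MediaTypeTabs` names are hypothetical, not part of the current sample:

```swift
import SwiftUI

// Hypothetical enum backing the four media-type tabs.
enum MediaType: String, CaseIterable, Identifiable {
  case image = "📷 Image"
  case video = "🎥 Video"
  case audio = "🎵 Audio"
  case document = "📄 Document"

  var id: Self { self }
}

// Segmented control that lets the user switch between media types.
struct MediaTypeTabs: View {
  @Binding var selection: MediaType

  var body: some View {
    Picker("Media Type", selection: $selection) {
      ForEach(MediaType.allCases) { type in
        Text(type.rawValue).tag(type)
      }
    }
    .pickerStyle(.segmented)
  }
}
```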
Input Component Upgrade
Expand the `MultimodalInputField` component:
- Dynamic Button: Show only the file picker that corresponds to the selected tab, rather than presenting pickers for every tab
- Type Indicator: Clearly display the currently selected media type
- Preview Optimization: Provide an appropriate preview for each media type
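The "dynamic button" idea could look something like the sketch below: a single attach button whose `fileImporter` accepts only the content types for the current tab, passed in by the parent view (the `AttachButton` name and its parameters are illustrative assumptions):

```swift
import SwiftUI
import UniformTypeIdentifiers

// Sketch: one attach button whose file importer only accepts the
// content types for the currently selected tab.
struct AttachButton: View {
  let title: String           // e.g. "Attach Video"
  let allowedTypes: [UTType]  // e.g. [.movie] for the Video tab
  let onPick: (URL) -> Void   // hand the picked file URL back to the caller

  @State private var showImporter = false

  var body: some View {
    Button(title) { showImporter = true }
      .fileImporter(isPresented: $showImporter,
                    allowedContentTypes: allowedTypes) { result in
        if case .success(let url) = result { onPick(url) }
      }
  }
}
```

The Image tab could keep using `PhotosPicker` as today, with `fileImporter` covering the video, audio, and document tabs.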
2. Data Processing Extension
File Handler
Implement file processing logic for each media type, support DocumentPicker selection, convert to InlineDataPart
.
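A minimal sketch of that conversion, assuming files arrive as security-scoped URLs from a file importer (the `mimeType(for:)` helper is a hypothetical mapping, not SDK API, and the extension list is not exhaustive):

```swift
import Foundation
import FirebaseAI  // recent SDKs; older samples import FirebaseVertexAI

// Hypothetical helper: derive a MIME type from the file extension.
func mimeType(for url: URL) -> String {
  switch url.pathExtension.lowercased() {
  case "mp4": return "video/mp4"
  case "mov": return "video/quicktime"
  case "mp3": return "audio/mpeg"
  case "wav": return "audio/wav"
  case "pdf": return "application/pdf"
  default: return "application/octet-stream"
  }
}

// Load a picked file and wrap it as an InlineDataPart for generateContent.
func inlineDataPart(for url: URL) throws -> InlineDataPart {
  // Files picked via fileImporter need security-scoped access.
  let accessing = url.startAccessingSecurityScopedResource()
  defer { if accessing { url.stopAccessingSecurityScopedResource() } }
  let data = try Data(contentsOf: url)
  return InlineDataPart(data: data, mimeType: mimeType(for: url))
}
```

Note that inline data counts against the request size limit, so very large videos may need the Cloud Storage path instead.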
ViewModel Refactor
Extend `PhotoReasoningViewModel` to support state management for multiple media types.
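One possible shape for the refactor, sketched under the assumption that the attachment is stored generically as an `InlineDataPart` rather than as image-only state (the class and model names below are illustrative, not the sample's actual code):

```swift
import SwiftUI
import FirebaseAI  // recent SDKs; older samples import FirebaseVertexAI

// Sketch: a view model that reasons over any inline media attachment.
@MainActor
final class MultimodalReasoningViewModel: ObservableObject {
  @Published var attachment: InlineDataPart?  // image, video, audio, or PDF
  @Published var userInput: String = ""
  @Published var outputText: String = ""
  @Published var inProgress = false

  // Model name is an assumption; use whichever model the sample configures.
  private let model = FirebaseAI.firebaseAI()
    .generativeModel(modelName: "gemini-2.0-flash")

  func reason() async {
    guard let attachment else { return }
    inProgress = true
    defer { inProgress = false }
    do {
      let response = try await model.generateContent(attachment, userInput)
      outputText = response.text ?? ""
    } catch {
      outputText = "Error: \(error.localizedDescription)"
    }
  }
}
```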
3. User Experience Optimization
Smart Prompts: Supply a default prompt to the `MultimodalInputField` based on the selected media type, for example:
- Image: "Describe the content of this image"
- Video: "Summarize the main content of this video"
- Audio: "Transcribe and analyze this audio"
- Document: "Extract key information from the document"
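The mapping above could be a simple function on the media-type enum (the `MediaType` enum here is hypothetical; the prompt strings are taken from the list above):

```swift
// Sketch: default prompt per media type.
enum MediaType { case image, video, audio, document }

func defaultPrompt(for type: MediaType) -> String {
  switch type {
  case .image: return "Describe the content of this image"
  case .video: return "Summarize the main content of this video"
  case .audio: return "Transcribe and analyze this audio"
  case .document: return "Extract key information from the document"
  }
}
```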
Conclusion
This feature extension will showcase the full multimedia analysis capabilities of Firebase AI and give iOS developers a more comprehensive learning and reference resource. @peterfriese