A powerful Node.js tool to download audio from YouTube, Spotify, or podcast URLs, transcribe using Google Gemini AI, and generate comprehensive summaries with speaker identification and tone analysis.
- 📥 Download audio from YouTube videos, Spotify episodes/podcasts, or direct MP3 URLs
- ✂️ Smart chunking - Splits long audio into manageable 10-minute segments
- 🎯 Advanced transcription using Google Gemini AI with:
  - Speaker identification
  - Tone/emotion analysis
  - Timestamp preservation
- 🔀 Intelligent merging of transcription chunks
- ✨ Automatic extraction of:
  - Key highlights and themes
  - Comprehensive summary
  - Speaker statistics
- 📊 Multiple output formats:
  - Structured JSON
  - Formatted text transcript
  - Detailed metadata report
- Node.js (v16 or higher)
- pnpm - Fast, disk space efficient package manager
  npm install -g pnpm
- yt-dlp - For YouTube downloads
  # macOS
  brew install yt-dlp
  # Ubuntu/Debian
  sudo apt install yt-dlp
  # Windows
  # Download from https://github.com/yt-dlp/yt-dlp/releases
  Note: Spotify support is handled automatically through the integrated spotify-dl package - no additional installation required.
- ffmpeg - For audio processing
  # macOS
  brew install ffmpeg
  # Ubuntu/Debian
  sudo apt install ffmpeg
  # Windows
  # Download from https://ffmpeg.org/download.html
- Google Gemini API Key
  - Get your API key from: https://makersuite.google.com/app/apikey
# Clone the repository
git clone <repository-url>
cd audio-transcriber
# Install dependencies
pnpm install
# Copy environment file and add your API key
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY
- YouTube
  - Videos, playlists, channels
  - Automatic audio extraction
  - Metadata preservation
- Spotify
  - Episodes and podcasts
  - Automatically finds matching content on YouTube
  - Preserves original metadata and structure
  - No authentication required for most content
- Direct MP3 URLs
  - Any publicly accessible MP3 file
  - Direct download without conversion
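How an input URL is routed to the right downloader is not spelled out above; the sketch below shows one way it could work, using a hypothetical `detectSource` helper. The actual routing lives in `src/modules/audioDownloader.ts` and `src/modules/spotifyDownloader.ts` and may differ.

```typescript
// Hypothetical sketch of routing an input URL to a source type.
// The real logic in src/modules/ may differ from this.
type SourceType = "youtube" | "spotify" | "direct-mp3";

function detectSource(rawUrl: string): SourceType {
  const url = new URL(rawUrl);
  const host = url.hostname.replace(/^www\./, "");

  if (host === "youtube.com" || host === "youtu.be") return "youtube";
  if (host === "open.spotify.com") return "spotify";
  if (url.pathname.toLowerCase().endsWith(".mp3")) return "direct-mp3";

  throw new Error(`Unsupported URL: ${rawUrl}`);
}

// detectSource("https://open.spotify.com/episode/EPISODE_ID") === "spotify"
```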
# Transcribe a YouTube video
pnpm dev "https://www.youtube.com/watch?v=VIDEO_ID"
# Transcribe a Spotify episode/podcast
pnpm dev "https://open.spotify.com/episode/EPISODE_ID"
# Transcribe a direct MP3 URL
pnpm dev "https://example.com/podcast.mp3"
# With custom output path
pnpm dev "https://www.youtube.com/watch?v=VIDEO_ID" -o ./my-transcript.json
audio-transcriber <url> [options]
Options:
-o, --output <path> Output file path (default: ./output/transcript_[timestamp].json)
-t, --temp-dir <path> Temporary directory for processing (default: ./temp)
-c, --chunk-duration <secs> Duration of each chunk in seconds (default: 600)
--concurrency <number> Number of chunks to process in parallel during transcription (default: 5)
-k, --keep-chunks Keep temporary audio chunks after processing
-s, --save-temp-files Keep all temporary files including raw audio, downsampled audio, chunks, and intermediate files
--no-text Skip generating text transcript file
--no-report Skip generating metadata report file
-h, --help Display help for command
# Display dependency information
pnpm dev info
# Run a test transcription
pnpm dev test
# Build the project
pnpm build
# Clean temporary files
pnpm clean
The tool generates three types of output files:
{
"title": "Video/Audio Title",
"source_url": "https://...",
"full_transcript": [
{
"start": "00:00:00",
"end": "00:00:45",
"speaker": "Speaker 1",
"tone": "Excited",
"text": "Transcribed text..."
}
],
"highlights": ["Key point 1", "Key point 2"],
"summary": "Comprehensive summary..."
}
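For reference, that JSON maps roughly onto the following TypeScript shape. The interface and field names here are inferred from the example above; the real definitions live in `src/types/index.ts` and may differ.

```typescript
// Approximate shape of the JSON output above (inferred from the example;
// the actual interfaces in src/types/index.ts may differ).
interface TranscriptSegment {
  start: string;   // "HH:MM:SS"
  end: string;     // "HH:MM:SS"
  speaker: string; // e.g. "Speaker 1"
  tone: string;    // e.g. "Excited"
  text: string;
}

interface TranscriptionResult {
  title: string;
  source_url: string;
  full_transcript: TranscriptSegment[];
  highlights: string[];
  summary: string;
}
```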
A formatted, readable transcript with timestamps, speakers, and tone information.
A summary report containing:
- Title and source information
- Executive summary
- Key highlights
- Speaker statistics
- Tone distribution analysis
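The speaker statistics and tone distribution can be derived directly from the transcript segments. A minimal sketch, assuming the `TranscriptSegment` shape above (the tool's own report builder in `src/modules/outputBuilder.ts` may compute richer statistics):

```typescript
// Minimal sketch: tally segments per speaker or per tone.
// Assumes the TranscriptSegment shape sketched above.
function tally(
  segments: TranscriptSegment[],
  key: "speaker" | "tone"
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const segment of segments) {
    counts[segment[key]] = (counts[segment[key]] ?? 0) + 1;
  }
  return counts;
}

// const speakerStats = tally(result.full_transcript, "speaker");   // { "Speaker 1": 42, ... }
// const toneDistribution = tally(result.full_transcript, "tone");  // { "Excited": 10, ... }
```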
You can run Interview Transcriber in a Docker container for easy, reproducible usage.
docker build -t interview-transcriber .
# YouTube video
docker run --rm \
-e GEMINI_API_KEY=your_actual_api_key \
-v $(pwd)/output:/output \
interview-transcriber "https://www.youtube.com/watch?v=VIDEO_ID"
# Spotify episode
docker run --rm \
-e GEMINI_API_KEY=your_actual_api_key \
-v $(pwd)/output:/output \
interview-transcriber "https://open.spotify.com/episode/EPISODE_ID"
- This will save the transcript in your local output/ directory.
- You can also specify a custom output file:
docker run --rm \
-e GEMINI_API_KEY=your_actual_api_key \
-v $(pwd)/output:/output \
interview-transcriber "https://www.youtube.com/watch?v=VIDEO_ID" /output/my-transcript.json
Instead of specifying the API key directly, you can store it in a .env
file:
# .env
GEMINI_API_KEY=your_actual_api_key
Then run the container with:
# YouTube video
docker run --rm \
--env-file .env \
-v $(pwd)/output:/output \
interview-transcriber "https://www.youtube.com/watch?v=VIDEO_ID"
# Spotify episode
docker run --rm \
--env-file .env \
-v $(pwd)/output:/output \
interview-transcriber "https://open.spotify.com/episode/EPISODE_ID"
- The GEMINI_API_KEY environment variable is required for Google Gemini transcription.
- The /output directory inside the container should be mounted to a local directory to access results.
- All other CLI options are supported as in the native usage.
You can also use the modules programmatically:
import { AudioProcessor } from "audio-transcriber";
const processor = new AudioProcessor();
const options = {
url: "https://www.youtube.com/watch?v=VIDEO_ID",
outputPath: "./output/my-transcript.json",
chunkDuration: 600, // 10 minutes
concurrency: 5, // Process 5 chunks in parallel
};
const result = await processor.processAudio(options);
console.log(result);
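You can then work with the returned object directly. A small example, assuming `result` matches the JSON structure shown earlier:

```typescript
// Example: write a simple text transcript from the result.
// Assumes result.full_transcript matches the JSON structure shown earlier.
import { writeFile } from "node:fs/promises";

const lines = result.full_transcript.map(
  (seg) => `[${seg.start} - ${seg.end}] ${seg.speaker} (${seg.tone}): ${seg.text}`
);

await writeFile("./output/my-transcript.txt", lines.join("\n"), "utf8");
console.log(`Highlights:\n- ${result.highlights.join("\n- ")}`);
```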
audio-transcriber/
├── src/
│ ├── modules/
│ │ ├── audioDownloader.ts # YouTube/MP3 download logic
│ │ ├── spotifyDownloader.ts # Spotify download logic
│ │ ├── audioChunker.ts # Audio splitting with ffmpeg
│ │ ├── transcriber.ts # Gemini AI transcription
│ │ ├── merger.ts # Chunk merging logic
│ │ ├── highlights.ts # Highlight extraction
│ │ └── outputBuilder.ts # Output file generation
│ ├── utils/
│ │ ├── timeUtils.ts # Timestamp utilities
│ │ └── fileUtils.ts # File system utilities
│ ├── types/
│ │ └── index.ts # TypeScript interfaces
│ ├── processor.ts # Main orchestrator
│ ├── cli.ts # CLI interface
│ └── index.ts # Module exports
├── tests/ # Test files
├── temp/ # Temporary processing files
├── output/ # Default output directory
└── package.json
The tool includes comprehensive error handling for:
- Network failures (with retry logic - see the sketch after this list)
- Invalid URLs
- API rate limiting
- File system errors
- Corrupted audio files
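The exact retry behavior is an implementation detail, but a generic retry-with-exponential-backoff wrapper looks roughly like this (a sketch, not the tool's actual code; `transcribeChunk` is a hypothetical function):

```typescript
// Generic retry-with-exponential-backoff sketch; not the tool's actual implementation.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  throw lastError;
}

// const text = await withRetry(() => transcribeChunk(chunkPath)); // transcribeChunk is hypothetical
```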
- Chunk Duration: Defaults to 10 minutes. Shorter chunks mean more API calls but better accuracy
- API Rate Limiting: The tool includes delays between API calls to avoid rate limiting
- Parallel Processing: Chunks are processed in parallel with configurable concurrency (default: 5). Higher concurrency speeds up processing but may hit API rate limits (see the sketch after this list)
- Concurrency Control: Use the --concurrency option to adjust parallel processing. Start with 5 for most use cases
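To make the concurrency trade-off concrete, here is a sketch of a fixed-size worker pool over audio chunks. `transcribeChunk` is again hypothetical, and this is not necessarily how the tool schedules work internally:

```typescript
// Sketch: keep at most `concurrency` transcriptions in flight at once.
// transcribeChunk is a hypothetical function; the tool's scheduler may differ.
async function processChunks(
  chunks: string[],
  transcribeChunk: (chunkPath: string) => Promise<string>,
  concurrency = 5
): Promise<string[]> {
  const results: string[] = new Array(chunks.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < chunks.length) {
      const index = next++;
      results[index] = await transcribeChunk(chunks[index]);
    }
  }

  // Start `concurrency` workers that keep pulling chunks until none remain.
  await Promise.all(
    Array.from({ length: Math.min(concurrency, chunks.length) }, () => worker())
  );
  return results;
}
```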
- "GEMINI_API_KEY not found"
  - Make sure you've created a .env file with your API key
- "yt-dlp not found"
  - Install yt-dlp using the instructions above
- "ffmpeg not found"
  - Install ffmpeg using the instructions above
- Transcription fails
  - Check your Gemini API quota
  - Try reducing chunk duration
  - Ensure audio quality is sufficient
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
- Google Gemini AI for transcription capabilities
- yt-dlp for YouTube download functionality
- spotify-dl by SwapnilSoni1999 for Spotify support
- ffmpeg for audio processing
- The open-source community for various dependencies