Speech Data Builder is a powerful, open-source web application designed to streamline the creation of professional speech datasets for Text-to-Speech (TTS) and Speech-to-Text (STT) model training. Whether you're developing voice assistants, speech recognition systems, or voice synthesis applications, this tool provides everything you need to build high-quality speech corpora.
- AI-Powered Transcription: Automatically transcribe your audio files using state-of-the-art AI models from Google AI Studio and OpenAI.
- Multiple Export Formats: Export your datasets in popular formats including LJSpeech, Common Voice, and custom configurations.
- Audio Visualization: Precise waveform display with region selection for accurate transcript alignment.
- Text Normalization: Automatic text normalization for TTS model training with special handling for multiple languages.
- Batch Processing: Efficiently process multiple audio files in one go, saving time and effort.
- Real-time Editing: Edit transcripts with real-time saving and instant audio playback.
- Customizable Settings: Configure the tool to match your specific dataset requirements.
- Dark Mode Support: Work comfortably in any lighting conditions with our eye-friendly dark mode.
- ML Engineers: Building custom voice models and speech recognition systems
- Voice Assistant Developers: Creating training data for voice assistants
- Researchers: Collecting and organizing speech data for linguistic research
- Content Creators: Transcribing podcasts, interviews, or educational content
- Language Preservationists: Documenting endangered languages and dialects
- Open the application in your web browser
- Upload your audio files (MP3, WAV, OGG, FLAC supported)
- Use AI transcription or manually enter transcripts
- Customize and normalize text as needed
- Export in your preferred format
Visit https://fs-17.github.io/SpeechDataBuilder to start creating speech datasets immediately.
- Clone this repository: `git clone https://github.com/FS-17/SpeechDataBuilder.git`
- Open the `index.html` file in any modern browser, either:
  - by double-clicking the file, or
  - by serving it with a local web server: `npx serve`
Speech Data Builder excels at creating datasets in the popular LJSpeech format, widely used for training TTS models. The tool automatically:
- Normalizes text (converting numbers to words, removing special characters)
- Handles non-Latin scripts with specialized normalization
- Generates properly formatted metadata files
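This README does not spell out the tool's exact normalization rules. The sketch below illustrates the two steps named above under simplified, hypothetical rules: English spelling of standalone integers 0-999 and removal of symbols outside a small punctuation whitelist. It then assembles one pipe-delimited `id|transcript|normalized` row. `number_to_words`, `normalize`, and `ljspeech_line` are hypothetical helpers, not the app's API, and non-Latin digit handling, decimals, and digit grouping such as "1,000" are out of scope.

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0 <= n < 1000 in English (hypothetical, simplified rule)."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = _ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    # Spell out standalone ASCII integers of 1-3 digits ("42" -> "forty-two").
    text = re.sub(r"\b[0-9]{1,3}\b", lambda m: number_to_words(int(m.group())), text)
    # Remove symbols outside a small whitelist; \w keeps letters in any script.
    text = re.sub(r"[^\w\s'.,?!-]", "", text)
    # Collapse whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def ljspeech_line(file_id: str, transcript: str) -> str:
    # LJSpeech rows are pipe-delimited, so a raw "|" would corrupt the row.
    if "|" in transcript:
        raise ValueError("transcript must not contain '|'")
    return f"{file_id}|{transcript}|{normalize(transcript)}"


if __name__ == "__main__":
    # Prints: clip_0001|I have 3 cats & 2 dogs.|I have three cats two dogs.
    print(ljspeech_line("clip_0001", "I have 3 cats & 2 dogs."))
```

Keeping the raw transcript next to the normalized one mirrors the three-column LJSpeech layout, so a training pipeline can choose which column to use.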
Connect with leading AI services to automate the transcription process:
- Google AI Studio: Leverage the latest Gemini models (including 2.5 Pro and Flash) for accurate transcription
- OpenAI: Utilize Whisper and GPT models for speech recognition and text normalization
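This README does not document how the app calls these services. As an illustration only, assuming the OpenAI route uses the public `/v1/audio/transcriptions` endpoint with the `whisper-1` model, the sketch below builds (without sending) the multipart request such a call involves. `build_transcription_request` is a hypothetical helper; the app itself makes its requests from JavaScript in the browser.

```python
import urllib.request
import uuid

# Public OpenAI transcription endpoint (assumed, not confirmed as the app's).
OPENAI_TRANSCRIBE_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(api_key: str, filename: str, audio: bytes,
                                model: str = "whisper-1") -> urllib.request.Request:
    """Build, but do not send, a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    # Text field carrying the model name.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    # File field header; the raw audio bytes follow it.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    # Closing delimiter ends the multipart body.
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + closing
    return urllib.request.Request(
        OPENAI_TRANSCRIBE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending the request with `urllib.request.urlopen(req)` returns a JSON body whose `text` field holds the transcript (the default `json` response format). Note that the audio leaves your machine whenever you use this kind of cloud transcription.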
- Click the "Upload Audio Files" button or drag and drop audio files onto the upload area
- Supported formats: MP3, WAV, OGG, FLAC
- Organize your speech samples for efficient processing
- Select a file from the list to load it in the editor
- Use the audio playback controls to listen to the speech
- Type the accurate transcript in the transcript editor
- For TTS datasets, ensure proper punctuation and formatting
- Click "Save Transcript" or use Ctrl+S to save
- Navigate to the "Export Dataset" tab
- Choose your preferred format (LJSpeech, CSV, JSON, TXT)
- For TTS datasets: LJSpeech format includes normalized text
- For STT datasets: CSV or JSON provide flexible options
- Optionally include audio files in the export
- Click "Export Dataset" to download your speech data
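When audio files are included, a common LJSpeech layout puts a `metadata.csv` next to a `wavs/` folder. The app builds its archive in the browser with JSZip; the Python sketch below only illustrates that conventional layout with the standard `zipfile` module, and is an assumption about, not a copy of, the app's exact archive structure. `build_ljspeech_zip` is a hypothetical helper.

```python
import io
import zipfile


def build_ljspeech_zip(clips):
    """Pack clips into an in-memory LJSpeech-style zip.

    clips: list of (file_id, wav_bytes, transcript, normalized) tuples.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        lines = []
        for file_id, wav_bytes, transcript, normalized in clips:
            # Audio goes under wavs/, named after the metadata ID.
            zf.writestr(f"wavs/{file_id}.wav", wav_bytes)
            lines.append(f"{file_id}|{transcript}|{normalized}")
        # One pipe-delimited row per clip, no header (LJSpeech convention).
        zf.writestr("metadata.csv", "\n".join(lines) + "\n")
    return buf.getvalue()
```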
| Shortcut | Action |
|---|---|
| Space | Play/Pause audio |
| ← / → | Skip back/forward 1 second |
| A / D | Skip back/forward 5 seconds |
| W / S | Increase/decrease playback speed |
| R | Create region at current position |
| Ctrl+S | Save transcript |
| Ctrl+Alt+T | Toggle dark/light theme |
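A region marks a time span on the waveform. If you want to cut a marked span into a standalone clip outside the app, a minimal sketch using Python's standard `wave` module follows. It assumes uncompressed PCM WAV input and converts seconds to frame indices via the sample rate. `extract_region` is a hypothetical helper, not part of Speech Data Builder.

```python
import io
import wave


def extract_region(wav_bytes: bytes, start_s: float, end_s: float) -> bytes:
    """Return a new PCM WAV containing only [start_s, end_s) of the input."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Seconds -> frame indices, clamped to the file's length.
        start = max(0, int(round(start_s * rate)))
        end = min(total, int(round(end_s * rate)))
        if end <= start:
            raise ValueError("empty or inverted region")
        src.setpos(start)
        frames = src.readframes(end - start)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        # Same channels, sample width, and rate; the frame count is patched on close.
        dst.setparams(params)
        dst.writeframes(frames)
    return out.getvalue()
```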
LJSpeech format (one pipe-delimited line per clip):
`filename|transcript text|normalized text`

CSV format:
`FileName,Transcript`
`file1.wav,"This is the transcript for file one."`
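For STT training you will often load this CSV in Python. A minimal round-trip sketch with the standard `csv` module follows, assuming the two-column `FileName,Transcript` header shown above. The writer quotes a transcript only when it contains commas, quotes, or newlines, and the reader also accepts the always-quoted style of the example. `to_stt_csv` and `from_stt_csv` are hypothetical helpers.

```python
import csv
import io


def to_stt_csv(rows) -> str:
    """rows: iterable of (filename, transcript). Returns CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["FileName", "Transcript"])
    for filename, transcript in rows:
        writer.writerow([filename, transcript])
    return buf.getvalue()


def from_stt_csv(text: str):
    """Parse CSV text back into (filename, transcript) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["FileName"], row["Transcript"]) for row in reader]
```

Using the `csv` module instead of splitting on commas matters because natural transcripts often contain commas and quotation marks.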
SpeechDataBuilder supports AI-powered transcription, which helps you create large speech datasets in less time:
- Navigate to the Settings tab
- Select your preferred AI service provider
- Enter your API key
- Return to the transcription editor and click "AI Transcribe"
- Review and edit the AI-generated transcript to correct any recognition errors
SpeechDataBuilder works in all modern browsers for cross-platform speech dataset creation:
- Chrome (recommended)
- Firefox
- Edge
- Safari
Contributions are welcome! Help improve this open-source speech dataset tool:
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- WaveSurfer.js for audio visualization
- Bootstrap for UI components
- Font Awesome for icons
- JSZip for file export capabilities
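Speech Data Builder runs entirely in your browser with no server-side processing, so your audio data stays on your machine unless you use AI transcription, which sends audio to the provider you select (Google AI Studio or OpenAI). It is built with modern web technologies: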
- JavaScript ES6+
- WaveSurfer.js for audio visualization
- Bootstrap 5 for responsive design
- IndexedDB for client-side storage
For questions, feedback, or support, please open an issue on our GitHub repository.
Speech Data Builder: making voice dataset creation accessible to everyone.
_Keywords: TTS dataset, STT dataset, speech recognition data, voice dataset creator, LJSpeech format, speech corpus, AI voice training_
- Import workflow: the ability to import existing datasets in LJSpeech, CSV, JSON, or TXT formats
- Gemini models: updated model references to include the latest versions
- Offline support (PWA): installable app with cached core assets for offline work
- Keyboard shortcuts: Space, ←/→, A/D, W/S, R, Ctrl+S, Ctrl+Alt+T
- Accessibility: Skip link to main content, improved focus outlines
- SEO: Correct canonical link and updated social sharing image
Open the site in Chrome or Edge and choose "Install app" from the address bar. The app works offline after the first load.