Iteration 1
Project color: Bloo
Members
- Yu-chi Chang (kikichang)
- Gary Qian (garyqian)
- Ryan Newell (rnewell4)
- David Li (zdavidli)
- Richard Chen (richarizardd)
Vision Statement (summary):
Ever since texting and instant messaging became a large part of our interpersonal communication, we have lost some of the intricacies of talking face-to-face. We’ve always wanted to hear our friends talking to us, even when we’re forced to chat over text. Talking over text on a phone or a computer always feels a bit impersonal, since the human element of voice and the particular quirks of speech are left out. We want to bring back some part of face-to-face chatting even when we have to talk over text, such as when video calling is impractical due to data limitations or an unsuitable environment to call from.
Our solution to this problem is to reconstruct a person’s voice from the basic phonemes of English and piece words together based on how that individual says each phoneme. For instance, the word “cat” has three distinct sounds, so if we know how someone says each sound separately, we can join the sounds together and reconstruct how they’d say the word, just by saving the building blocks of language! Our project will integrate with Facebook Messenger and vocalize every message received in the corresponding friend's voice.
Features (Library Core)
The voice library is a self-contained module that can train voice models, save them, and perform text-to-speech.
- Train and export custom voice model
- Provide a voice model class and a loader for saved models
- Text-to-speech: string to audio file
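As a rough illustration of the export and loading features above, here is a minimal sketch of a voice model class that round-trips through a zip archive. The class name `VoiceModel`, the `metadata.json` entry, and the `clips/` layout are assumptions made for this example only, not the project's actual format.

```python
import io
import json
import zipfile


class VoiceModel:
    """Hypothetical container mapping phoneme symbols to encoded audio bytes."""

    def __init__(self, owner, clips=None):
        self.owner = owner
        self.clips = dict(clips or {})  # phoneme symbol -> audio bytes

    def add_clip(self, phoneme, audio_bytes):
        self.clips[phoneme] = audio_bytes

    def save(self, fileobj):
        """Write metadata plus one entry per phoneme into a zip archive."""
        with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as archive:
            meta = {"owner": self.owner, "phonemes": sorted(self.clips)}
            archive.writestr("metadata.json", json.dumps(meta))
            for phoneme, audio in self.clips.items():
                archive.writestr(f"clips/{phoneme}.mp3", audio)

    @classmethod
    def load(cls, fileobj):
        """Rebuild a model from an archive written by save()."""
        with zipfile.ZipFile(fileobj) as archive:
            meta = json.loads(archive.read("metadata.json"))
            clips = {p: archive.read(f"clips/{p}.mp3") for p in meta["phonemes"]}
        return cls(meta["owner"], clips)


# Usage: save to an in-memory buffer and load it back.
model = VoiceModel("alice")
model.add_clip("K", b"\x00\x01")
buffer = io.BytesIO()
model.save(buffer)
buffer.seek(0)
restored = VoiceModel.load(buffer)
```

`save` and `load` accept either a path or a file-like object, so the same code could write to disk or to an upload buffer.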
Features (App Core)
Our initial plan is to read Facebook Messenger messages. Depending on feasibility, however, we may move to reading SMS text messages or a custom messenger system. The features below are the core features; depending on complexity, several features from "Extension Features" will also be implemented.
- Read out Facebook messages in a user’s voice, with support for group messages
- Log in with Facebook
- Text-to-speech: enter a message and hear it read out in a voice
Extension Features
Because we are unsure of the complexity of voice model training, we list these goals as extended features:
- Speech-to-text-to-speech (in a different voice)
- Share/publicity settings for voice models
- Multiple voice models per user
- Text-to-speech for other languages (e.g. hear yourself speak Spanish)
- Phonemes from words (instead of directly reading phonemes)
- Search other people’s public models
- Advanced voice model that emulates emotion
Domain Analysis: We will be emulating human voices, so we researched the key components of speech. Human speech can be broken down into base sounds called phonemes, which are strung together to form words. Our plan is to develop a simple phoneme library as a voice model by directly recording these base phonemes. It turns out that replicating English speech requires only 44 phonemes. With these 44 phonemes, we believe we can build a customized model that is understandable, though not necessarily highly accurate, which is sufficient for our application.
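The phoneme approach can be sketched in a few lines. This is only an illustration under stated assumptions: the `PRONUNCIATIONS` table is a tiny hypothetical stand-in for a real pronunciation dictionary, the phoneme symbols are ARPAbet-style labels chosen for the example, and audio clips are represented as plain lists of samples rather than real recordings.

```python
# Hypothetical pronunciation table; a real system would use a full dictionary.
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "at": ["AE", "T"],
}


def text_to_phonemes(text):
    """Split text into words and look up each word's phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")  # drop simple trailing/leading punctuation
        if word not in PRONUNCIATIONS:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def synthesize(text, clips):
    """Concatenate the speaker's recorded clip for each phoneme in order.

    `clips` maps a phoneme symbol to a list of audio samples (assumed format).
    """
    samples = []
    for phoneme in text_to_phonemes(text):
        samples.extend(clips[phoneme])
    return samples


# Usage with placeholder "recordings" for one speaker.
speaker_clips = {"K": [1, 2], "AE": [3], "T": [4, 5]}
audio = synthesize("cat", speaker_clips)
```

Plain concatenation like this ignores transitions between sounds, which is consistent with the goal of an understandable rather than highly accurate voice.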
Key Use cases:
- Train a custom voice model
- User presses the "Train" button on the home page. A sound is displayed for the user to say, along with example words in which the sound is used.
- User presses the mic button to start recording and says the sound.
- User presses the mic button again when finished.
- The step-by-step walkthrough of the training process continues until all 44 phonemes are recorded.
- This creates a model that is saved under the user's Facebook profile, and the UI displays that the model is saved.
- Read out Facebook messages in a user’s voice. When the user receives a message, the app reads it out in the sender's voice if the sender has a voice model stored.
- User visits our website.
- The system either logs the user in automatically or the user clicks "Sign in with Facebook".
- The app reads out any incoming messages.
- Share/publicity settings for voice models
- From the list on the home page, select who may use the voice for speaking: {just me, public, friends, friends of friends, list of specific friends}
- Text-to-speech
- The user types into the textbox on the home page.
- Once the user types the message and pushes the "speak" button, the server generates an audio file of the message in the user’s voice.
- The app plays the audio file.
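The training walkthrough in the first use case can be modeled as a small session object that prompts for one phoneme at a time until none remain. This is a sketch only: `TrainingSession` and its method names are hypothetical, and recordings are treated as opaque bytes.

```python
class TrainingSession:
    """Hypothetical helper that walks a user through recording each phoneme."""

    def __init__(self, phonemes):
        self.pending = list(phonemes)  # phonemes still to be recorded, in order
        self.recordings = {}           # phoneme symbol -> recorded audio bytes

    @property
    def current(self):
        """The phoneme to display next, or None when training is finished."""
        return self.pending[0] if self.pending else None

    @property
    def complete(self):
        return not self.pending

    def submit(self, audio_bytes):
        """Store a recording for the current phoneme and advance."""
        if not self.pending:
            raise RuntimeError("all phonemes are already recorded")
        if not audio_bytes:
            # An empty recording is rejected so the UI can re-prompt.
            raise ValueError("empty recording")
        phoneme = self.pending.pop(0)
        self.recordings[phoneme] = audio_bytes
        return self.current


# Usage with three phonemes instead of the full set of 44.
session = TrainingSession(["K", "AE", "T"])
while not session.complete:
    session.submit(b"fake-audio-for-" + session.current.encode())
```

Once `complete` is true, `recordings` holds everything needed to build and save the model.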
Architecture:
- Voice Library/API
- Takes in mp3 phoneme snippets and forms a model that can be saved as a .dat file. The .dat file will likely be in zip format and hold all of the audio and metadata for the model. Alternatively, the model may be stored as a pickle dump.
- Data structure that represents a voice model and outputs text-to-speech audio
- Data loader that loads a .dat file
- Tech:
- Python
- Pickle
- Python AudioSegment
- CMU Pronouncing Dictionary
- Python audio library
- Messenger Frontend Integration
- The website will send GET requests to the Facebook servers every second and compare the 10 most recent messages with the results of the previous request.
- Any new messages received will be placed into a queue to be read, and removed once they are successfully read.
- Reading is done by sending the string to the backend server and playing back the received audio file.
- Training is done by recording 44 phonemes and sending each to the server as an audio file. The server then compiles these into a voice model and stores it.
- Tech:
- RESTful website
- Facebook Login API
- Facebook Graph API
- AngularJS or React
- Messenger Integration Server:
- Uses a Flask server that leverages our custom voice API
- Tech:
- Flask
- Numpy
- RESTful API
- Python
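The polling and queueing logic described for the Messenger frontend can be sketched as follows. The frontend itself would be written in JavaScript; Python is used here only to keep the examples in one language. The message shape (a dict with an `"id"` key), the function `new_messages`, and the class `ReadQueue` are all assumptions for illustration.

```python
from collections import deque


def new_messages(previous, latest):
    """Return messages in `latest` whose ids were not in `previous`.

    Both arguments are lists of dicts with an "id" key (assumed shape),
    ordered oldest first.
    """
    seen = {message["id"] for message in previous}
    return [message for message in latest if message["id"] not in seen]


class ReadQueue:
    """Holds unread messages; a message is removed only after a successful read."""

    def __init__(self):
        self._queue = deque()

    def __len__(self):
        return len(self._queue)

    def enqueue(self, messages):
        self._queue.extend(messages)

    def read_next(self, speak):
        """Try to read the oldest message with `speak`, a callable returning bool."""
        if not self._queue:
            return None
        message = self._queue[0]
        if speak(message):
            self._queue.popleft()
            return message
        return None  # leave the message queued so it can be retried


# Usage: compare two consecutive polls and queue the difference.
previous_poll = [{"id": 1, "text": "hi"}]
latest_poll = [{"id": 1, "text": "hi"}, {"id": 2, "text": "how are you?"}]
queue = ReadQueue()
queue.enqueue(new_messages(previous_poll, latest_poll))
```

One consequence of comparing only the 10 most recent messages is that a burst of more than 10 messages between two polls would leave some unread, so the window size may need tuning.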
UI: Login Screen
UI: Main/Home screen
UI: Training screen
These are the main UI screens for the app. The training screen is shown 44 times, once for each phoneme. The right side of the home screen visualizes the speech. On first login, or if the user does not have a trained voice model, we will take them directly to the training screen instead of the home screen. Otherwise, the option to retrain the voice model is available from the home screen. The train button will likely be retractable to the side to make it less obtrusive, because retraining is performed infrequently.