Iteration 1

Project Title: Mimic-me

Project color: Bloo

Members

Yu-chi Chang (kikichang)
Gary Qian (garyqian)
Ryan Newell (rnewell4)
David Li (zdavidli)
Richard Chen (richarizardd)

Vision Statement (summary):

Ever since texting and instant messaging became a large part of our interpersonal communication, we have lost some of the intricacies of talking face-to-face. We’ve always wanted to hear our friends talking to us, even when we’re forced to chat over text. Talking over text on a phone or a computer always feels a bit impersonal, since the human element of voice and the particular quirks of speech are left out. We want to bring back some part of face-to-face chatting even when we have to talk over text, such as when video calling is impractical due to data limitations or an unsuitable environment to call from.

Our solution to this problem is to reconstruct a person’s voice from the basic phonemes of English and piece words together based on how an individual would say each phoneme. For instance, the word “cat” has three distinct sounds, so if we know how someone says each sound separately, we can join the sounds together and reconstruct how they’d say the word, just by saving the building blocks of language! Our project will integrate with Facebook Messenger and vocalize every message received in the corresponding friend's voice.

Features (Library Core): The voice library is a self-contained module that is capable of training voice models, saving them, and performing text-to-speech.

Train and export custom voice model
Provide a loader and a class for the voice model
Text-to-speech: string to audio file

Features (App Core): Our initial plan is to read Facebook Messenger. Depending on feasibility, however, we may move to reading SMS text messages or a custom messenger system. The features below are the core features, and depending on complexity, several features from "Extension Features" will also be implemented.

Read out Facebook messages in a user’s voice with support for group messages
Log in with Facebook
Text-to-speech: enter a message and hear it read out in a voice

Extension Features: Because we are unsure of the complexity of voice model training, we have these goals as extended features:

Speech-to-text-to-Speech (in different voice).
Share/publicity settings for voice models
Multiple voice models per user
Text-to-speech for other languages (e.g. hear yourself speak Spanish).
Phonemes from words (instead of directly reading phonemes)
Search other people’s public models
Advanced voice model that emulates emotion.

Domain Analysis: We will be working with emulating human voices. To do this, we researched the key components of voice. Human speech can be broken down into base sounds called phonemes, which are strung together to form words and sentences. Our plan is to develop a simple phoneme library as a voice model by directly recording these base phonemes. It turns out that only 44 phonemes are required to replicate English speech. With these 44 phonemes, we believe that we can build a customized model that is understandable but not necessarily highly accurate. For our applications, this is sufficient.
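The idea of joining recorded phonemes into words can be sketched as follows. This is a minimal illustration, not the planned implementation: the `LEXICON` table, the function names, and the use of raw byte strings as stand-ins for audio clips are all assumptions made for the example. A real version would look words up in a pronunciation dictionary and join actual audio samples.

```python
# Hypothetical mini-lexicon mapping words to phoneme symbols (ARPAbet-style,
# with stress markers omitted for brevity). Not a real dictionary file.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "at": ["AE", "T"],
}


def word_to_phonemes(word):
    """Look up a word's phonemes; raises KeyError for unknown words."""
    return LEXICON[word.lower()]


def synthesize(text, voice_model, gap=b""):
    """Join a speaker's recorded phoneme clips into one audio byte string.

    voice_model maps phoneme symbol -> audio bytes recorded by the speaker.
    gap is an optional stretch of silence inserted between words.
    """
    words = []
    for word in text.split():
        clips = [voice_model[p] for p in word_to_phonemes(word)]
        words.append(b"".join(clips))
    return gap.join(words)


# Stand-in "recordings": real clips would be audio samples, not letters.
demo_model = {"K": b"k", "AE": b"a", "T": b"t"}
```

Calling `synthesize("cat", demo_model)` joins the three stand-in clips in order. Plain concatenation like this is likely to sound choppy at clip boundaries, which is consistent with the goal of "understandable, but not necessarily highly accurate".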

Key Use cases:

Train a custom voice model
- User presses the "Train" button on the home page. A sound is displayed for the user to say, as well as example words where the sound is used.
- User presses mic button to start recording and says the sound.
- User presses mic button again when finished.
- The step-by-step walkthrough of the training process continues until all 44 phonemes are recorded.
- This creates a model that is saved under the user's Facebook profile, and the UI displays that the model is saved.
Read out Facebook messages in a user’s voice. When the user receives a message, the app reads it out in the sender's voice if the sender has a stored voice model.
- User visits our website
- The system either logs the user in automatically or the user clicks "Sign in with Facebook"
- The app reads out any incoming messages
Share/publicity settings for voice models
- User selects who may use the voice for speaking {just me, public, friends, friends of friends, list of specific friends} from the list on the home page
Text-to-speech
- The user types into the textbox on the home page.
- Once the user types the message and pushes the "speak" button, the server generates an audio file of the message in the user’s voice.
- The app plays the audio file.

Architecture:

Voice Library/API
- Takes in mp3 phoneme snippets and forms a model that can be saved out as a .dat file. The .dat file will likely be in zip format and hold all of the audio and metadata for the model. This may also be a pickle dump.
- Data structure that represents the voice model and outputs text-to-speech audio
- Data loader to load in .dat file
- Tech:
  - Python
  - Pickle
  - Python AudioSegment
  - CMU Pronouncing Dictionary
  - Python audio library
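As a rough sketch of the zip-format .dat idea described for the voice library, the following uses only Python's standard `zipfile` and `json` modules. The archive layout (`metadata.json` plus one file per phoneme under `phonemes/`) and the function names `save_model` and `load_model` are assumptions for illustration; the actual format has not been decided and may end up being a pickle dump instead.

```python
import json
import zipfile


def save_model(path_or_file, phoneme_clips, metadata):
    """Write phoneme clips and metadata into one zip-format archive.

    phoneme_clips maps phoneme symbol -> encoded audio bytes (e.g. mp3 data).
    metadata is any JSON-serializable dict.
    """
    with zipfile.ZipFile(path_or_file, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("metadata.json", json.dumps(metadata))
        for symbol, clip in phoneme_clips.items():
            # Assumed layout: one entry per phoneme under "phonemes/".
            archive.writestr(f"phonemes/{symbol}.mp3", clip)


def load_model(path_or_file):
    """Read an archive written by save_model back into (clips, metadata)."""
    prefix, suffix = "phonemes/", ".mp3"
    with zipfile.ZipFile(path_or_file, "r") as archive:
        metadata = json.loads(archive.read("metadata.json"))
        clips = {}
        for name in archive.namelist():
            if name.startswith(prefix) and name.endswith(suffix):
                symbol = name[len(prefix):-len(suffix)]
                clips[symbol] = archive.read(name)
    return clips, metadata
```

Both functions accept either a file path such as `"voice.dat"` or an open binary file object, so the server could build a model in memory before deciding where to store it. One reason to prefer a zip over a pickle here is that loading a zip does not execute code from an untrusted upload.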
Messenger Frontend Integration
- The website will send GET requests to the Facebook servers every second and compare the 10 most recent messages with the results of the previous request.
- Any new messages received will be placed into a queue to be read, and removed when they are successfully read.
- Reading is done by sending the string to the backend server and playing back the received audio file
- Training is done by recording 44 phonemes and sending each to the server as an audio file. This is then compiled by the server and stored as a voice model.
- Tech:
  - RESTful website
  - Facebook login API
  - Facebook graph API
  - AngularJS or React
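The polling and queueing logic described for the frontend can be sketched in a language-agnostic way. The version below is written in Python for consistency with the rest of the stack, although the real frontend would be JavaScript. It assumes each message carries a unique ID, which is an assumption of this example; `find_new_messages` and `ReadQueue` are hypothetical names.

```python
from collections import deque


def find_new_messages(previous, latest):
    """Return messages in `latest` that were not in `previous`, oldest first.

    Both arguments are lists of (message_id, text) tuples ordered oldest to
    newest, e.g. the 10 most recent messages from two consecutive polls.
    """
    seen_ids = {message_id for message_id, _ in previous}
    return [msg for msg in latest if msg[0] not in seen_ids]


class ReadQueue:
    """Holds unread messages; a message leaves only after a successful read."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, messages):
        self._pending.extend(messages)

    def read_next(self, speak):
        """Call speak(text); drop the message only if speak returns True."""
        if not self._pending:
            return False
        _, text = self._pending[0]
        if speak(text):
            self._pending.popleft()
            return True
        return False

    def __len__(self):
        return len(self._pending)
```

Here `speak` stands for whatever function sends the text to the backend and plays the returned audio. Note that comparing only the 10 most recent messages means that if more than 10 arrive between two polls, the older ones would be missed.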
Messenger Integration Server:
- Uses a Flask server that leverages our custom voice API
- Tech:
  - Flask
  - Numpy
  - RESTful API
  - Python

UI: Login Screen (image: Login_Screen)

UI: Main/Home screen (image: Main/Home_Screen)

UI: Training screen (image: Training_Screen)

These are the main UI screens for the app. The training screen is shown 44 times, once for each phoneme. The right side of the home screen is a system for visualizing the speech. On first login, or if the user does not have a trained voice model, we will take them directly to the training screen instead of the home screen. Otherwise, the option to retrain the voice model is available from the home screen. The train button will likely be retractable to the side to make it less obtrusive, because retraining is not performed often.
