Here’s my latest fun project: an iPhone app that performs real-time translation using Speech-to-Text and Text-to-Speech technologies. For example, the app will listen to someone speaking Portuguese, translate it to English, and speak the translation in English. It also works the other way around, listening to English and speaking the translation in Portuguese.
The app is currently configured to help Portuguese and English speakers communicate with each other. The idea is to have the iPhone sit between them… listening to them, translating all spoken/typed words in real time, and speaking the translated words in the target language.
Currently, this is not a commercial/published app. And although there might be similar apps already out there, I didn’t find anything exactly like this… I primarily created it so my son (who barely speaks Portuguese) can have a custom tool to help him communicate with many of his relatives (who barely speak English). Let’s see how it plays out… 🙂
My inspiration for this project dates back to the early 80s, when my second grade teacher assigned a book for all students to read: “As sete cidades do arco-iris” (translation: the seven cities of the rainbow), a book by Brazilian author Teresa Noronha. The storyline was about a kid taken to a different planet where people from the different cities spoke different languages. On his way to the cities the main character was given a device to carry around his neck that would translate everyone to him in real-time, and vice-versa, also translating his words to everyone. Sadly, I no longer have that book. But the concept for that real-time translation/communication device stuck with me and I finally have the tools to create something like that.
My app is written in Swift, Apple’s programming language. For the Translation and Speech-to-Text components I’m using Google’s APIs. For the Text-to-Speech, I’m using iOS’s own built-in speech synthesis, which is free, unlike the Google APIs, which have a tiny, tiny cost. But I might switch to Google’s Text-to-Speech API to try to implement a feature I outline a few paragraphs below.
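To give a sense of how the pieces fit together, here’s a minimal sketch of the translate-then-speak step. It assumes Google’s Cloud Translation REST API (v2) with a placeholder API key, and uses AVSpeechSynthesizer for the spoken output; the actual app handles errors, audio sessions, and key storage more carefully.

```swift
import Foundation
import AVFoundation

// Placeholder key; the real app would load this from somewhere safer.
let apiKey = "YOUR_GOOGLE_API_KEY"
let synthesizer = AVSpeechSynthesizer()

// Translate text with Google's Cloud Translation REST API (v2),
// then speak the result using iOS's built-in speech synthesis.
func translateAndSpeak(_ text: String, from source: String, to target: String) {
    let url = URL(string: "https://translation.googleapis.com/language/translate/v2?key=\(apiKey)")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try? JSONSerialization.data(withJSONObject:
        ["q": text, "source": source, "target": target, "format": "text"])

    URLSession.shared.dataTask(with: request) { data, _, _ in
        guard let data = data,
              let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any],
              let payload = json["data"] as? [String: Any],
              let translations = payload["translations"] as? [[String: Any]],
              let translated = translations.first?["translatedText"] as? String
        else { return }

        // Speak the translation in the target language.
        DispatchQueue.main.async {
            let utterance = AVSpeechUtterance(string: translated)
            utterance.voice = AVSpeechSynthesisVoice(language: target == "pt" ? "pt-BR" : "en-US")
            synthesizer.speak(utterance)
        }
    }.resume()
}

// Example: hear Portuguese, speak English.
translateAndSpeak("Bom dia, tudo bem?", from: "pt", to: "en")
```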
It amazes me how accurate Speech Recognition has become. Not long ago, around 2010, while working for another company, I did a very comprehensive research project to identify a good Speech Recognition framework. After evaluating the top free and commercial options available at the time, recognition was only correct about 60% of the time, at best. Back then, the only way to ensure good recognition rates was to define a controlled dictionary ahead of time to limit the search space. Today, there’s no need for controlled dictionaries. It’s amazing how much the quality has improved now that companies like Apple, Google, and Amazon use (deep) neural nets and complex models to train their services.
As for my prototype app, my next step is to come up with a way to detect the spoken language automatically. Google used to have an API for language detection, but it doesn’t seem to be available anymore (?!). Getting language detection in place will allow the device (i.e. the phone) to simply sit in front of the people having a conversation, without the need to press a button telling the app which language to listen for.
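In the meantime, one possible on-device fallback (just a sketch, and it only works on text that has already been transcribed, not on the raw audio) is Apple’s NaturalLanguage framework, constrained to the two languages the app cares about:

```swift
import NaturalLanguage

// Guess whether a transcribed sentence is Portuguese or English.
// Returns a language code such as "pt" or "en".
func detectLanguage(of transcript: String) -> String? {
    let recognizer = NLLanguageRecognizer()
    recognizer.languageConstraints = [.portuguese, .english]
    recognizer.processString(transcript)
    return recognizer.dominantLanguage?.rawValue
}

detectLanguage(of: "Bom dia, tudo bem?")          // "pt"
detectLanguage(of: "Good morning, how are you?")  // "en"
```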
The other feature I want to implement is having the Portuguese voice come out of one audio channel (e.g. left) while the English voice comes out of the other (e.g. right). That way, both people could use the same pair of earbuds to listen to each other, without the iPhone’s speaker repeating everything for everyone around to hear. But the Apple framework I’m currently using for the Text-to-Speech doesn’t seem to support that kind of channel control, so that feature will have to wait for some further research.
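One avenue I may look into (an untested sketch, assuming iOS 13’s write(_:toBufferCallback:) API): instead of calling speak(), render the utterance into PCM buffers and play them through an AVAudioEngine player node panned hard to one side:

```swift
import AVFoundation

let engine = AVAudioEngine()
let panningSynthesizer = AVSpeechSynthesizer()

// Render an utterance into PCM buffers and play them through a player node
// panned hard to one side: -1.0 = left channel, +1.0 = right channel.
func speak(_ text: String, languageCode: String, pan: Float) {
    let player = AVAudioPlayerNode()
    engine.attach(player)
    player.pan = pan

    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode)

    var connected = false
    panningSynthesizer.write(utterance) { buffer in
        guard let pcm = buffer as? AVAudioPCMBuffer, pcm.frameLength > 0 else { return }
        if !connected {
            // Connect with the synthesizer's native format once we know it.
            engine.connect(player, to: engine.mainMixerNode, format: pcm.format)
            try? engine.start()
            player.play()
            connected = true
        }
        player.scheduleBuffer(pcm)
    }
}

// Example: Portuguese on the left earbud, English on the right.
speak("Bom dia", languageCode: "pt-BR", pan: -1.0)
speak("Good morning", languageCode: "en-US", pan: 1.0)
```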
In any case, the current prototype seems to be working pretty well, and I’m looking forward to seeing people test it out! 🙂
Posted by André Lessa