I think making the device is the easy part; the hard part is getting the translation right in most cases, which both Google and Bing have been unable to do so far. The 'Google Translate' app already has a conversation mode where two people speaking different languages can talk, but if it worked well, nothing would have prevented them from making such an earpiece already.
Even if Google Translate worked well for text (and it doesn't), translating text may still be easier than translating the spoken word. When we speak, more is left unsaid, because listeners can pick it up from nonverbal cues or from context. We also don't produce nice, grammatically correct, complete sentences; we produce fragments, make mistakes, and go back and correct things. We slur words in ways that make it difficult for voice recognition to understand us. We also use intonation, timing, and so on in speech, which a speech-recognition, translation, and text-to-speech roundtrip probably destroys (a sketch of that cascade is below). And we use less standardised, colloquial language that can be highly specific to groups of people, both in vocabulary and grammar.
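To make that roundtrip concrete, here is a minimal sketch of the cascade in Python. The three functions are hypothetical placeholders, not any real API an earpiece would use; the point is that the only thing passed between stages is plain text, so intonation, timing, and emphasis are dropped at the first step and never come back.

```python
# Minimal sketch of the ASR -> MT -> TTS cascade, with hypothetical
# placeholder functions standing in for real speech-recognition,
# translation, and speech-synthesis services.

def recognize_speech(audio_chunk: bytes) -> str:
    """Hypothetical ASR step: returns a flat text transcript.
    Intonation, timing, and emphasis are already gone at this point."""
    raise NotImplementedError("placeholder for a real ASR engine")

def translate_text(text: str, src: str, dst: str) -> str:
    """Hypothetical MT step: operates only on the transcript, so it cannot
    recover anything the transcript dropped (stress, pauses, sarcasm)."""
    raise NotImplementedError("placeholder for a real MT engine")

def synthesize_speech(text: str, lang: str) -> bytes:
    """Hypothetical TTS step: re-adds some prosody, but it is generated
    from the target text, not carried over from the original speaker."""
    raise NotImplementedError("placeholder for a real TTS engine")

def babelfish_roundtrip(audio_chunk: bytes, src: str, dst: str) -> bytes:
    # Each hop narrows the signal: audio -> text -> text -> audio,
    # with only the words surviving the middle stages.
    transcript = recognize_speech(audio_chunk)
    translated = translate_text(transcript, src=src, dst=dst)
    return synthesize_speech(translated, lang=dst)
```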
Real-time translation of speech is difficult even for human interpreters. They have to be aware of the context (in some cases they prepare specifically beforehand, depending on the topic) and make assumptions based on human experience. You'd need a computer that can do that, too.
I suspect that a babelfish built with current technology would work fine if you're listening to a prepared speech from a political leader, and poorly in actual social situations where people speak normally.
They are something like 95% right, 1 wrong in 20[*], which is pretty often. But in a face-to-face conversation, with so much non-verbal communication going on, it may be more than enough. Very different from cold, isolated translation, online or otherwise.
It's certainly far ahead of having no common language, or trying to look things up in a phrase book.
BTW, it's like Word Lens for audio (https://en.m.wikipedia.org/wiki/Word_Lens). Google bought it and made it free. I've tried it, and it works; it's very cool. But I don't know whether it's actually that useful, or even gets used much, in the field.
[*] EDIT: sorry, those figures were just for speech recognition!
> They are something like 95% right, 1 wrong in 20
Which language pair is this stat about? For example, with English and Hindi (they don't support it right now), even though Google Translate is impressive, there are still a lot of mistakes (many more than 1 in 20). For English to Spanish, I assume the error rate would be lower in comparison.
It isn't that much better. Admittedly I am only learning Spanish, but I get odd or incorrect translations far more often than 1 in 20, and because I am a beginner, I don't throw really difficult cases at it.
This is also text-only use, so it ignores things like accents (was that a p, v, or b?), which can make the whole thing even more difficult.
For other languages, like my native Slovenian, it is even more ridiculous.