For most voice control applications, trigger words are enough to reliably detect owner intent, but it seems Echo needs a better mechanism. Maybe adding cameras and looking for eye contact would work?
Wouldn't that kill part of the purpose if you had to eyeball the thing to give it voice commands.
Better might be to learn the location of audio producing devices (TV, radio, stereo, etc. [it tracks sound origin with multiple mics right?]) and track whether the command came from that direction and use that as a Bayesian factor for whether to trust the voice as being a user?