I'm not necessarily disagreeing with the thrust of your argument (do you really need to store all that?), but constraining your sample to people paid to talk to Alexa can introduce huge swathes of bias. You'd need to make sure the people you pay reflect all the accents and languages of the people who actually use Alexa. On top of that, without some amount of real voice data, how would you even know what that accent breakdown looks like? That's a near-impossible task.
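To make the problem concrete: matching paid speakers to the user base is essentially stratified quota sampling against a target accent distribution, and that target distribution is exactly the thing you can't estimate without the voice data in question. Here's a minimal sketch of the quota side, assuming entirely made-up accent categories and proportions just for illustration:

```python
# Minimal sketch: allocate a paid-speaker budget across accent strata
# in proportion to the (estimated) distribution of real users.
# The categories and shares below are hypothetical placeholders --
# estimating them accurately is the chicken-and-egg problem above.
from collections import Counter

# Assumed (made-up) share of each accent among actual users.
target_distribution = {
    "US General": 0.45,
    "US Southern": 0.15,
    "British": 0.10,
    "Indian English": 0.15,
    "Spanish-accented": 0.10,
    "Other": 0.05,
}

def allocate_quotas(n_paid_speakers: int, distribution: dict) -> Counter:
    """Split the paid-speaker budget proportionally across strata."""
    quotas = Counter()
    for accent, share in distribution.items():
        quotas[accent] = round(n_paid_speakers * share)
    return quotas

if __name__ == "__main__":
    # e.g. a budget of 200 paid speakers
    print(allocate_quotas(200, target_distribution))
```

The allocation itself is trivial; the hard (and arguably impossible) part is getting `target_distribution` right without already having recordings from the real user population.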
Machine learning? Do it with people paid to talk to Alexa.