Those don't really require textual matching, just regular audio fingerprinting. ...

toemetoch · on June 22, 2012

With audio fingerprinting, the content provider must provide a way to fingerprint its own audio and have access to fingerprints of the internet's audio/video. This means a partnership between, e.g., YouTube and a studio. I'm fairly sure this involves studios above a certain size, resources for programming and an API, and a fair bit of paperwork and robustness testing, as there are ways to mess with the technique.

With this technique you just enter a few words and look at what comes out.

You're suggesting that the first option is easier?

lt · on June 22, 2012

Yes. Not only easier, but more reliable. The examples you gave are perfectly static sound bits: they don't change. It doesn't make sense to transcribe them to text; just match the audio. SoundHound, Shazam, etc. do this easily. I'm pretty sure YouTube already has some kind of similar mechanism in place.

This technology gets a lot more interesting if you want to search for people talking about you or your products.
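
To make the "just match the audio" idea above concrete, here is a toy sketch of fingerprint matching. It is not how Shazam, SoundHound, or YouTube actually work (production systems typically hash spectral features and are far more robust); it assumes a much cruder fingerprint, one bit per frame recording whether the energy rose or fell, and every name in it (`fingerprint`, `best_match`, `demo`, `FRAME`) is made up for illustration.

```python
import math
import random

FRAME = 256  # samples per frame; an arbitrary choice for this sketch


def fingerprint(samples, frame=FRAME):
    """Return one bit per pair of adjacent frames: 1 if energy rose, else 0."""
    energies = [
        sum(x * x for x in samples[i:i + frame])
        for i in range(0, len(samples) - frame + 1, frame)
    ]
    return [1 if b > a else 0 for a, b in zip(energies, energies[1:])]


def best_match(clip_bits, ref_bits):
    """Slide the clip over the reference.

    Returns (offset, error), where error is the fraction of mismatched bits
    at the best offset. Returns (None, 1.0) if no comparison is possible.
    """
    n = len(clip_bits)
    best = (None, 1.0)
    if n == 0:
        return best
    for off in range(len(ref_bits) - n + 1):
        dist = sum(c != r for c, r in zip(clip_bits, ref_bits[off:off + n]))
        if dist / n < best[1]:
            best = (off, dist / n)
    return best


def demo():
    """Synthetic test: find a noisy excerpt inside a longer reference signal."""
    rng = random.Random(0)
    ref = []
    for _ in range(200):  # 200 frames, each with its own random loudness
        amp = rng.uniform(0.1, 1.0)
        start = len(ref)
        ref.extend(
            amp * math.sin(2 * math.pi * 440 * (start + k) / 8000)
            for k in range(FRAME)
        )
    # Excerpt frames 50..89 and add a little noise, as a re-upload might.
    clip = [x + rng.gauss(0, 0.01) for x in ref[50 * FRAME:90 * FRAME]]
    return best_match(fingerprint(clip), fingerprint(ref))


if __name__ == "__main__":
    offset, error = demo()
    print("best offset (frames):", offset, "mismatch fraction:", error)
```

The point is only that a static clip can be located by comparing compact fingerprints, with no transcription step. This toy assumes the excerpt is frame-aligned with the reference, so it would fall over on the kinds of tampering mentioned upthread (pitch shifts, speed changes, misaligned cuts).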