Given that OpenAI Whisper is open source now, and pretty near SOTA, I think creating an audio-only open-source version of this shouldn't be difficult. However, I don't know how to easily contextualise the audio - how would I search 'name of the movie I was discussing with Zeynep last week'?
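One rough way to get that kind of contextual search is to attach metadata (participants, timestamp) to each transcribed chunk, then filter on the metadata before matching the query words against the text. A minimal sketch in pure Python - all names and data here are invented for illustration, and a real system would want embeddings rather than keyword matching:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Chunk:
    text: str           # transcribed speech from Whisper
    participants: list  # who was in the conversation
    when: date          # when it was recorded

def search(chunks, query_words, participant=None, since=None):
    """Return chunks containing all query words, optionally filtered
    by participant and by a 'no older than' date."""
    results = []
    for c in chunks:
        if participant and participant not in c.participants:
            continue
        if since and c.when < since:
            continue
        lowered = c.text.lower()
        if all(w.lower() in lowered for w in query_words):
            results.append(c)
    return results

# Invented example data standing in for transcribed conversations.
chunks = [
    Chunk("we should watch that movie, Stalker, sometime",
          ["me", "Zeynep"], date.today() - timedelta(days=5)),
    Chunk("the movie was fine I guess",
          ["me", "Alex"], date.today() - timedelta(days=2)),
]

# "movie I was discussing with Zeynep last week"
hits = search(chunks, ["movie"], participant="Zeynep",
              since=date.today() - timedelta(days=7))
print(hits[0].text)
```

The hard part in practice is getting the participant metadata in the first place (speaker diarisation, calendar integration, etc.); the search itself is the easy half.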
Yes, it does seem within reach. Even the base.en Whisper model with just 74M params performs remarkably well on transcription (the large model has 1550M params!).
Compare a base.en Whisper transcription to a human transcription. This is the latest episode of the Ezra Klein Show, "A Powerful Theory of Why the Far Right Is Thriving Across the Globe," transcribed just now:
I got OpenAI Whisper running locally on my Mac but the plumbing to make it NOT tax system resources (like CPU) and to get it to work with search isn't trivial. It's on our roadmap.
You might find my inference implementation of Whisper useful [0]. It has a C-style API that allows for easy integration into other projects, and you can control how many CPU threads are used during processing.