I'm the founder of Willow[0] (we use ctranslate2 as well) and I will be looking at this as soon as the models are released tomorrow. HF claims they're drop-in compatible, but we won't know for sure until someone looks at it.
I have to say I love Willow, well done. It's a bit slow for me now because I'm not running recognition locally (as I'm sure many people aren't), but it will be fantastic news if this helps me offload recognition onto my NUC (i.e. CPU-only) and shave lots of ms off that way.
I'll be looking at this as soon as it is released tomorrow.
Separately, we have some Willow Inference Server improvements in the works that increase the speed of speech recognition on CPU by as much as 50% (depending on supported CPU instruction sets, etc.).
Between that, the performance we already have, and this work, it will be a dramatic improvement - even on CPU. I'm really looking forward to posting the benchmarks when all of this comes together!
That's the implication. If the distil models are in the same format as the original OpenAI models, then they can be converted for faster-whisper use as per the conversion instructions on https://github.com/guillaumekln/faster-whisper/
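If that holds, the conversion would presumably look like the existing faster-whisper recipe (shell sketch below; the repo name `distil-whisper/distil-large-v2` is an assumption on my part, since the checkpoints aren't published yet):

```shell
# Converter and runtime (ct2-transformers-converter ships with CTranslate2,
# which faster-whisper pulls in); transformers is needed to load the HF weights.
pip install faster-whisper "transformers[torch]"

# Convert the Hugging Face checkpoint to CTranslate2 format.
# --quantization float16 halves the weights; use int8 for CPU-only boxes.
ct2-transformers-converter \
    --model distil-whisper/distil-large-v2 \
    --output_dir distil-large-v2-ct2 \
    --copy_files tokenizer.json \
    --quantization float16
```

The resulting directory can then be passed to faster-whisper's `WhisperModel` in place of a stock model name - assuming the distil architecture really is drop-in compatible.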
So then we'll see whether we get the 6x model speedup on top of the stated 4x faster-whisper code speedup, at the same or nearly the same accuracy.
I would generally start with the assumption that if something is significantly faster, accuracy has to suffer a bit, but increasing model size and/or settings such as beam size to compensate should allow the same accuracy at higher performance (just not all of the stated performance gain).
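To make the beam-size trade-off concrete, here's a toy pure-Python sketch (not faster-whisper code; the step probabilities are invented for illustration). A wider beam keeps more hypotheses alive and can recover a globally better transcription that greedy decoding throws away at step one - at the cost of scoring more candidates per step:

```python
import math

# Toy 3-step "model": log-probs for each next token given the prefix.
# Deliberately constructed so the globally best path starts with the
# locally *worse* first token ("b").
STEP_LOGPROBS = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
    ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
    ("a", "x"): {"end": math.log(0.5)},
    ("a", "y"): {"end": math.log(0.5)},
    ("b", "x"): {"end": math.log(1.0)},
    ("b", "y"): {"end": math.log(1.0)},
}

def beam_search(beam_size, steps=3):
    """Return (best_path, best_logprob) after `steps` expansions."""
    beams = [((), 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for tok, lp in STEP_LOGPROBS[prefix].items():
                candidates.append((prefix + (tok,), score + lp))
        # Prune to the top `beam_size` hypotheses - the cost/accuracy knob.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

# beam_size=1 (greedy) commits to "a" (p=0.6) and ends at p=0.6*0.5*0.5=0.15;
# beam_size=2 keeps "b" alive and finds b->x->end at p=0.4*0.9*1.0=0.36.
greedy_path, greedy_score = beam_search(1)
wide_path, wide_score = beam_search(2)
```

The same knob exists in faster-whisper's `transcribe(..., beam_size=N)`; the point is that if a distilled model loses a little accuracy, part of it can often be bought back by widening the beam, spending back some of the speed gain.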
Because OpenAI focuses on putting out quality models. Efficient execution of ML models is another skill set entirely. Projects like CTranslate2 (which is what faster-whisper uses) are focused on fast model execution and work across all kinds of models from speech recognition to image and speech generation and everything in between.
Also because OpenAI benefits from a certain measure of inefficiency: it keeps the models from being easy for the masses to run without OpenAI in the loop, extracting money and compiling new training data from every inference users feed them.