Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?
As in many things, it's a combination of both. For example:
- We wanted no more than one recurrent layer, as it's a big bottleneck to parallelization.
- The recurrent layer should go "higher" in the network, since it propagates long-range context more effectively over the network's learned feature representations than over raw input values.
Other decisions are guided by a combination of trial and error and intuition. We started on much smaller datasets, which can give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
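The layout described above (three non-recurrent layers applied per frame, a single bidirectional recurrent layer near the top, then a per-frame output distribution) can be sketched roughly as a NumPy forward pass. This is an illustrative toy, not Baidu's implementation: the layer widths, weight initialization, and the simple additive fusion of forward and backward recurrent states are assumptions for the sketch, though the clipped ReLU `min(max(x, 0), 20)` does follow the DeepSpeech paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_relu(x):
    # Clipped rectifier used in the DeepSpeech paper: min(max(x, 0), 20)
    return np.minimum(np.maximum(x, 0.0), 20.0)

def simple_rnn(xs, w_in, w_rec, b, reverse=False):
    # A plain recurrent pass over a sequence of feature vectors.
    # This sequential loop is the parallelization bottleneck mentioned above.
    h = np.zeros(w_rec.shape[0])
    out = []
    for x in (xs[::-1] if reverse else xs):
        h = clipped_relu(x @ w_in + h @ w_rec + b)
        out.append(h)
    return np.array(out[::-1] if reverse else out)

# Hypothetical sizes, chosen small for illustration.
d_in, d_h, d_out, T = 16, 32, 10, 5
dense_ws = [rng.normal(0, 0.1, s) for s in [(d_in, d_h), (d_h, d_h), (d_h, d_h)]]
w_in = rng.normal(0, 0.1, (d_h, d_h))
w_fwd = rng.normal(0, 0.1, (d_h, d_h))
w_bwd = rng.normal(0, 0.1, (d_h, d_h))
w_out = rng.normal(0, 0.1, (d_h, d_out))

frames = rng.normal(0, 1, (T, d_in))  # stand-in for spectrogram frames

# Layers 1-3: non-recurrent, applied independently to each frame,
# so they parallelize trivially across time.
h = frames
for w in dense_ws:
    h = clipped_relu(h @ w)

# Layer 4: the single bidirectional recurrent layer, placed "high" so it
# sees learned features rather than raw input values.
h4 = (simple_rnn(h, w_in, w_fwd, np.zeros(d_h)) +
      simple_rnn(h, w_in, w_bwd, np.zeros(d_h), reverse=True))

# Layer 5: per-frame softmax over output symbols (e.g. characters).
logits = h4 @ w_out
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.shape)  # one distribution per input frame
```

The point of the structure is visible in the sketch: only layer 4 contains a sequential dependency across time, so keeping exactly one recurrent layer minimizes the serial portion of the computation.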
Any chance of releasing the training data you used? Also what are the plans with DeepSpeech? Just for use by baidu or will it be released as open source or a developer api service?
For a single utterance, it's fast enough that we can produce results in real time. Of course, building a production system for millions of users might require just a bit more engineering work...