Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?
As in many things, it's a combination of both. For example:
- We wanted no more than one recurrent layer, as it's a big bottleneck to parallelization.
- The recurrent layer should go "higher" in the network, since it propagates long-range context more effectively over the network's learned feature representations than over raw input values.
Other decisions are guided by a combination of trial and error and intuition. We started on much smaller datasets, which can give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
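The layout described above (three non-recurrent layers applied per frame, a single bidirectional recurrent layer near the top, then a per-frame output distribution) can be sketched roughly as a NumPy forward pass. This is an illustrative toy, not Baidu's implementation: the layer widths, weight initialization, and the simple additive fusion of forward and backward recurrent states are assumptions for the sketch, though the clipped ReLU `min(max(x, 0), 20)` does follow the DeepSpeech paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def clipped_relu(x):
    # Clipped rectifier used in the DeepSpeech paper: min(max(x, 0), 20)
    return np.minimum(np.maximum(x, 0.0), 20.0)

def simple_rnn(xs, w_in, w_rec, b, reverse=False):
    # A plain recurrent pass over a sequence of feature vectors.
    # This sequential loop is the parallelization bottleneck mentioned above.
    h = np.zeros(w_rec.shape[0])
    out = []
    for x in (xs[::-1] if reverse else xs):
        h = clipped_relu(x @ w_in + h @ w_rec + b)
        out.append(h)
    return np.array(out[::-1] if reverse else out)

# Hypothetical sizes, chosen small for illustration.
d_in, d_h, d_out, T = 16, 32, 10, 5
dense_ws = [rng.normal(0, 0.1, s) for s in [(d_in, d_h), (d_h, d_h), (d_h, d_h)]]
w_in = rng.normal(0, 0.1, (d_h, d_h))
w_fwd = rng.normal(0, 0.1, (d_h, d_h))
w_bwd = rng.normal(0, 0.1, (d_h, d_h))
w_out = rng.normal(0, 0.1, (d_h, d_out))

frames = rng.normal(0, 1, (T, d_in))  # stand-in for spectrogram frames

# Layers 1-3: non-recurrent, applied independently to each frame,
# so they parallelize trivially across time.
h = frames
for w in dense_ws:
    h = clipped_relu(h @ w)

# Layer 4: the single bidirectional recurrent layer, placed "high" so it
# sees learned features rather than raw input values.
h4 = (simple_rnn(h, w_in, w_fwd, np.zeros(d_h)) +
      simple_rnn(h, w_in, w_bwd, np.zeros(d_h), reverse=True))

# Layer 5: per-frame softmax over output symbols (e.g. characters).
logits = h4 @ w_out
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.shape)  # one distribution per input frame
```

The point of the structure is visible in the sketch: only layer 4 contains a sequential dependency across time, so keeping exactly one recurrent layer minimizes the serial portion of the computation.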
Any chance of releasing the training data you used? Also what are the plans with DeepSpeech? Just for use by baidu or will it be released as open source or a developer api service?
For a single utterance, it's fast enough that we can produce results in real time. Of course, building a production system for millions of users might require just a bit more engineering work...