
I should add that I had the opportunity to work on this project and am happy to answer questions.


Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?


As in many things, it's a combination of both. For example:

- We wanted no more than one recurrent layer, as it's a big bottleneck to parallelization.

- The recurrent layer should sit "higher" in the network: it propagates long-range context more effectively over the network's learned feature representations than over the raw input values (sketched below).

Other decisions are guided by a combination of trial-and-error and intuition. We started on much smaller datasets, which give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and the other hyperparameters.
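To make the layout concrete, here's a rough sketch in PyTorch. This is not our actual code; the cell type, layer widths, feature count, and output alphabet are placeholders. The point is the shape: several frame-parallel layers feeding a single recurrent layer placed high in the stack.

    import torch
    import torch.nn as nn

    class SpeechNet(nn.Module):
        """Illustrative 5-hidden-layer stack: three feedforward
        layers, one bidirectional recurrent layer, one more
        feedforward layer, then per-frame character scores."""

        def __init__(self, n_features=160, n_hidden=2048, n_chars=29):
            super().__init__()
            # Layers 1-3: non-recurrent, so every frame of the
            # utterance can be processed in parallel.
            self.ff = nn.Sequential(
                nn.Linear(n_features, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            )
            # Layer 4: the single recurrent layer, placed high in
            # the stack so it sees learned features rather than
            # raw input. (GRU here is a placeholder cell choice.)
            self.rnn = nn.GRU(n_hidden, n_hidden, batch_first=True,
                              bidirectional=True)
            # Layer 5: feedforward, then project to characters.
            self.out = nn.Sequential(
                nn.Linear(2 * n_hidden, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_chars),
            )

        def forward(self, x):   # x: (batch, time, n_features)
            h = self.ff(x)      # frame-parallel layers
            h, _ = self.rnn(h)  # the one sequential bottleneck
            return self.out(h).log_softmax(dim=-1)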


Any chance of releasing the training data you used? Also, what are the plans for DeepSpeech? Is it just for use by Baidu, or will it be released as open source or as a developer API service?


How much latency does the system have in the best, worst, and average cases? And is your implementation public?


For a single utterance, it's fast enough that we can produce results in real time. Of course, building a production system for millions of users might require just a bit more engineering work...
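For context, "real time" means the real-time factor (processing time divided by audio duration) stays below 1.0. A generic way to measure it (the transcribe argument below is a stand-in for any inference call, not our API):

    import time

    def real_time_factor(transcribe, audio, audio_seconds):
        """An RTF below 1.0 means results are produced
        faster than the audio plays back."""
        start = time.perf_counter()
        transcribe(audio)
        return (time.perf_counter() - start) / audio_seconds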



