
Being a grad student, this person is at a perfect place to build up their math background. Any school almost certainly offers the following:

1. Convex Optimization -- not all problems are convex, but solutions for nonconvex problems end up primarily using convex methods with slight adaptations.

2. Stochastic Optimization -- ML is pretty much all stochastic optimization. No surprise there.

3. Statistical/Theoretical Machine Learning -- courses built around concentration bounds, PAC learnability, and the Valiant/Vapnik school of thought. This gives you what you need to talk about generalizability and sample complexity.

4. Numerical Linear Algebra -- being smart about linear algebra is most of efficient machine learning: knowing which kinds of factorizations help you solve problems efficiently. Can you do a Gram-Schmidt orthogonalization (i.e., a QR factorization)? Cholesky decomposition? LU factorization? When do these things fail? When do you benefit from sparse representations? (There's a quick sketch of one failure mode after this list.)

5. Graphical Models -- Markov chains, Markov fields, causal relationships, HMMs, factor graphs, forward-backward algorithm, sum-product algorithms.
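To make the "when do these things fail" question in item 4 concrete, here's a rough numpy/scipy sketch (the matrices are made up purely for illustration): Cholesky is the cheap option but only applies to symmetric positive definite systems, while LU keeps working on an indefinite one.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

    b = np.array([1.0, 2.0])

    # Symmetric positive definite: Cholesky works (and costs about half an LU).
    spd = np.array([[4.0, 1.0],
                    [1.0, 3.0]])
    c, low = cho_factor(spd)
    print(cho_solve((c, low), b))

    # Symmetric but indefinite: Cholesky fails, LU still solves the system.
    indef = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
    try:
        cho_factor(indef)
    except np.linalg.LinAlgError as err:
        print("Cholesky failed:", err)
    print(lu_solve(lu_factor(indef), b))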

If you're in school, take advantage of the fact that you're in school.

Once you have a grasp on these things (and you'll have to catch up on real analysis, matrix calculus, and a few other fields of math), you'll be able to start reasoning about ways to improve existing methods or come up with your own. I think a lot of it is just developing mathematical maturity to give you a vocabulary to think about things with.



Alright, so I do research in and produce optimization tools professionally. From my biased point of view, someone is better off learning generic optimization theory and algorithms rather than specialized versions like convex or stochastic optimization. Generally speaking, generic nonlinear optimization methods form the foundation for everything and then there's a series of tricks to specialize the algorithm.

Very specifically, there are two books that I think provide a good foundation. First is Convex Functional Analysis by Kurdila and Zabarankin. Not many people know about this book, but it essentially provides a self-contained background to prove the Generalized Weierstrass Theorem, which gives the conditions necessary for the existence and/or uniqueness of a solution to an optimization problem. This is important because even convex problems don't necessarily have a minimum. For example, min exp(-x) doesn't have one, but it does have an infimum. The background necessary to understand this book is real analysis, and as a quick aside I think Rudin's Principles of Mathematical Analysis is the best for this. Second is Nocedal and Wright's Numerical Optimization book. It provides a good overview of the powerful algorithms in optimization that we should be using. Now, its weakness is that it often cheats and uses a stronger constraint qualification than we're afforded in practice. Candidly, I find that the derivative of the constraints will not remain full rank, so we will likely violate the LICQ. Further, it covers a number of algorithms that really shouldn't be used in practice, ever. That said, it does cover the good algorithms, and it generally has the best presentation of the books in this area.
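To see the existence point concretely, here's a throwaway scipy sketch of min exp(-x) (purely illustrative): the problem is convex and bounded below by 0, but the reported "solution" is just wherever the solver happens to give up, because the infimum is never attained.

    import numpy as np
    from scipy.optimize import minimize

    # min exp(-x): convex and bounded below by 0, but no minimizer exists.
    f = lambda x: np.exp(-x[0])

    for x0 in [0.0, 10.0, 100.0]:
        res = minimize(f, [x0], method="BFGS")
        print(f"start={x0:6.1f}  x={res.x[0]:10.2f}  f(x)={res.fun:.2e}")

    # The iterates march to the right until the gradient tolerance kicks in;
    # the objective approaches its infimum of 0 but never reaches it.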

Sadly, I don't know of any killer books for numerical linear algebra. And, yes, I've read cover to cover things like Saad's Iterative Methods for Sparse Linear Systems, Trefethen and Bau's Numerical Linear Algebra, and Golub and van Loan's Matrix Computations. They're valuable and well-written, but don't quite cover what I end up having to do to make linear algebra work on my solvers.

Anyway, this is all biased opinion, so take from it what you will. If someone else has some favorite references for optimization or numerical linear algebra, I'd love to hear them.


There are a lot of practical difficulties with your proposal.

A CS/AI PhD program is nothing like a Math/AMath/Stat PhD. In the latter, there is zero expectation that you will start producing anything in the first few semesters. In fact, it is explicitly required of you to load up on rigorous core courses, so that you can pass your prelims at the end of 2nd year and become a formal “PhD candidate”. The attrition rate in those programs is about 40% or more, so a lot of these people simply find out they don’t have the math maturity, drop out with a Masters at the end of 2 years, get a job and call the whole thing off.

So in the latter case, yes, such a person can follow your guidelines. In fact, most of the material you listed is called AMC (applied math core) or CACM or some other abbrev... and is already taught as part of the core.

Now, the former case is where this particular student is. CS PhD programs are a sort of weird beast in the US. They are housed in engineering, not liberal arts. The expectation is to produce papers right out of the gate. At least lightweight papers, posters, something... you cannot coast for 2 years saying you are learning convex optimization and stochastic calculus. So if you read the material that comes out of students in that phase, it tends to be of low quality and heavy on empirical evidence (I tested 7 functions on 3 datasets and these 2 came first, here are the charts and graphs). As the student matures, his papers gather more heft, and by the time he graduates the final 2-3 papers will be very good... at least that's the expectation. Reality again is quite bleak and results are all over the place. Because of the hectic hiring climate, lots of CS grads will just take an ABD and get the hell out. 150k starting plus RSUs is nothing to sneeze at. The ones who do finish tend to take a full 4+ years and are in the teens as a percentage of the incoming cohort :(

Also, you are not really permitted to sign up for whatever you want just because you have math deficiencies. Your advisor will have to sign off on each semester's load. He has to ensure you are on track, not just following your own whimsical path into theoretical math because you fancy it. CS core is quite distinct from AMath and touches upon the material you mentioned only superficially. All in all, this student is between a rock and a hard place. It's not going to be easy for him at all if he truly wants to understand everything. Best bet is to do what the top reddit advice says: pick some narrow corner where you are comfy, write 2-3 papers in that corner, get the hell out, and learn the rest later on your own time in your research career.


I can only speak from experience in my CS PhD program, but my above recommendations are based directly on that experience.

We have an exam after the first two years and a similar process. It depends on what your advisor expects/wants; mine has been flexible. And I focused more on efficient software engineering and applications of prior methods than on new research as I got up to speed on other matters. It made obvious a lot of ways I could improve them, just by being forced to look at and implement all of the details under the hood.

And, at least in my CS department, there is a very heavy emphasis on mathematics with the ML/AI folks. They coauthor a number of papers with the applied math department and the rest of their papers are mostly proofs. They'll usually back it up with proof-of-concept implementations, but in that regard, they're very much like researchers in applied math except that they use Python instead of MATLAB.


I think this is way too much for a pure CS person. It is not likely they will make a big contribution on the math side without being a mathematician first, e.g., an applied mathematician moving into CS.

For ML, the OP already has linear algebra, which is sufficient. Deep neural networks are trained with backprop, which is basically high school math. You could have mentioned ODEs and sensitivity analysis, which I think are more relevant than convex optimization. For NNs we don't even care about identifiability, from both the statistics and dynamical systems points of view. NNs blow away SVMs and almost everything except for random forests in some domains. Both of these have the interesting property that nobody understands them except as black boxes, for the most part. Boosting is another example. It really is stranger than fiction.
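To be concrete about the "high school math" claim, here's a toy sketch (one hidden unit, made-up numbers): backprop is just the chain rule applied step by step, followed by a gradient update.

    import numpy as np

    # Toy network: y_hat = w2 * tanh(w1 * x), squared-error loss.
    x, y = 0.5, 1.0
    w1, w2, lr = 0.3, -0.2, 0.1

    for step in range(1000):
        h = np.tanh(w1 * x)            # forward pass
        y_hat = w2 * h
        loss = 0.5 * (y_hat - y) ** 2

        d_yhat = y_hat - y             # dL/dy_hat
        d_w2 = d_yhat * h              # chain rule, one factor per step
        d_h = d_yhat * w2
        d_w1 = d_h * (1 - h ** 2) * x  # tanh'(u) = 1 - tanh(u)^2

        w1 -= lr * d_w1                # gradient descent update
        w2 -= lr * d_w2

    print(f"loss={loss:.6f}  w1={w1:.3f}  w2={w2:.3f}")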

That being said, I think statistics/probability theory and Bayesian stats/networks are useful to know for any scientist.

I would talk to your advisor about what to do. They will be able to advise on what's important and what to learn/focus on.


> Boosting is another example. It really is stranger than fiction.

Is this true? Boosting is pretty well formulated in the PAC framework and the classical algorithms (e.g. Adaboost) are well-characterized.


You're correct. Boosting was directly formulated in the PAC framework.

(Source: http://l2r.cs.uiuc.edu/Teaching/CS446-17/LectureNotesNew/boo... "The original boosting algorithm was proposed as an answer to a theoretical question in PAC learning [The Strength of Weak Learnability; Schapire, 89.]")
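And the classic algorithm itself is tiny. Here's a rough AdaBoost sketch with decision stumps as the weak learner (the structure and names are mine, not from any particular source):

    import numpy as np

    def adaboost(X, y, n_rounds=20):
        """Minimal AdaBoost with threshold stumps; labels y must be in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)                  # example weights
        stumps, alphas = [], []
        for _ in range(n_rounds):
            best = None
            for j in range(d):                   # brute-force the best stump
                for t in np.unique(X[:, j]):
                    for s in (1, -1):
                        pred = s * np.where(X[:, j] <= t, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, t, s, pred)
            err, j, t, s, pred = best
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            w *= np.exp(-alpha * y * pred)       # misclassified points get heavier
            w /= w.sum()
            stumps.append((j, t, s))
            alphas.append(alpha)
        return stumps, alphas

    def predict(stumps, alphas, X):
        score = sum(a * s * np.where(X[:, j] <= t, 1, -1)
                    for (j, t, s), a in zip(stumps, alphas))
        return np.sign(score)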

It took a while, but there's been a lot of work over the last 5 years or so explaining neural nets' performance, from papers showing PAC learnability for specific architectures (https://arxiv.org/abs/1710.10174) to work saying that most local optima are close to global optima (http://www.offconvex.org/2016/03/22/saddlepoints/), to work saying that the optimization error incurred (as separate from approximation and estimation errors) serves as a form of regularization for deep neural networks.

And understanding how these things work helps improve and speed up these methods and models: it's hybrid algorithms which are enabling performance in time-series data and more complex tasks. The future will nearly certainly use neural networks as part of many algorithms, but I doubt that the full machinery will be simple feed-forward nets of ever-increasing sizes.


This would’ve literally been my answer. Brilliant. I can’t recommend this answer enough.


I finished my masters, but one problem is that this is over half the courses in a 10-course masters program, which generally only allows for 4 total electives. As a CS grad, I felt handcuffed by the required core courses and the other 'select 2 of these 4' requirements, which took up 6 courses in the major. The remaining 4 courses were electives, and 3 were allowed outside of CIS. Those courses were mostly offered in the same semester, and would also have conflicts with required core courses.

Another problem is that most data scientist positions are filled by statisticians who will be giving you the job interview. Almost all of the questions will be around stats. I personally feel a mastery of those courses would be great, but they would also not help me land a job, because improving LDA to run on small text input by using a variational autoencoder doesn't help me recite the formula for a t-test.


What courses (starting from pre-calculus) should one take to do what you listed above? I want to match your recommendations to course titles starting with pre-calculus. List book recommendations as well if you would. Thanks!


I guess baby Rudin (or Hubbard & Hubbard for something simpler) in the analysis department; and Halmos (or Axler) in the linear algebra department.

This is, essentially, Math 55. All 4 books have been used at different stages in this famous course.


Halmos seems to discuss the same things as Hoffman and Kunze, which is the more “standard” and recommended book. Nevertheless, after these you will still have to read up on multilinear algebra (tensors and determinant-like functions) as well as the numerical side of linear algebra.


Convex: Bertsekas - Convex Optimization Theory, Convex Optimization Algorithms. Nesterov - Lecture Notes (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693...)

Statistical/Theoretical: Shai Shalev-Shwartz & Shai Ben-David's Understanding Machine Learning (http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning...) and Mohri's Foundations of Machine Learning (https://cs.nyu.edu/~mohri/mlbook/)

The two above courses could share SSS's Online Learning text (https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf). To be fair, the stochastic variants of most optimization algorithms can be learned reasonably quickly from a statistical machine learning / basic optimization background. There's also the option of Spall's Intro to Stochastic Search and Optimization, which covers neural networks, reinforcement learning, annealing, MCMC, and a wide variety of other applications and techniques. (http://www.jhuapl.edu/ISSO/)

Similar to what kxyvr said, I also don't know of any killer linear algebra text, which is why I think a course is so useful. The matrix cookbook is helpful along the way. kxyvr is also entirely right that general nonlinear optimization is important -- though perhaps less indispensable. (Going the other way, the Bertsimas linear optimization textbook I've had for years mostly gathers dust.)

For PGMs: I got Predicting Structured Data back when it was new (https://mitpress.mit.edu/books/predicting-structured-data), but I think that Chris Bishop's treatment in PRML is easier to follow. He has some lecture slides which expand on it quite well. (https://www.microsoft.com/en-us/research/people/cmbishop/)

Bishop would also be my go-to intro ML book over Murphy.

I can't in fairness offer recommendations for the rest of the intermediate undergraduate math texts because I took them so long ago, but I can say that I have benefited from reviewing the MIT OCW courses from time to time.


Great, thanks!



