
It's a little funny to think of a polyhedron (like the set of doubly stochastic matrices) as a manifold. The point here is that we're equipping the interior of this set with a certain "Riemannian metric" and using it to turn gradients into tangent vectors.

I thought the formulas for updates in this case were pretty opaque, so let's think of a simpler example. Say we want to optimize a function f(x_1, ..., x_n) over the simplex

\Delta = \{ (x_1, ..., x_n) : x_1 + ... + x_n = 1, x_i >= 0 \}.

We can imagine that we're allocating a limited resource into n different bins. At each optimization step, say we have numbers g_i telling us the partials of f with respect to the x_i. (Somewhat more generally, these numbers could tell us how much we "regret" having allocated our resource into the ith bin.) How can we use this information to change our choices of x_i?

One option would be to take vanilla gradient descent steps. That is, we'd choose some small eta, set x_i := x_i - eta g_i, and then adjust these assignments (say, by projecting back onto the simplex) to get a valid allocation. Another method, which tends to be more natural and do better in practice, is to set x_i := x_i exp(-eta g_i) and then divide each x_i by (x_1 + ... + x_n). (This is called the Hedge method, and "mysteriously" also looks just like a Bayesian update for a distribution of belief over n possibilities.) This is the flavor of method being used in the post to optimize over doubly stochastic matrices.
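
To make this concrete, here's a minimal Python sketch of the two updates (the function names and the toy gradient are just for illustration, and the Euclidean projection is only one of several ways to "adjust" the vanilla step back onto the simplex):

    import numpy as np

    def project_to_simplex(v):
        # Euclidean projection of v onto { x : x_1 + ... + x_n = 1, x_i >= 0 },
        # using the usual sort-based O(n log n) algorithm.
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        ind = np.arange(1, len(v) + 1)
        cond = u - (css - 1) / ind > 0
        rho = ind[cond][-1]
        theta = (css[cond][-1] - 1) / rho
        return np.maximum(v - theta, 0.0)

    def vanilla_step(x, g, eta):
        # Gradient step followed by projection back onto the simplex.
        return project_to_simplex(x - eta * g)

    def hedge_step(x, g, eta):
        # Multiplicative (exponentiated-gradient / Hedge) step, then renormalize.
        y = x * np.exp(-eta * g)
        return y / y.sum()

    x = np.full(4, 0.25)                 # start from the uniform allocation
    g = np.array([1.0, 0.0, -1.0, 0.5])  # toy "regret" signal
    print(vanilla_step(x, g, eta=0.5))   # can zero out coordinates
    print(hedge_step(x, g, eta=0.5))     # stays strictly positive

The difference near the boundary is the point: the projected step can slam coordinates to exactly zero, while the Hedge step only ever shrinks them multiplicatively.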

I've studied differential geometry but still have the impression that the proximal operator/mirror descent point of view is a little more natural in optimization/learning. Gesticulating wildly about Levi-Civita connections is fun, but e.g. to optimize over the simplex it makes more sense to just ask: how should we regularize our updates so they're not too "imprudent"? (Of course this depends on what problem you're solving!) One common story is that you want to avoid neglecting bins until you're very sure they're worthless. To do this, you can use a metric that makes your manifold look larger as you approach a vertex of the simplex (and then make sure you can do the necessary computations and find a good retraction map!), or you can think about which proximal operator/Bregman divergence to use to regularize your updates. The paper behind this post uses the Fisher metric, which is the infinitesimal version of the KL divergence, and regularizing your updates by KL gives you things like the Hedge method/Bayesian updates.
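
For what it's worth, that last connection is a short computation (sketched here in the same informal notation; it's the standard mirror descent calculation, not anything specific to the linked paper). The KL-regularized proximal step is

x^{new} = argmin over x in \Delta of [ eta (g_1 x_1 + ... + g_n x_n) + KL(x || x^{old}) ], where KL(x || y) = \sum_i x_i log(x_i / y_i).

Introducing a Lagrange multiplier lambda for the constraint x_1 + ... + x_n = 1 and setting the derivative with respect to x_i to zero gives

eta g_i + log(x_i / x_i^{old}) + 1 + lambda = 0, i.e. x_i = x_i^{old} exp(-eta g_i) exp(-1 - lambda),

and choosing lambda so the coordinates sum to 1 is exactly the normalization step, so x_i^{new} = x_i^{old} exp(-eta g_i) / \sum_j x_j^{old} exp(-eta g_j): the Hedge update. (The x_i >= 0 constraints never bind, because the log term blows up at the boundary.)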


