The point of Metropolis-Hastings is to sample from a distribution when you do not know the partition function. It is one of the most important building blocks in a set of algorithms broadly known as Markov Chain Monte Carlo. These algorithms are particularly useful when performing Bayesian statistics.
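To make that concrete, here is a minimal random-walk Metropolis-Hastings sketch in Python (my own toy illustration; the names are made up, not from any library). The key point is that the acceptance ratio only ever uses the density up to a constant, so the unknown partition function cancels:

    import numpy as np

    # Toy random-walk Metropolis-Hastings. log_unnorm is any log-density known
    # only up to an additive constant; the normalizing constant Z cancels in
    # the acceptance ratio, which is the whole point.
    def metropolis_hastings(log_unnorm, x0, n_steps=10_000, step=0.5, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        x = x0
        samples = np.empty(n_steps)
        for i in range(n_steps):
            proposal = x + step * rng.standard_normal()       # symmetric proposal
            log_alpha = log_unnorm(proposal) - log_unnorm(x)  # Z cancels here
            if np.log(rng.random()) < log_alpha:              # accept or reject
                x = proposal
            samples[i] = x
        return samples

    # e.g. sample from a density proportional to exp(-x^2 / 2)
    draws = metropolis_hastings(lambda x: -0.5 * x ** 2, x0=0.0)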
Genetic algorithms will not give you samples from a distribution; they only perform optimization. Particle swarms also focus on optimization, and on top of that, they do not seem to have either theoretical justification or empirical success.
MH is embarrassingly parallel, since you can run multiple independent chains at the same time. Again, the point isn't optimization (that would be simulated annealing) but sampling.
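As a rough sketch of what I mean by embarrassingly parallel (again my own toy code, reusing the metropolis_hastings sketch above), the chains share no state, so you can farm them out to separate processes and just pool the draws afterwards:

    from concurrent.futures import ProcessPoolExecutor

    def log_target(x):
        return -0.5 * x ** 2            # unnormalized log-density, Z unknown

    def run_chain(start):
        return metropolis_hastings(log_target, x0=start, n_steps=5_000)

    if __name__ == "__main__":
        starts = [-5.0, -1.0, 1.0, 5.0]
        with ProcessPoolExecutor() as pool:             # one process per chain
            chains = list(pool.map(run_chain, starts))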
Being 100 years old is also largely irrelevant. People publish new algorithms all the time to get publications; that does not mean the new ones necessarily outperform the old ones. Gradient descent is the basic algorithm used to train all of these cool new deep learning models, and it's much older than MH.
Yes, there are more recent improvements to MH, the two biggest ones being Hamiltonian Monte Carlo (which uses gradient information) and Parallel Tempering (which is somewhat similar to homotopy optimization), but that's hardly a reason to dismiss the importance of this algorithm.
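For what it's worth, the parallel-tempering idea is simple enough to sketch (toy code of mine, not anyone's library): chains target tempered copies of the same unnormalized log-density, and adjacent temperatures occasionally propose to swap states:

    import numpy as np

    # Chains indexed by inverse temperature beta target p(x)^beta. A swap of
    # states between adjacent temperatures is accepted with the usual
    # Metropolis ratio, which reduces to the expression below.
    def tempered_swap(states, log_unnorm, betas, rng):
        i = rng.integers(len(betas) - 1)
        j = i + 1
        log_alpha = (betas[i] - betas[j]) * (log_unnorm(states[j]) - log_unnorm(states[i]))
        if np.log(rng.random()) < log_alpha:
            states[i], states[j] = states[j], states[i]
        return states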
"The point of Metropolis-Hastings is to sample from a distribution when you do not know the partition function."
That's one point, yes. The other is optimization, in which case I prefer the alternatives I mentioned.
"they do not seem to have either theoretical justification or empirical success."
That's just patently false. They _do_ have empirical success. And besides, a lot of these variants have MH implemented inside of them to some extent o.o
"MH is embarrassingly parallel since you run multiple chains at the same time."
I can make six single grilled cheese sandwiches in 2 minutes, but it takes me 8 minutes to make a 6-decker grilled cheese. Is that a parallel process?
"Being 100 years old is also largely irrelevant."
In my opinion it is relevant, since computation can now be done in completely different ways than it could 100 years ago.
1) Can you point out where particle swarm optimization has been successfully applied? Papers specifically about particle swarm optimization carry very little weight, as it is very easy to design toy problems where any optimization technique is going to perform well; I'm looking for actual, practical use.
2) When you are sampling a distribution, you're not trying to make a 6-decker grilled cheese, you're trying to make many grilled cheese sandwiches.
3) In completely new ways? Not really. Algorithmic complexity dominates run time independent of the computing medium. An algorithm designed to be efficiently run by a group of human "computers" with calculators is probably very similar to the same algorithm designed to be run by a CPU. If anything, the CPU-optimized algorithm is likely to benefit from more sequential processing and less parallelism.
2) You are trying to find the global maximum. How do you not understand the value of communication when searching for a maximum in the likelihood distribution? You're just being intentionally obtuse.
3) Yes, really.
A story:
You have a landscape with mountains and hills and you have one person trying to find the tallest mountain. That person is blind, they can't see shit. That person is also mute and deaf. Their only sense is a vibrating altimeter. They get drunk, and climb mountains for 100000 days, trying to find the tallest mountain.
You are advocating the idea that you should send 1000 of these blind deaf mutes out there one at a time (running these in parallel is just faster serial) and then they should vote on which mountain is the tallest at the end.
I'm saying you should send a bunch of not-deaf-mutes out there (ie implement mutation, breeding, cross-contamination, gravitation, whatever) so they can tell each other where the stupid mountains are (this requires _parallel_) and they don't waste their whole time stumbling around (burning in).
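To be concrete about what "telling each other where the mountains are" looks like, here is the standard particle-swarm update as a toy sketch (my own illustration): each searcher is pulled toward its own best point so far and toward the best point anyone in the swarm has found:

    import numpy as np

    def pso_step(pos, vel, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
        # pos, vel, pbest: arrays of shape (n_particles, n_dims); gbest: (n_dims,)
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # inertia + pull toward each particle's own best + pull toward the swarm's best
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        return pos + vel, vel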
HN does not allow downvoting replies to one's own comments, so someone else must be downvoting you. Your tone is aggressive and you display a poor grasp of the topic, which is probably why you're being downvoted.
1) This is an abstract of a PhD thesis which makes no mention of particle swarms, I couldn't find the full text. This is your evidence?
2) No, this isn't about looking for the global maximum. Several people have explained to you that this is a sampling algorithm, but you still fail to show understanding of the difference.
3) Your story doesn't demonstrate your point, and you do not understand the argument I presented to you. The types of algorithms suited for an army of people with calculators 100 years ago aren't fundamentally different from the types of algorithms suited for computers today, and if anything, they're likely to be more sequential, not less.