It's ignorant to say that attention or convolution comes anywhere close to the expressivity of RNNs (LSTMs). Instead of reading a few cherry-picked results from a model with extremely tuned hyperparameters, pick any random set of tasks and experiment yourself. In my experience, LSTMs always perform better than everything else unless you brute-force search over hundreds of hyperparameter configs.
> It's ignorant to say that attention or convolution comes anywhere close to the expressivity of RNNs (LSTMs).
But that's exactly the point of the Transformer model, whose paper is aptly titled "Attention Is All You Need" [1]. And the BERT architecture, based on this idea, seems to be doing well. Its authors claim it is very flexible, too [2].
Maybe that's what you meant by "unless you brute-force search over hundreds of hyperparameter configs", but then again, isn't that what NNs are about anyway?
The success of Transformers aside, I'm not sure you should rely on paper titles for anything, lest we forget papers like "One Model To Learn Them All" [1].