That's not at all what Marcus is saying. He admits that it does remarkably well, but says (1) it's still not trustworthy, and (2) this version is not much better than the previous version. Both points support his claim that scaling alone is never going to lead to general AI.
https://news.ycombinator.com/item?id=44278811
I think you're absolutely right about this being a wider problem though.