I'm still honestly confused here. Yes, the general success of neural networks shows they do more than memorize. Some of the experiments here show they can also just memorize, and I guess some describe something in between.
But all this seems obvious. Is something actually being quantified here?
You can measure the generalization gap: the difference between training and test performance. With good generalization that gap is small; when the model fits random labels, it is large.
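To make that concrete, here is a minimal sketch of measuring the gap with the true labels and with shuffled ones. It uses scikit-learn's MLPClassifier on synthetic data; the dataset, model size, and hyperparameters are illustrative assumptions, not the setup from the experiments being discussed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def generalization_gap(train_labels):
    # A wide hidden layer so the network has enough capacity to fit noise.
    net = MLPClassifier(hidden_layer_sizes=(512,), max_iter=2000, random_state=0)
    net.fit(X_tr, train_labels)
    # Gap = accuracy on the labels it was trained on, minus accuracy on the
    # held-out set (scored against the true test labels in both runs).
    return net.score(X_tr, train_labels) - net.score(X_te, y_te)

rng = np.random.RandomState(0)
print("gap, true labels:   %.3f" % generalization_gap(y_tr))
print("gap, random labels: %.3f" % generalization_gap(rng.permutation(y_tr)))
```

With the true labels the gap should come out small; with shuffled labels the network can still drive training error down, but test accuracy stays near chance, so the gap is large.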
Classical statistical learning theory (from Vapnik and Chervonenkis, among others) predicts that low-capacity models will generalize, i.e. have small generalization gaps. It makes no promises for high-capacity models, i.e. those that can fit random labels. Yet here we are.
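For reference, one standard form of the VC bound (constants differ across statements of it) says that, with probability at least $1 - \delta$ over the draw of $n$ training samples, every hypothesis $f$ in a class of VC dimension $d$ satisfies

$$
R(f) \;\le\; \hat{R}_n(f) \;+\; \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) + \ln\frac{4}{\delta}}{n}},
$$

where $R$ is the test risk and $\hat{R}_n$ the training risk, so the square-root term bounds the generalization gap. Once $d$ is comparable to or larger than $n$, as it is for networks big enough to fit random labels, the bound exceeds one and guarantees nothing.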