I would be interested to know whether there is any effect on time to convergence between the two setups (i.e. training with real labels vs. random labels). Is it in any way easier or harder to memorize everything vs. extracting a generalizable representation?

Edit: Ah, I see they mention they investigated not only non-convergence but also whether training slows down. So randomising/corrupting the labels does impact the convergence rate. That means it takes more work to memorize things, which I guess is interesting.
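For anyone curious what that comparison looks like in practice, here's a rough sketch of the kind of experiment I had in mind. To be clear, this is not the paper's setup (they trained Inception-style models on CIFAR-10/ImageNet); it's just a tiny MLP on an MNIST subset with hyperparameters I picked arbitrarily. The idea is to train to (near-)interpolation once with the true labels and once with uniformly random ones, and count the epochs each takes:

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def epochs_to_fit(randomize_labels, max_epochs=100, target_acc=0.99):
    # Small MNIST subset so full memorization is feasible on a CPU.
    ds = datasets.MNIST("data", train=True, download=True,
                        transform=transforms.ToTensor())
    xs = ds.data[:5000].float().div(255).view(-1, 784)
    ys = ds.targets[:5000].clone()
    if randomize_labels:
        # Replace every label with a uniformly random class,
        # analogous to the paper's "random labels" condition.
        ys = torch.randint(0, 10, (len(ys),))

    model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(1, max_epochs + 1):
        perm = torch.randperm(len(xs))
        for i in range(0, len(xs), 128):
            idx = perm[i:i + 128]
            opt.zero_grad()
            loss_fn(model(xs[idx]), ys[idx]).backward()
            opt.step()
        with torch.no_grad():
            train_acc = (model(xs).argmax(1) == ys).float().mean().item()
        if train_acc >= target_acc:
            return epoch  # epochs needed to (nearly) interpolate the data
    return None  # did not fit within the epoch budget

print("real labels:  ", epochs_to_fit(randomize_labels=False))
print("random labels:", epochs_to_fit(randomize_labels=True))
```

If the paper's finding holds at this toy scale, the random-label run should report a noticeably larger epoch count, which is one concrete way to see the "memorizing takes more effort" slowdown.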