A lot of open source licenses impose several requirements, attribution at the very least. So when Copilot barfs out someone's code verbatim, or nearly verbatim, after having memorized it, and doesn't say whose code it's reproducing, suddenly there's a problem.
The user who uploaded the code did provide attribution. They could have uploaded that code as part of vendoring a dependency. There is no flag that says they are not the copyright holder.
Copilot trains on that code just the same.
More importantly, the fair use they are claiming seems to bypass attribution and licensing requirements. It only needs access to the code.
That is to say that it can 100% train on code hosted elsewhere. Until this fair use gets challenged by lawsuits, it will hold true.
The fact that we have not heard of any such lawsuits yet shows, at least to me, that lawyers agree it is legal.
Just saying what you want to be true doesn't make it so. Attribution is required under every license that demands it. Microsoft is laundering intellectual property.
A license describes how you want to pass on the rights that copyright gives you. AFAICT Microsoft believes they are within the fair use granted by copyright law itself.
If they are correct, they don't have to follow individual license terms. If they are incorrect, it makes no difference whether they scrape the code or receive it, because any complex project is a mix of many past authors, not all of whom are necessarily on GitHub.
Unfortunately, they're not correct, since it sometimes barfs out code verbatim and gives no attribution. So again, they can say whatever they want, it doesn't make it so.
Just because it hasn't been tested in court yet doesn't mean it's legal, my friend. It's an easy fix: ask each dev for permission, or provide attribution as per each license when the same code comes out of the model. Easy peasy.