We just took the top-starred Java projects from GitHub, assuming that their source code and naming are of good quality.
These included projects like elasticsearch, hadoop-common and maven.
We assumed that, because these are popular and actively maintained projects, their code and naming quality are worth learning from, and that the learned patterns will transfer to other code and projects. So, as you can imagine, FizzBuzz is not that common there :)
You could probably dig up a pretty large number of high-quality teaching repos to supplement those projects. I'm thinking of the types of repos that are included as supplements for MOOCs or reference books. Then you'll get some of the canonical gimmicky algorithms like FizzBuzz. But more importantly, you will get reference implementations of fundamental algorithms, things like mergesort or binary tree search. While you are unlikely to see a straightforward implementation of those algorithms in a production repo, it wouldn't surprise me to see some of the core abstract patterns repeated over and over again, because they are so fundamental to CS pedagogy. And if you pick sufficiently high-quality repos, you (hopefully) won't be compromising the code quality metrics driving your selection of those top-starred repos.
While I understand that, how can this work be used to, say, analyze novice/student coding behaviors? Furthermore, does selecting these GitHub repos bias code2vec toward specific design patterns?
To play devil's advocate, could I just run code2vec across different repos to see if they are "real enough"?
I like the library and just printed out the paper to read. Since my focus is more on learning to program, this sounds like a tool I'd love to be able to use for analysis, but I'm being told no.