We just took the top-starred Java projects from GitHub, assuming that their sour...

stult · on Dec 5, 2018

You could probably dig up a pretty large number of high-quality teaching repos to supplement those projects. I'm thinking of the types of repos that are included as supplements for MOOCs or reference books. Then you'll get some of the canonical gimmicky algorithms like fizzbuzz. But more importantly, you will get reference implementations of fundamental algorithms, things like mergesort or binary tree search. While you are unlikely to see a straightforward implementation of those algorithms in a production repo, it wouldn't surprise me to see some of the core abstract patterns repeated over and over again because they are so fundamental to CS pedagogy. And if you pick sufficiently high-quality repos, you (hopefully) won't be compromising the code quality metrics driving your selection of those top-starred repos.

jpfed · on Dec 5, 2018

Consider also Rosetta Code, which should have Java implementations for a ton of simple problems.

tsumnia · on Dec 5, 2018

While I understand that, how can this work be used to, say, analyze novice/student coding behaviors? Furthermore, does selecting those GitHub repos bias code2vec toward specific design patterns?

To play devil's advocate, could I just run code2vec across different repos to see if they are "real enough"?

I like the library and just printed out the paper to read, but since my focus is more on learning to program, this sounds like a tool I'd love to be able to use for analysis, but I'm being told no.