One directory per package is completely sensible, just not all in one bunch. It's even fine if the mapping is to a flat namespace at something like the HTTP level - I can mod_rewrite /abcdefg to /a/b/c/abcdefg no problem. My only objection is to file- or directory-level structures that are this flat. I might be mentally deficient, but I can't even process anything that's structured this way.
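The /abcdefg → /a/b/c/abcdefg rewrite is just a prefix mapping, so it can be sketched in a few lines; this is a minimal illustration (the function name and `depth` parameter are my own, not from any real tool):

```python
def shard_path(name: str, depth: int = 3) -> str:
    """Map a flat package name to a prefix-sharded path,
    e.g. 'abcdefg' -> 'a/b/c/abcdefg' with depth=3.
    Names shorter than depth just use fewer levels."""
    prefix = "/".join(name[:depth])
    return f"{prefix}/{name}"
```

The same mapping works equally well as an mod_rewrite rule at the HTTP level or as an on-disk layout, which is the point: the flat and sharded views are interchangeable.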
As loath as I am to admit anything about Perl is good, CPAN got this right. 161k packages by 12k authors, grouped by A/AU/AUTHOR/Module. That even gives you the added bonus of authorship attribution. Debian splits in a similar way as well, /pool/BRANCH/M/Module/ and even /pool/BRANCH/libM/Module/ as a special case.
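The CPAN A/AU/AUTHOR/Module scheme is a two-level prefix shard on the author ID; a quick sketch (function name is mine, and the example author/tarball are purely illustrative):

```python
def cpan_author_path(author: str, module: str) -> str:
    """CPAN-style layout: first letter, first two letters, then
    the full author ID, then the distribution file, e.g.
    ('ANDK', 'Foo-1.0.tar.gz') -> 'A/AN/ANDK/Foo-1.0.tar.gz'."""
    return f"{author[0]}/{author[:2]}/{author}/{module}"
```

With ~12k authors, two levels of fan-out keep every directory small enough that no single listing ever gets pathological.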
Tooling can be considered part of the problem in this case. Because the tooling hides the implementation, nobody (in the project) noticed just how bad it was. I hadn't seen modern FS performance on something of this scale; apparently everything I've worked with has been either much smaller or much larger. Ext4 (and I assume HFS+) is crazy-fast for either `ls -l` or `find` on that repo.
It seems like tooling is part of the solution as well, but from the `git` side. Having "weird" behavior in a tool that's so integral to so many projects scares me a little, but it's awesome that GitHub has (and uses) enough resources to identify and address such weirdness.
My (perhaps naive) thought on this: suppose a 16k-packages-in-one-directory layout were just as fast as 16k packages sharded by prefix (the CPAN solution); then the former is conceptually simpler and so should be preferred. And since you can mechanically transform one structure into the other, the filesystem (or git) should be able to do it for you transparently (e.g. use the sharded approach as a hidden implementation while the end user sees a flat directory). This seems similar to what ext4 does (https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Hash...).
The obvious question is how you would implement that. You might argue (as you should) that git's semantics are closer to a filesystem's than to a version control system's, but actually implementing this sharding would require git to be a kernel module. Hardlinks and symlinks won't save you: they are both still dentries and thus have the same performance pathology. Maybe you could do it with FUSE, but then what have you gained, beyond making your version control system even more annoying to use?