Lack of will.
That was one of the main results of Whitaker's 2020 survey.
Making your code reusable and easy to understand is significant work that has no direct benefit for a researcher's career, particularly because research code grows wildly as researchers keep trying things.
Working on the next paper is seen as the better choice.
Moreover, if your code is easy for others to run, you're likely to be hit with support requests, or even open yourself to the risk of someone finding errors in your code (the survey's result, not my own belief).
There are other issues, of course. Being able to run the code doesn't mean the result is replicable; science is replicated when studies are repeated independently by many teams.
There are many other failure modes: SOTA-hacking, benchmark gaming, and a lack of rigorous analysis of results, for example. And that's ignoring data leakage and other sillier mistakes (which still happen in published work, even in very good venues).
Authors don't do much of anything to disabuse readers of the possibility that they simply got really lucky with their pseudorandom number generators during initialization, shuffling, etc. As long as it beats SOTA, who cares whether it is actually a meaningful improvement? Of course, doing multiple runs with a decent bootstrap to estimate the average behavior is often really expensive and really slow, and deadlines are always tight. There is also the matter that the field converged on an experimentation methodology that isn't actually correct: once you start reusing test sets, your experiments stop being approximations of a random sampling process, and you quickly find yourself outside the guarantees provided by statistical theory (a similar sort of mistake to the one scientists in other fields make when interpreting p-values). There be dragons out there; statistical demons might come to eat your heart, or your network could converge to an implementation of NetHack.
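To make the "multiple runs with a decent bootstrap" point concrete, here is a minimal sketch of a percentile bootstrap over per-seed test scores. The scores list is entirely made up for illustration; in practice each value would come from retraining the model with a different random seed, which is exactly the expensive part.

```python
import random

# Hypothetical test accuracies from 10 training runs with different seeds.
scores = [0.712, 0.698, 0.731, 0.705, 0.720, 0.689, 0.715, 0.708, 0.702, 0.726]

def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of run scores."""
    rng = random.Random(seed)
    # Resample the runs with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / len(values), (lo, hi)

mean, (lo, hi) = bootstrap_mean_ci(scores)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

A paper reporting the interval rather than a single lucky run would at least show whether the claimed improvement over SOTA clears the noise floor of seed variance.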
Scale also plays into that, of course, and use of private data as the other comment mentioned.
Ultimately, machine learning research is just too competitive and moves too fast. There are tens of thousands (hundreds of thousands, maybe?) of people all working on closely related problems, all rushing to publish their results before someone else publishes something that overlaps too much with their own work. Nobody is going to be as careful as they should be, because they can't afford to. It's more profitable to carefully find the minimal publishable unit of work and do that, splitting a result into several small papers you can pump out every few months. The first thing that tends to get sacrificed in that process is reliability.
While I never measured it, this aligns with my own experiences.
It's better to have very shallow conversations where you regenerate outputs aggressively and only keep the best results. Asking for fixes, restructuring, or elaborations on generated content has fast-diminishing returns. And once a model has made a mistake (or hallucinated), it will not stop erring even if you provide evidence that it is wrong; LLMs just commit to certain things very strongly.
I largely agree with this advice, but in practice, using Claude Code / Codex 4+ hours a day, it's not always that simple. I have a .NET/React/Vite webapp that, despite the typical stack, has a lot of very specific business logic for a real-world niche (plus some poor early architectural decisions that are being gradually refactored under well-documented rules).
I frequently see both agents make wrong assumptions, and it inevitably takes multiple turns of failing before they recognize the correct solution.
There can be a kind of magnetic pull where, no matter how you craft the initial instructions, they will both independently have a (wrong) epiphany and ignore half of the requirements during implementation. It takes messing up once or twice for them to accept that their deep intuition from the training data is wrong and pivot. In those cases I find it takes less time to let that process play out than to recraft the perfect one-shot prompt over and over. Of course, once we've moved on to a different problem, I dump that context ASAP.
(However, what is cool about working with LLMs, to counterbalance the petty frustrations that sometimes make it feel like a slog, is that they have extremely high familiarity with the jargon and conventions of that niche. I was expecting to have to explain a lot of the weird, too-clever-by-half abbreviations in the legacy VBA code from 2004 it has to integrate with, but it picks up on pretty much every little detail without explanation. It's always a fun reminder that they were created to be super-translators, even within the same language: from jargon -> business logic -> code that kinda works.)
Idk about interviewing, but there are many benefits to posting fake job listings (gathering a database of people, keeping track of who is looking for jobs, etc.), which is why people do it. Data is valuable.
That is more or less what I fear. If the top 10 percent already account for half of all consumer spending, and inequality keeps getting worse and worse, that's probably where it will end up.
If someone is working but still needs welfare, then the state is just subsidizing company payrolls by indirect means. I strongly disagree that gig work is fine as long as there is welfare.
If a given person's labor is of poor enough quality that its value is not enough to provide whatever counts as a reasonable quality of life in the circumstances, then adding a UBI or other welfare payment is not just subsidizing employers.