This is something one can look into further. It is really probabilistically checkable proofs underneath: we naturally look for the places where the work needs to look right, and use those spot checks as a basis for assuming the whole thing was done right.
It's nice to see such a wide array of discussions under this! Glad I didn't give up on this thought and ended up writing it down.
I want to stress that the main point of my article is not really about AI coding; it's about letting AI perform arbitrary tasks reliably. Coding is an interesting case because it's a place where we can exploit structure, abstraction, and approaches like TDD to make verification simpler - it's like spot-checking in places with a very low soundness error.
I'm encouraging people to look at tasks other than coding to see if we can find similar patterns. The more of these cost asymmetries we can find (where verifying is easier than doing), the more we can harness AI's real potential.
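To make that asymmetry concrete, here is a minimal sketch in Python (the function name and example are mine, purely illustrative): verifying that an output is a correctly sorted permutation of its input takes a few lines and a couple of linear passes, no matter how sophisticated the code that produced it was.

```python
from collections import Counter

def is_valid_sort(original: list[int], claimed: list[int]) -> bool:
    """Check a claimed sort result without redoing the work.

    Verification is cheap: one pass for ordering, one multiset
    comparison for the permutation property. The producer (human
    or AI) can be arbitrarily complex; the check stays the same.
    """
    in_order = all(a <= b for a, b in zip(claimed, claimed[1:]))
    same_elements = Counter(original) == Counter(claimed)
    return in_order and same_elements

# Spot-check an AI-produced result instead of reading its code:
assert is_valid_sort([3, 1, 2], [1, 2, 3])
assert not is_valid_sort([3, 1, 2], [1, 2, 2])
```

This cheap-checker/expensive-producer split is exactly the shape that PCP-style arguments formalize.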
Note that in the case of coding, there is an entire branch of computer science dedicated to verification.
All the type systems (and model-checkers) for Rust, Ada, OCaml, Haskell, TypeScript, Python, C#, Java, ... are based on such research, and these are all rather weak in comparison to what research has created in the last ~30 years (see Rocq, Idris, Lean).
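For a taste of what those stronger systems buy you, here is a minimal Lean 4 sketch (a standard textbook lemma, not anyone's production code). The theorem statement is the specification; the file only compiles if the proof is complete, so there is nothing left to spot-check:

```lean
-- Machine-checked in Lean 4: list append is associative.
-- If this compiles, the property holds for *all* lists;
-- no test suite or reviewer judgment is involved.
theorem append_assoc' {α : Type} (xs ys zs : List α) :
    (xs ++ ys) ++ zs = xs ++ (ys ++ zs) := by
  induction xs with
  | nil => rfl
  | cons x xs ih => simp [ih]
```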
This goes beyond that: some of these mechanisms have been applied to mathematics, but also to some aspects of finance and law (I know of at least some efforts to formally prove implementations of banking contracts and tax rules correct).
So there is lots to do in the domain. Sadly, like every branch of CS other than AI (and in fact pretty much every branch of science other than AI), this one is underfunded. But that can change!
Considering how useful I've found AI at finding and fixing bugs in proportion to the effort I put in, I question your claim that it's underfunded. While I have learned languages like Idris, in the end I was never able to use them practically to reduce bugs in the software I was writing, unlike AI. It's possible that the funding directed toward these kinds of languages is actually distracting people from more practical solutions, which could mean the field is overfunded with respect to program verification.
I think it's definitely an interesting subject for verification engineering: the more precisely we can task AI to do work, the easier it is to check that work.
Yup. Codebase structure for agents is a rabbit hole I've spent a lot of time going down. The interesting thing is that it's mostly the same structure that humans tend to prefer, with a few tweaks: agents like smaller files/functions (more precise reads/edits), strongly typed functional programming, doc-comments with examples and hyperlinks to additional context, smaller directories with semantic subgroups, long/distinct variable names, etc.
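As a hedged illustration of that style (the names and the doc path are invented for the example), an "agent-friendly" function tends to be small, strongly typed, and self-describing:

```python
from decimal import Decimal

def convert_minor_units_to_decimal_amount(
    amount_in_minor_units: int,
    minor_units_per_major_unit: int = 100,
) -> Decimal:
    """Convert an integer amount in minor units (e.g. cents) to a Decimal.

    Example:
        >>> convert_minor_units_to_decimal_amount(1999)
        Decimal('19.99')

    See docs/money-handling.md for why we never use floats here.
    """
    return Decimal(amount_in_minor_units) / Decimal(minor_units_per_major_unit)
```

The long, distinct names cost humans a little typing but give an agent unambiguous grep targets, and the doc-comment example doubles as a doctest.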
This feels a lot like the "humans must be ready at any time to take over from FSD" that Tesla is trying to push. With presumably similar results.
If it works 85% of the time, how soon do you catch that it is moving in the wrong direction? Are you having a standup every few minutes for it to review (and edit) its work with you? Are you reviewing hundreds of thousands of lines of code every day?
It feels a bit like pouring cement or molten steel really fast: at best it works, and you get things done way faster. Get it just a bit wrong, and your work is all messed up, with a lot of collateral damage besides. But I guess if you haven't shipped yet, it's OK to start over? How many different respins can you keep in your head before they all blend together?
A large percentage (at least 50%) of the market for software developers will shift to lower paid jobs focused on managing, inspecting and testing the work that AI does. If a median software developer job paid $125k before, it'll shift to $65k-$85k type AI babysitting work after.
It's funny that I heard exactly this when I graduated university in the late 2000s:
> A large percentage (at least 50%) of the market for software developers will shift to lower paid jobs focused on managing, inspecting and testing the work that outsourced developers do. If a median software developer job paid $125k before, it'll shift to $65k-$85k type outsourced developer babysitting work after.
As an industry, we've been doing the same thing to people in almost every other sector of the workforce, since we began. Automation is just starting to come for us now, and a lot of us are really pissed off about it. All of a sudden, we're humanitarians.
This argument is common and facile: Software development has always been about "automating ourselves out of a job", whether in the broad sense of creating compilers and IDEs, or in the individual sense that you write some code and say: "Hey, I don't want to rewrite this again later, not even if I was being paid for my time, I'll make it into a reusable library."
> the same thing
The reverse: What pisses me off is how what's coming is not the same thing.
Customers are being sold a snake-oil product, and its adoption may well ruin things we've spent careers de-crappifying by making them consistent and repeatable and understandable. In the aftermath, some portion of my (continued) career will be diverted to cleaning up the lingering damage from it.
Nah, sounds like management, but I am repeating myself. In all seriousness, I have found myself having to carefully rein in some similar decisions. I don't want to get into details, but there are times I wonder if they understand how things really work, or if people need some 'floor'-level exposure before they just decree stuff.
Yes, but not like what you think. Programmers are going to look more like product managers with extra technical context.
AI is also great at looking for its own quality problems.
Yesterday, on an entirely LLM-generated codebase:
Prompt: > SEARCH FOR ANTIPATTERNS
Found 17 antipatterns across the codebase:
And then what followed was a detailed list: about a third of them I thought were pretty important, a third were arguably issues, and the rest were either not important or effectively "this project isn't fully functional".
As an engineer, I didn't have to find code errors or fix code errors; I had to pick which errors were important and then give instructions to have them fixed.
> Programmers are going to look more like product managers with extra technical context.
The limit of product manager, as the "extra technical context" approaches infinity, is programmer. Because the best, most specific way to specify that extra technical context is just plain old code.
Yeah, don't rely on the LLM finding all the issues. Complex code like Swift concurrency tooling is just riddled with issues. I usually need to increase to 100% line coverage and then let it loop on hanging tests until everything _seems_ to work.
(It’s been said that Swift concurrency is too hard for humans as well though)
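The "loop on hanging tests" part can be mechanized. Here is a minimal sketch (in Python for neutrality; the test command, timeout, and run count are placeholders for whatever your suite needs) that reruns the suite under a timeout so hangs surface as failures instead of stalling the loop:

```python
import subprocess

TEST_COMMAND = ["swift", "test"]  # placeholder; substitute your suite's runner
TIMEOUT_SECONDS = 120             # anything slower is treated as a hang
MAX_RUNS = 20                     # rerun to catch flaky or racy hangs

for run in range(1, MAX_RUNS + 1):
    try:
        result = subprocess.run(TEST_COMMAND, timeout=TIMEOUT_SECONDS)
    except subprocess.TimeoutExpired:
        print(f"run {run}: suite hung past {TIMEOUT_SECONDS}s; likely a deadlock")
        break
    if result.returncode != 0:
        print(f"run {run}: failures; feed the output back to the LLM")
        break
else:
    print(f"all {MAX_RUNS} runs passed; everything _seems_ to work")
```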
I don't trust programmers to find all the issues either, and in several shops I've been in, "we should have tests" was a controversial argument.
A good software engineering system built around the top LLMs today is definitely competitive in quality with a mediocre software shop, and 100x faster and 1000x cheaper.
I'm hoping this can introduce a framework to help people visualize the problem and figure out a way to close that gap. Image generation is something everyone can verify, but code generation is perhaps not. But if we can make verifying code as effortless as verifying images (not saying it's possible), then our productivity can enter the next level...
Oh, I mean the other direction! Checking whether a generated image is "good" - that no one will notice something is off and it looks natural - rather than checking whether it is fake.
Oh, interesting thoughts! Let me digest and get back to you. Overall, I just defined a read API description that says "read content from this gist file by calling this API; it returns the content of a text file, formatted as CSV, with the first row defining the columns," and a write API description that is similar.
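For concreteness, the read side described above might look roughly like this (the gist URL is a placeholder, and this is a sketch of the shape, not the actual Action definition): fetch the raw file over HTTP and let the first CSV row define the column names.

```python
import csv
import io
import urllib.request

# Placeholder: substitute the raw URL of your own gist file.
GIST_RAW_URL = "https://gist.githubusercontent.com/<user>/<gist-id>/raw/data.csv"

def read_gist_csv(url: str = GIST_RAW_URL) -> list[dict[str, str]]:
    """Fetch a gist's CSV content; the first row defines the columns."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))
```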
I develop on a paid GPT-4o plan (it allows me to create custom GPTs), but users can be on the free GPT-4o tier to use it (although that limits image uploads to a handful per day). Free-tier GitHub, as always.