By using JSON mode, GPT-4o has been able to do this reliably for months (100k+ calls).
The 50% drop in price for inputs and 33% for outputs vs. the previous 4o model is huge.
It also appears to be topping various benchmarks: ZeroEval's leaderboard on Hugging Face [0] shows that it beats even Claude 3.5 Sonnet on CRUX [1], a code reasoning benchmark.
Shameless plug: I'm the co-founder of Double.bot (YC W23). After seeing the leaderboard above, we added it to our copilot for anyone to try for free [2]. We try to add all new models the same day they are released.
Sonnet 3.5 has a lot of alignment issues. It refused many times to answer simple coding questions I asked, just because it considered them "unsafe". 4o is much more relaxed.
Regarding math, though, Sonnet is a bit better than 4o.
When we first launched, the tool was very manual; you had to generate each step via the UI. We then added a "Loop Creator agent" that now builds Loops for you without intervention. Over the past few months we've mostly been fixing feature gaps and improving the Loop Creator.
Based on recent user feedback, we've put a few things in motion:
- Form generator (for manual loops)
- Chrome extension (for local automations)
- In-house Google Sheets integration
- Custom outputs (charts, tables, etc.)
- Custom Blocks (shareable with other users)
With these improvements, you'll be able to create "single page apps" like this one I made for my wife's annual mango tasting party[0].
In addition to those features, we're also launching a new section for Loop templates + educational content/how-tos, in an effort to help people get started.
To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.
My mother-in-law is the President of the Central Florida Fruit Society, and is in charge of sourcing mangoes for their annual party. She sends us all the excess mangoes :)
As I understand it, this year's mangoes mostly came from Merritt Island, as there was some not-so-great weather in southern Florida.
The devops group is good, but the main one (Orlando Devs, aka odevs) is the one I was thinking of. Covid messed things up quite a bit, and the individual groups are still getting going again.
> To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.
That blows my mind! You have users paying for it and it only has a 25% success rate for loops created by users? I've been working for about a year on an LLM-based product and haven't launched yet because only 50-60% of my test cases are passing.
> only has a 25% success rate for loops created by users.
Not quite! Most of our paying users use Magic Loops to build automations semi-manually, oftentimes requiring some back-and-forth with the Loop Creator at the start and then further prompt iterations as the Loop progresses.
The 25% statistic is for the Loop Creator "agent" (now 32% after some HN traffic :tada:), which measures single-shot success: one user prompt in -> one working Loop out from the Creator.
The number is much higher for Loops with multiple iterations and almost 100% for manually created Loops.
tl;dr - the power of the tool isn't the Loop Creator "agent"; it's the combination of code+LLMs that makes building one-off workflows super fast and easy.
We also get pretty reliable JSON output (on a smaller scale though) even without JSON mode. We usually don't use JSON mode because we often include a chain of thought part in <brainstorming> and then ask for JSON in <json> tags. With some prompt engineering, we get over 98% valid JSON in complex prompts (with long context and modestly complex JSON format). We catch the rest with json5.loads, which is only used as a fallback if json.loads fails.
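The extraction itself is simple; roughly something like this (a sketch: the tag name matches what we prompt for, the function itself is illustrative):

    import json
    import re

    import json5  # only used as a fallback when strict parsing fails

    def extract_json(reply: str):
        # Pull the payload out of the <json>...</json> block the prompt asks for;
        # the <brainstorming> part is ignored on purpose.
        match = re.search(r"<json>(.*?)</json>", reply, re.DOTALL)
        payload = match.group(1) if match else reply
        try:
            return json.loads(payload)
        except json.JSONDecodeError:
            # json5 tolerates trailing commas, single quotes, comments, etc.
            return json5.loads(payload)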
4o-mini has been less reliable for us particularly with large context. The new structured output might make it possible to use mini in more situations.
The linked article includes a section on this, under “Separating a final answer from supporting reasoning or additional commentary”. They suggest defining a JSON schema with a reasoning field and an answer field.
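Sketched as a plain dict (the field names follow the article's suggestion; everything else here is illustrative), that schema is roughly:

    # Schema that separates supporting reasoning from the final answer.
    # Listing "reasoning" before "answer" nudges the model to write its
    # reasoning first; structured outputs also wants every field required
    # and additionalProperties set to false.
    answer_schema = {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},  # chain of thought / commentary
            "answer": {"type": "string"},     # the part you actually use
        },
        "required": ["reasoning", "answer"],
        "additionalProperties": False,
    }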
Had the same experience with function calling—we get much better results simply asking for JSON. With simple schemas (basically dictionaries), gpt-4 and 4o are basically bulletproof.
We first built Magic Loops with GPT-4, about a year ago, well before JSON mode was a thing.
We had to do a bunch of extra prompting to make it work, as GPT would often include backticks or broken JSON (most commonly extra commas). At the time, YAML was a much better approach.
Thankfully we've been able to remove most of these hacks, but we still use a best effort JSON parser[0] to help stream partial UI back to the client.
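The core trick of a best-effort parse is just closing whatever is still open in the partial stream and retrying a normal parse. A simplified sketch of that idea (illustrative Python, not the parser we actually use):

    import json

    def parse_partial_json(fragment: str):
        # Track unclosed strings/brackets in the partial stream.
        closers = []
        in_string = False
        escaped = False
        for ch in fragment:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                closers.append("}")
            elif ch == "[":
                closers.append("]")
            elif ch in "}]":
                if closers:
                    closers.pop()
        # Close everything that is still open, innermost first, then retry.
        repaired = fragment + ('"' if in_string else "") + "".join(reversed(closers))
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None  # wait for more of the stream

    # e.g. parse_partial_json('{"title": "Mango party", "steps": [{"name": "inv')
    # -> {'title': 'Mango party', 'steps': [{'name': 'inv'}]}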
Technically yes, but it would require reverse-engineering some of our APIs.
Practically speaking, we have quite a few use-cases where users call Loops from other Loops, so we're investigating a first-class API to generate Loops in one go.
As with regular software engineering, what you put in is what you get out, so we've been hesitant to launch this given the current state of LLMs/the Loop Creator: it would fail more often than not.
highlights-text --> Right-Click --> New ML --> (smart dropdown for watch [price|name|date|{typed-in-prompt-instructions}]) --> TAB --> (smart frequency - tabbing through {watch blah}, and it's auto-filling "every N") --> NAME_ML=ML01.
THEN:
highlights-text --> Right-Click .... WHEN {ML01} == N DO {this|ML0X} --> ML00
ML00 == EMAIL|CSV|GDrive results.
ML11 == Graph all the above outputs.
:-)
--
A MasterLoop would be good - where you have all [public or private] loops register - and then you can route logic based on loops that exist - and since it's promptable - it can summarize and suggest logic when weaving loops into cohesive lattices of behavior. And if loops can subscribe to the output of other loops -- when you're looking for certain output strands -- you can say:
Find all the MLB loops daily and summarize what they say about only the Dodgers and the Giants - and keep a running table for the pitchers' and catchers' stats only.
EDIT: Aside from just subscribing, maybe a loop can #include# the loop in its behavior to accomplish its goal/assimilate its function / mate/spawn. :-)
> where you have all [public or private] loops register - and then you can route logic based on loops that exist - and since its promptable
This is actually where we started :)
We had trouble keeping all the context in via RAG and the original 8k token window, but we're aiming to bring this back in the (hopefully near) future.
We use GPT-4o to build dynamic UI+code[0], and almost all of our calls are using JSON mode. Previously it mostly worked, but we had to do some massaging on our end (backtick removal, etc.).
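The massaging was mostly stripping the markdown fences the model liked to wrap JSON in before parsing; roughly something like this (an illustrative sketch, not our actual code):

    import json
    import re

    def parse_model_json(raw: str):
        # Strip a ```json ... ``` (or bare ```) fence if the model added one,
        # then parse normally.
        text = raw.strip()
        fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
        if fenced:
            text = fenced.group(1)
        return json.loads(text)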
With that said, this will be great for GPT-4o-mini, as it often struggles/forgets to format things as we ask.
Note: we haven't had the same success rate with function calling compared to pure JSON mode, as function calling seems to add a level of indirection that can reduce the quality of the LLM's output. YMMV.
Anyhow, excited for this!
[0]https://magicloops.dev