By using JSON mode, GPT-4o has been able to do this reliably for months (100k+ calls).
The 50% drop in price for inputs and 33% for outputs vs. the previous 4o model is huge.
It also appears to be topping various benchmarks: ZeroEval's leaderboard on Hugging Face [0] shows that it beats even Claude 3.5 Sonnet on CRUX [1], a code reasoning benchmark.
Shameless plug: I'm the co-founder of Double.bot (YC W23). After seeing the leaderboard above, we added it to our copilot for anyone to try for free [2]. We try to add all new models the same day they are released.
Sonnet 3.5 has a lot of alignment issues. It refused many times to answer simple coding questions I asked, just because it considered them "unsafe". 4o is much more relaxed.
Regarding math, though, Sonnet is a bit better than 4o.
When we first launched, the tool was very manual; you had to generate each step via the UI. We then added a "Loop Creator agent" that now builds Loops for you without intervention. Over the past few months we've mostly been fixing feature gaps and improving the Loop Creator.
Based on recent user feedback, we've put a few things in motion:
- Form generator (for manual loops)
- Chrome extension (for local automations)
- In-house Google Sheets integration
- Custom outputs (charts, tables, etc.)
- Custom Blocks (shareable with other users)
With these improvements, you'll be able to create "single page apps" like this one I made for my wife's annual mango tasting party[0].
In addition to those features, we're also launching a new section for Loop templates + educational content/how-tos, in an effort to help people get started.
To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.
My mother-in-law is the President of the Central Florida Fruit Society, and is in charge of sourcing mangoes for their annual party. She sends us all the excess mangoes :)
As I understand it, this year's mangoes mostly came from Merritt Island, as there was some not-so-great weather in southern Florida.
The devops group is good, but the main one (Orlando Devs, aka odevs) is the one I was thinking of. Covid messed things up quite a bit, and the individual groups are still getting going again.
> To be super candid, the Loop Creator has been a pain. We started at an 8% success rate and we're only just now at 25%. Theoretically we should be able to hit 80%+ based on existing loop requests, but we're running into limits with the current state of LLMs.
That blows my mind! You have users paying for it and it only has a 25% success rate for loops created by users? I've been working for about a year on an LLM-based product and haven't launched yet because only 50-60% of my test cases are passing.
> only has a 25% success rate for loops created by users.
Not quite! Most of our paying users use Magic Loops to build automations semi-manually, oftentimes requiring some back-and-forth with the Loop Creator at the start and then further prompt iterations as the Loop progresses.
The 25% statistic is for the Loop Creator "agent" (now 32% after some HN traffic :tada:), which measures single-shot success: one user prompt in -> one working Loop out from the Creator.
The number is much higher for Loops with multiple iterations and almost 100% for manually created Loops.
tl;dr - the power of the tool isn't the Loop Creator "agent"; it's the combination of code+LLMs that makes building one-off workflows super fast and easy.
We also get pretty reliable JSON output (on a smaller scale though) even without JSON mode. We usually don't use JSON mode because we often include a chain of thought part in <brainstorming> and then ask for JSON in <json> tags. With some prompt engineering, we get over 98% valid JSON in complex prompts (with long context and modestly complex JSON format). We catch the rest with json5.loads, which is only used as a fallback if json.loads fails.
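The extraction itself is simple; roughly something like this (a sketch: the tag name matches what we prompt for, the function itself is illustrative):

    import json
    import re

    import json5  # only used as a fallback when strict parsing fails

    def extract_json(reply: str):
        # Pull the payload out of the <json>...</json> block the prompt asks for;
        # the <brainstorming> part is ignored on purpose.
        match = re.search(r"<json>(.*?)</json>", reply, re.DOTALL)
        payload = match.group(1) if match else reply
        try:
            return json.loads(payload)
        except json.JSONDecodeError:
            # json5 tolerates trailing commas, single quotes, comments, etc.
            return json5.loads(payload)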
4o-mini has been less reliable for us particularly with large context. The new structured output might make it possible to use mini in more situations.
The linked article includes a section on this, under “Separating a final answer from supporting reasoning or additional commentary”. They suggest defining a JSON schema with a reasoning field and an answer field.
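Sketched as a plain dict (the field names follow the article's suggestion; everything else here is illustrative), that schema is roughly:

    # Schema that separates supporting reasoning from the final answer.
    # Listing "reasoning" before "answer" nudges the model to write its
    # reasoning first; structured outputs also wants every field required
    # and additionalProperties set to false.
    answer_schema = {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},  # chain of thought / commentary
            "answer": {"type": "string"},     # the part you actually use
        },
        "required": ["reasoning", "answer"],
        "additionalProperties": False,
    }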
Had the same experience with function calling—we get much better results simply asking for JSON. With simple schemas (basically dictionaries), gpt-4 and 4o are basically bulletproof.
We first built Magic Loops with GPT-4, about a year ago, well before JSON mode was a thing.
We had to do a bunch of extra prompting to make it work, as GPT would often include backticks or broken JSON (most commonly extra commas). At the time, YAML was a much better approach.
Thankfully we've been able to remove most of these hacks, but we still use a best effort JSON parser[0] to help stream partial UI back to the client.
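The core trick of a best-effort parse is just closing whatever is still open in the partial stream and retrying a normal parse. A simplified sketch of that idea (illustrative Python, not the parser we actually use):

    import json

    def parse_partial_json(fragment: str):
        # Track unclosed strings/brackets in the partial stream.
        closers = []
        in_string = False
        escaped = False
        for ch in fragment:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                closers.append("}")
            elif ch == "[":
                closers.append("]")
            elif ch in "}]":
                if closers:
                    closers.pop()
        # Close everything that is still open, innermost first, then retry.
        repaired = fragment + ('"' if in_string else "") + "".join(reversed(closers))
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None  # wait for more of the stream

    # e.g. parse_partial_json('{"title": "Mango party", "steps": [{"name": "inv')
    # -> {'title': 'Mango party', 'steps': [{'name': 'inv'}]}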
Technically yes, but it would require reverse-engineering some of our APIs.
Practically speaking, we have quite a few use-cases where users call Loops from other Loops, so we're investigating a first-class API to generate Loops in one go.
As with regular software engineering, what you put in is what you get out, so we've been hesitant to launch this given the current state of LLMs/the Loop Creator: it would fail more often than not.
highlights-text --> Right-Click --> New ML --> (smart dropdown for watch [price|name|date|{typed-in-prompt-instructions}]) --> TAB --> (smart frequency - tabbing through {watch blah}, and it's auto-filling "every N") --> NAME_ML=ML01.
THEN:
highlights-text --> Right-Click .... WHEN {ML01} == N DO {this|ML0X} --> ML00
ML00 == EMAIL|CSV|GDrive results.
ML11 == Graph all the above outputs.
:-)
--
A MasterLoop would be good - where you have all [public or private] loops register - and then you can route logic based on loops that exist - and since it's promptable - it can summarize and suggest logic when weaving loops into cohesive lattices of behavior. And if loops can subscribe to the output of other loops -- when you're looking for certain output strands -- you can say:
Find all the MLB loops daily and summarize what they say about only the Dodgers and the Giants - and keep a running table for the pitchers' and catchers' stats only.
EDIT: Aside from just subscribing, maybe a loop can #include# the loop in its behavior to accomplish its goal/assimilate its function / mate/spawn. :-)
> where you have all [public or private] loops register - and then you can route logic based on loops that exist - and since its promptable
This is actually where we started :)
We had trouble keeping all the context in via RAG and the original 8k token window, but we're aiming to bring this back in the (hopefully near) future.
We use GPT-4o to build dynamic UI+code[0], and almost all of our calls are using JSON mode. Previously it mostly worked, but we had to do some massaging on our end (backtick removal, etc.).
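The massaging was mostly stripping the markdown fences the model liked to wrap JSON in before parsing; roughly something like this (an illustrative sketch, not our actual code):

    import json
    import re

    def parse_model_json(raw: str):
        # Strip a ```json ... ``` (or bare ```) fence if the model added one,
        # then parse normally.
        text = raw.strip()
        fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
        if fenced:
            text = fenced.group(1)
        return json.loads(text)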
With that said, this will be great for GPT-4o-mini, as it often struggles/forgets to format things as we ask.
Note: we haven't had the same success rate with function calling compared to pure JSON mode, as function calling seems to add a level of indirection that can reduce the quality of the LLM's output. YMMV.
Anyhow, excited for this!
[0]https://magicloops.dev