Hacker News | kerridge0's comments

The recent one was whether I should drive my car to the car wash if it's only 300 feet from my house, although the answer wasn't a slam dunk.


Right, but if these things are so rare that we all only know the one viral example, I feel like that lends credence to the idea that the models generally don't have this problem.

Researchers built the Winograd Schema Challenge more than a decade ago to assess common-sense reasoning (e.g. resolving what "it" refers to in "the trophy doesn't fit in the suitcase because it is too big"), and LLMs beat that challenge around GPT-4.


They're not so rare. Hallucinations have been spotted everywhere, but the "driving a car to the car wash" example is an amusing one that's recently been publicised. Developers aren't going to point out every time an LLM hallucinates an entire library.


I'd add to this: any moderately involved logical or numerical problem causes hallucinations for me on all frontier models.

If you give them such a problem in isolation they may write a script to solve it "properly", but I suspect that's because enough of these were added to the training set. This workaround doesn't scale.

As soon as I give the LLM a proper problem where a small part of it requires numeric reasoning, it almost always hallucinates something rather than solving that part with a script.

If the logic/math is part of a larger problem the miss rate is near 100%.

LLMs have massive amounts of knowledge, encoded as verbal intelligence, but their logical intelligence is well below even average human intelligence.

If you look at how they work (tokenization and embeddings) it's clear that transformers will not solve the issue. The escape hatches only work very unreliably.


What's a typical example?

I have been broadly quite happy with gpt 5.4 xhigh's reasoning on things like performance engineering tasks.


If you ask this of any current-day AI, it will answer exactly how you would expect: telling you to drive, and acknowledging the comedic nature of the question.


That's because AI labs keep stamping out the widely known failures. I assume they do this without actually retraining the main model, instead using some small classifier that detects the known meme questions and injects the correct answer into the context.

But try asking your favorite LLM what happens if you're holding a pen with two hands (one at each end) and let go of one end.



Are you also an LLM? Do objects often begin rotating when you're only holding them with one hand?


It's not unlikely that you're talking to a lot of AI-based AI boosters. It's easier to create astroturfed comments with chatbots than to fix the inherent problems.


I always like to ask AI to generate a middle-aged blond man with gray hair. Turns out that in every model, the gray hair has black roots.

https://chatgpt.com/share/69bcd01a-a750-800d-95f5-3b840b9ee2...

https://gemini.google.com/share/edc223bb6291 (the try again gave a woman, oops)

Even Midjourney couldn't do it.


Nice. My test was always a blond bald guy. It always adds hair. If you ask for bald, you get a dark-haired bald guy; if you add blond, you can't get bald, because I guess stating the hair colour implies hair (on the head), while you may just want blond eyebrows and/or blond stubble.


I believe that's the stuff you buy in the shop, the non-refillable containers. If you buy a proper refillable balloon gas cylinder, it's the higher-grade stuff. Source: bought the shop stuff, got disappointed, bought the cylinder, happy.


And we had good skeletons - in 1985, Aardman Animations created this advert for VHS cassettes https://youtu.be/ffa1E9k3H4k


I can hear the voice of Derek Gyller without even clicking the link.


FocusMe for Windows has been around a long time.


Thanet?


Like a dog chasing its tail. Or bundling them together like a dogpile.


Octopus Energy in the UK, with the beta Agile tariff, is based on 30-minute pricing, and prices have at times (such as yesterday) been so low that the price per unit has gone negative. For example, on Sunday and Monday I was paid 4.2 pence per unit to take electricity off the network.


Their web site advertises support for IFTTT. Has anyone here hacked on Octopus APIs? Looks very cool.


Yeah, I have a bit. I'm on their Agile tariff with half-hour pricing intervals (this is what the smart meter standards give you).

It's really cool. Sometimes the price goes below zero ("plunge pricing") when there is too much production and too little consumption. They give you the pricing for today and tomorrow ahead of time; I'm not sure how they predict future prices, probably from weather forecasts, since they only use solar and wind?

I have WiFi smart sockets on things like electric heaters to turn them on when the electricity price is below my threshold. It's a nice feeling being paid to use electricity. If you have an electric car, you could also programmatically charge it only when the price is below a certain level. Octopus also have an EV leasing company; I'm thinking of selling my current car and leasing one from them once covid19 has subsided enough that I need a car again.
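The threshold logic driving those smart sockets can be sketched in a few lines. This is a minimal illustration, not Octopus code: the `rates` sample data is made up, though its field names mimic the shape of the Agile API's half-hourly rate entries.

```python
# Hypothetical half-hourly unit rates in p/kWh, shaped like the
# entries the Octopus Agile API returns (field names assumed).
rates = [
    {"valid_from": "2020-04-05T00:00:00Z", "value_inc_vat": 8.1},
    {"valid_from": "2020-04-05T00:30:00Z", "value_inc_vat": -4.2},
    {"valid_from": "2020-04-05T01:00:00Z", "value_inc_vat": 2.5},
]

THRESHOLD = 0.0  # p/kWh: only run the heater when we're paid to consume


def slots_below_threshold(rates, threshold=THRESHOLD):
    """Return the start times of half-hour slots cheap enough to switch on."""
    return [r["valid_from"] for r in rates if r["value_inc_vat"] < threshold]


print(slots_below_threshold(rates))  # the 00:30 plunge-pricing slot
```

In practice you would fetch tomorrow's rates once a day, compute the cheap slots, and schedule the socket toggles accordingly.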

Here are their dev docs:

https://developer.octopus.energy/docs/api/

They even have an open GraphQL API and a Storybook. How many utility companies do you know that do this? Looks like Octopus hired the right people!

https://api.octopus.energy/v1/graphql/

https://octopus.energy/static/common/storybook/account-manag...
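For a flavour of the REST side, here is a sketch of building the half-hourly unit-rates URL. The product and tariff codes below are assumptions for illustration only; real codes come from listing the products endpoint first.

```python
BASE = "https://api.octopus.energy/v1"


def unit_rates_url(product_code, tariff_code):
    """Build the standard-unit-rates URL for an electricity tariff."""
    return (f"{BASE}/products/{product_code}"
            f"/electricity-tariffs/{tariff_code}/standard-unit-rates/")


# Example with assumed Agile codes (region "A"); fetch this URL
# with any HTTP client to get the half-hourly prices as JSON.
print(unit_rates_url("AGILE-18-02-21", "E-1R-AGILE-18-02-21-A"))
```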


There used to be one called the Global Ideas Bank, but it went away, presumably due to lack of funding. It was founded by Nicholas Albery: https://en.wikipedia.org/wiki/Nicholas_Albery


No, because the increased risk of death means you might get a worse reputation if people OD.


It's not an abomination. It has some flaws that don't affect me, and others that make me compromise a bit (drive more slowly, stop more often). But one reason I chose it, beyond it being the only affordable electric car available that met all my requirements, was the BMW quality and attention to detail, which are hallmarks of the brand.

