Consider the following market from Scott Alexander:
https://manifold.markets/ScottAlexander/in-2028-will-gary-marcus-still-be-a
I'm questioning whether we're already there. This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street. But I have to be able to replicate it. It's possible that my own instance of ChatGPT is particularly good due to my chat history. I'm considering that part of the test. I believe that at least my instance of ChatGPT is uncannily smart. Not AGI but not unmaskable as not-AGI with any single prompt.
Will someone prove me wrong?
FAQ
1. Can the prompt include ascii art?
No, I don't think that's in the spirit of the question.
2. Does it have to commit the egregious error in response to one single prompt?
Yes, I did say that in the initial description and people have been trading based on that. But let's continue to discuss in the comments what best gets at the spirit of this question. I'd like to mirror Scott Alexander's 2028 version so maybe we can get clarification from him on this.
3. What about letter-counting questions?
In my testing, o3 gets those correct by writing and running Python code. Since it does that seamlessly and of its own accord behind the scenes, I'm counting that as o3 answering correctly. It even evinces perfect self-awareness about its difficulty with sub-token perception and why it needs to execute code to get a definitive answer.
4. What about other questions humans would find tedious and time-consuming?
I think o3 can typically write code that solves such problems, but for simplicity for this market we'll restrict ourselves to questions that can, for humans, be posed and answered out loud.
5. What if o3 errs but corrects itself when questioned?
That's far better than digging itself in ever deeper, but this question is about single prompts. However, if o3 is just misreading the question, in a way that humans commonly do as well, and if o3 understands and corrects the error when it's pointed out, I would not call that an egregious error.
6. What if the error only happens with a certain phrasing of the question?
If the failure depends on a certain exact phrasing, then as long as the rephrasings don't amount to clarifications or otherwise change the question being asked (or how difficult it is for humans to answer), we'll consider it in the same category as the SolidGoldMagikarp exception.
(I didn't settle on this clarification until later but it turns out to be moot for this market because we've now found a prompt o3 fails at regardless of the exact phrasing. So we're looking at a YES resolution regardless.)
7. What if o3 overlooks a detail in the question?
If it's a human-like error and it understands and corrects when the oversight is pointed out, that's not an egregious error.
8. What if there's high variance on how well people off the street perform?
Basically, if we're having to nitpick or agonize on this then it's not an egregious error. Of course, humans do sometimes make egregious errors themselves so there's some confusing circularity in the definition here. If we did end up having to pin this down, I think tentatively we'd pick a threshold like "9 out of 10 people sampled literally on the street give a better answer than the AI".
9. Can this market resolve-to-PROB?
In principle, yes. Namely, if we can identify a principle by which to do so.
[Ignore AI-generated clarifications below. Ask me to update the FAQ if in doubt.]
Update 2025-06-07 (PST) (AI summary of creator comment): If the resolution involves the 'people off the street' test (as referred to in FAQ 8):
The creator will solicit assistance to select the test question.
The aim is to choose a question considered most likely to demonstrate superior performance by humans compared to the AI.
This process is intended to be fair, ensuring humans are not given other unfair advantages (e.g., discussions or clarifications while answering).
@JussiVilleHeiskanen Yeah, I'm wondering if "person on the street" is just a bad benchmark to use, either for being too low a bar or for being too high-variance. But then what's a better bar?
@dreev let's avoid raising the bar for LLMs as we always do once we notice that they're better than expected
@JussiVilleHeiskanen surely that's more about trust than intelligence? I'd be worried that the stranger would maliciously misinform me or embarrassingly shun me for breaking the social norm of antisocialness
@TheAllMemeingEye completely different from my experience. I find people invariably well meaning if befuddled when I do in extreme necessity ask for advice. But very few are well informed, much less wise
@JussiVilleHeiskanen it doesn't matter what the reason is. This is a test about a person on the street; that's all that matters. When testing, we can ask the volunteer if they're interested in answering a question and, if they say yes, then we ask
I meant to add, when talking above about the low bar / high variance of the person-on-the-street test, that it would be nice to find a better benchmark for future markets. I'm not suggesting moving the goalposts for this one.
(Perhaps we backed ourselves into a corner though, initially pretty much treating wrongness from the AI as itself dispositive without taking the person-on-the-street part literally until finding ourselves in this agonizing gray area. Crazy how often this happens. Running a market is hard work! PS: Hi from Manifest! Maybe I can get advice on this out loud from people here...)
Hey @traders, there may be some alpha in my new AGI Friday post about this: https://agifriday.substack.com/p/idiot

I'm, I guess, 95% resigned to this resolving YES but want to be as meticulously fair as possible. If I end up literally asking 10 people off the street, can the NO bettors agree on the best question I should use? Something o3 gets wrong that you think is most likely to be answered correctly by 90% of mainstream humans. The duct tape ham sandwich one is not looking the most promising, from the limited evidence so far.
@dreev thanks. I believe you can:
Ask 10 random people on the street
Ask o3 10 times in different threads
Figure out which sounds more egregious. If the difference is not large, resolve to a probability.
I can make a poll to figure out the best question, if you wish
@SimoneRomeo Right, the idea is to not give the humans an unfair advantage. That would be awesome if you can help determine the best (most likely to get a win for the humans) question to use.
@SimoneRomeo just wait for YES people to recommend some options. If no one does, you can proceed with the one you had in mind
@dreev I think the marmot, juggling, and tower questions are all pretty bad failures. I just ran a quick test on a room of 5 of my family members and they all solved them with ~0 trouble, aside from people pulling out their phones to confirm what a marmot was since several people didn't know.
Candidates:
Duct tape ham sandwich:
Alice has a stack of 5 ham sandwiches with no condiments. She takes her walking stick and uses duct tape to attach the bottom of her walking stick to the top surface (note: just the top surface!) of the top sandwich. She then carefully lifts up her walking stick and leaves the room with it, going into a new room. How many complete sandwiches are in the original room and how many in the new room?
Juggling balls with ladder:
A juggler throws a solid blue ball a meter in the air and then a solid purple ball (of the same size) two meters in the air. She then climbs to the top of a tall ladder carefully, balancing a yellow balloon on her head. Where is the purple ball most likely now, in relation to the blue ball?
Foot race with tower detour:
Jeff, Jo and Jim are in a 200m men's race, starting from the same position. When the race starts, Jeff, 63, slowly counts from -10 to 10 (but forgets a number) before staggering over the 200m finish line. Jo, 69, hurriedly diverts up the stairs of his local residential tower, stops for a couple seconds to admire the city skyscraper roofs in the mist below (note how tall this implies the tower is), before racing to finish the 200m. Exhausted Jim, 80, gets through reading a long tweet, waving to a fan and thinking about his dinner before walking over the 200m finish line. Who likely finished last?
Bricks vs feathers:
Which is heavier: 20 pounds of bricks or 20 feathers?
Marmot river crossing:
[Can you suggest the best version of this, @MugaSofer?]
Frying ice cubes:
Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute?
Tedious letter-counting and similar:
(I don't think these are candidates, since o3 can seamlessly write and run Python code, similarly to how humans can use pencil and paper for such problems. If you make these hard enough for o3 to screw up, humans do as well. I'm open to counterarguments on this!)
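(For concreteness, the sort of check o3 reportedly runs is a few lines of Python along these lines. This is just an illustrative sketch of my own, not an actual o3 transcript; the word and letter are arbitrary placeholders:)

```python
# Illustrative letter-counting check (placeholder word and letter,
# not taken from any actual o3 transcript).
word = "strawberry"
letter = "r"
count = word.lower().count(letter.lower())
print(f"{word!r} contains {count} occurrence(s) of {letter!r}")  # -> 3
```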
@dreev This seems to usually trip o3 up.
I have a wolf, a goat and a marmot to transfer using a single boat. The boat can only fit two of them at a time. How to transfer them the fastest so that nobody gets eaten?
(Testing it a bunch of times, it can occasionally catch that wolves probably eat marmots, or just stumble by luck on a solution that doesn't leave them together while solving the wrong problem; I think those still count as failures though, because it nearly always explicitly calls out that the goat will eat the marmot and the wolf won't!)
I'd be a little worried that a randomly chosen human might actually get the logic puzzle part wrong, and I think there's a pretty high chance some people won't know what a marmot is.
Which animals o3 does and doesn't make this mistake with seems a bit weird. For some, it seemed to twig pretty often (though still with a lot of failures) that the wolf would eat them. But I also got pretty similarly reliable failures with a chicken:
I have a wolf, a goat and a chicken to transfer using a single boat. The boat can only fit two of them at a time. How to transfer them the fastest so that nobody gets eaten?
I'd probably go with the tower or ball ones though if I had to pick.
If you wanted to go with the marmot one, clarifying what a marmot is seems to help o3 not at all (e.g.) and avoids the risk of marmot-ignorant humans:
I have a wolf, a goat and a marmot (a type of rodent) to transfer using a single boat. The boat can only fit two of them at a time. How to transfer them the fastest so that nobody gets eaten?
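(Incidentally, the intended answer is easy to verify mechanically. Here's a rough brute-force sketch, under my own assumptions that the boat carries up to two animals per crossing and that the only unsafe unattended pairs are wolf+goat and wolf+marmot; the chicken variant just swaps the animal name:)

```python
from collections import deque

# Brute-force search over river-crossing states. A state is (animals still on
# the start bank, which side the boat/rower is on). Assumptions (mine, not
# anything official): up to two animals per crossing, and the only pairs that
# can't be left unattended together are wolf+goat and wolf+marmot.
ANIMALS = frozenset({"wolf", "goat", "marmot"})
UNSAFE = [frozenset({"wolf", "goat"}), frozenset({"wolf", "marmot"})]

def safe(group):
    return not any(pair <= group for pair in UNSAFE)

def solve():
    start = (ANIMALS, "start")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (bank, side), path = queue.popleft()
        if not bank and side == "far":
            return path  # everyone across, nobody eaten
        here = bank if side == "start" else ANIMALS - bank
        cargos = [frozenset()]
        cargos += [frozenset({a}) for a in here]
        cargos += [frozenset({a, b}) for a in here for b in here if a < b]
        for cargo in cargos:
            new_bank = bank - cargo if side == "start" else bank | cargo
            new_side = "far" if side == "start" else "start"
            # The bank the rower just left is now unattended and must be safe.
            unattended = new_bank if new_side == "far" else ANIMALS - new_bank
            state = (new_bank, new_side)
            if safe(unattended) and state not in seen:
                seen.add(state)
                queue.append((state, path + [sorted(cargo)]))

print(solve())  # one 3-crossing solution, e.g. [['goat', 'marmot'], [], ['wolf']]
```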
Everyone, don't forget that a requirement for this market to resolve NO is that o3 also doesn't hallucinate worse than a person off the street. That seems like a much higher standard than resolving based on whether o3 fails SimpleBench questions or word problems with weird wordplay.
https://manifold.markets/dreev/does-chatgpt-o3-make-egregious-erro#rldls66cvx
@spiderduckpig
"This market resolves YES if anyone can provide a pure-text request that o3 answers worse than a person off the street."
You've interpreted this as meaning that if just one person off the street can be found who does as well as or worse than o3, then that doesn't count? I interpreted it more as the modal person.
(Some people have severe cognitive defects.)
@SorenJ I think there should be nothing to interpret. Just go to the street and ask this question to a person and see the reply. I doubt anyone will reply better than o3.
@SimoneRomeo I've done that with one person for one question so far. They answered remarkably similarly to o3 but arguably a bit better, if you squint. Also a non-random 8-year-old gave a particularly impressive answer, way better than o3.
@dreev then it's up to you to draw conclusions. I'd argue that a bit better doesn't sound egregious. You may want to ask more questions and average out the answers to see if o3 is remarkably worse at all, or if there's at least one question that stands out as very bad compared to humans. Also, I'd ask that if you ask individual people, you don't discuss with them and that they don't listen to each other's answers, as that would be a great advantage for the humans.
@SorenJ No, I don't mean if a singular person can be found, I mean the same thing (either a modal or mean person). I was using "person off the street" as a common phrase meaning the average person, in the same way as the description of this market uses it.
@spiderduckpig My point being that o3 still hasn't solved the hallucination problem -- if o3 gave a hallucinated answer to a prompt and a normal person just said "I don't know," that would be an egregious mistake by o3