Will a large language model beat a super grandmaster playing chess by 2028?
Ṁ1.5m · 2029 · 51% chance

If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves YES.

I will ignore fun games, at my discretion (say, a game where Hikaru loses to ChatGPT because he played the Bongcloud).

Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically for doing so (just as humans aren't chess-playing machines). Some clarifications from my previous comments:

1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.


2- The model can write as much as it wants to reason about the best move, but it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.

I won't bet on this market, and I will refund anyone who feels betrayed by this new description and had open bets as of 28th Mar 2023. This market will require judgement.

  • Update 2025-01-21 (PST) (AI summary of creator comment): - LLM identification: A program must be recognized by reputable media outlets (e.g., The Verge) as a Large Language Model (LLM) to qualify for this market.

    • Self-designation insufficient: Simply labeling a program as an LLM without external media recognition does not qualify it as an LLM for resolution purposes.

  • Update 2025-06-14 (PST) (AI summary of creator comment): The creator has clarified their definition of "blind chess". The game must be played with the grandmaster and the LLM communicating their respective moves using standard notation.

  • Update 2025-09-06 (PST) (AI summary of creator comment): - Time control: No constraints. Blitz, rapid, classical, or casual online games all count if other criteria are met.

    • “Fun game” clause: Still applies, but the bar to exclude a game as "for fun" is high; unusual openings or quick, unpretentious play alone don't make it a "fun" game.

    • Super grandmaster: The opponent must have the GM title and a classical Elo rating of 2700 or higher.

  • Update 2025-09-11 (PST) (AI summary of creator comment): - Reasoning models are fair game (subject to all other criteria).

  • Update 2025-09-13 (PST) (AI summary of creator comment): Sub-agents/parallel self-calls

    • An LLM may spawn and coordinate multiple parallel instances of itself (same model/weights) to evaluate candidate moves or perform tree search, including recursively. This is considered internal reasoning and is allowed.

    • Using non-LLM tools or external resources (e.g., chess engines like Stockfish, databases) remains disallowed.
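
The "blind chess" protocol in the updates above (both sides exchanging moves in standard notation, with illegal moves as a recurring failure mode) boils down to a simple referee loop. The sketch below is illustrative only: `gm_move`, `llm_move`, and `is_legal` are hypothetical callables, and a real referee would use a chess library such as python-chess for legality checking.

```python
def play_blind_game(gm_move, llm_move, is_legal, max_plies=200):
    """Referee loop: both sides exchange moves in standard algebraic notation.

    gm_move / llm_move: callables taking the move history (a list of SAN
    strings) and returning the next SAN move.
    is_legal: callable(history, move) -> bool.
    An illegal move forfeits the game for the side that played it.
    """
    history = []
    players = [("GM", gm_move), ("LLM", llm_move)]
    for ply in range(max_plies):
        name, mover = players[ply % 2]
        move = mover(history)
        if not is_legal(history, move):
            return f"{name} forfeits (illegal move: {move})"
        history.append(move)
    return "draw (move limit reached)"
```

Under this framing, the Carlsen game discussed further down the thread (where both sides let illegal moves stand) would simply never have gotten past the first `is_legal` check.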


https://dubesor.de/chess/chess-leaderboard

This leaderboard paints a MUUUCH direr outlook. It says GPT-3.5-Turbo, a 2022 model, plays at the 1200 level, while GPT-5 plays at 1,500. At that rate, superhuman chess by generalist LLMs arrives only around 2040.


@MP A 2400 Elo has about a 4% chance of winning against a 2700 Elo, and this does happen in tournament chess. So the more accurate linear fit lands at 2037.
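
For reference, the standard Elo expected-score formula behind numbers like this can be sketched as follows. Note it gives the expected *score* (a draw counts half); at classical time controls many of those points come from draws, so the outright win probability for the lower-rated player is considerably lower, which is consistent with the ~4% figure.

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score (win = 1, draw = 0.5) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# A 2400 facing a 2700 scores about 15% of the points on average.
print(round(elo_expected_score(2400, 2700), 3))  # → 0.151
```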

Realistically, I would bet no later than 2035 for the fulfillment of these criteria, assuming no inexplicable (or otherwise) slowdown. Why is no one pumping this market with NO shares if you think what you said is credible? Instead, it's at 52%. FWIW I was a NO holder, but I sold for now.


Also, updates: GPT-5 Codex is now at 1596 Elo in mixed mode. GPT-5 was at 1485 Elo on Sep 15.

It should be noted that GPT-5 Codex is 1836 in reasoning (but 1284 in continuation).


@MP It's even worse than that. The ratings there are not standardized against human ratings, and I believe they're vastly inflated with respect to FIDE ratings.

@MP Are you saying that the evaluation of playing skill by the best chess player in the world, Stockfish 17.1, is inflated? That's the base rating. It's extraordinarily unlikely that they're off by more than like 30 points.

@Lilemont Stockfish gives a reasonable accuracy score. However, accuracy score does not cleanly convert to rating in general. See here: https://lichess.org/page/accuracy

Additionally, the author converts it using a formula given in citation one here: https://dubesor.de/chess/chess-leaderboard The formula is:

Initial_Elo = 400 + 200 × (2^((Accuracy − 30)/20) − 1)

Where:

  • Accuracy = average accuracy (%) across the first 10 non-self-play games

  • Accuracy is constrained between 10% and 90%

  • Human players start at 1500 Elo regardless of accuracy

  • Default fallback: 1000 Elo if no accuracy data is available

I have no idea where they got this formula from. It's probably fine for creating a leaderboard where ratings are self-consistent, but unless the author provides data to the contrary, I don't think there's any reason to believe this conversion from accuracy to Elo remotely corresponds to the rough correlation that would be found between human accuracy and FIDE Elo.
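
Transcribed directly into Python, the leaderboard's stated mapping looks like this. This is just the quoted formula as written, not an endorsement of its accuracy; the function name is my own.

```python
def accuracy_to_initial_elo(accuracy):
    """Leaderboard's stated accuracy -> initial Elo mapping.

    accuracy: average accuracy (%) over the first 10 non-self-play games,
    or None if no accuracy data is available.
    (Per the citation, human players start at 1500 regardless of accuracy;
    this function covers only the model-rating path.)
    """
    if accuracy is None:
        return 1000.0                              # default fallback
    accuracy = max(10.0, min(90.0, accuracy))      # clamp to [10%, 90%]
    return 400.0 + 200.0 * (2 ** ((accuracy - 30.0) / 20.0) - 1.0)
```

Note how steep the curve is at the top: 90% accuracy maps to only 1800 Elo, so even a perfectly-scored model cannot start anywhere near super-GM territory on this scale.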

@Lilemont To your edit, I believe they are not off by just 30 points but many, many hundreds of points. Are you a chess player who has played the models? I'm a National Master and I believe there is absolutely no way the models are that strong.

@Lilemont This leaderboard seems to me to be much more accurate: https://maxim-saplin.github.io/llm_chess/

Unfortunately it doesn't have some of the latest models. But, they fix the exact problem that I was talking about: "We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated on chess.com. This allowed us to anchor the results to a real-world rating scale and compute an Elo rating for each model."

@DanielJohnston This leaderboard assigns negative elo, which is usually not done in chess.

No, the conversion based on accuracy isn't perfect, but it is approximated from the performance of human players and the observed correlation.

@Lilemont You're right that negative Elo is usually not done. However, that's for practical rather than theoretical reasons. The USCF actually used to have negative ratings a few decades ago, but removed them (I'm guessing for psychological reasons lol) and instead instituted a rating floor at 100. Having negative ratings rather than a rating floor actually makes the rating system more accurate and consistent, as a rating floor creates inflation. FIDE long avoided this problem by simply bumping anyone below their minimum rating off the list entirely.

Yes, a conversion based on accuracy could be relatively accurate, particularly if it takes into account position complexity, whether the opening is already known, etc. My understanding is that Ken Regan has a way to measure rating quite accurately based on games. However, as I said, I don't know the source of the conversion formula used in the leaderboard we're talking about and I don't know of any reason to think it's an accurate one.

@MP If this leaderboard is any guide, there was a 600-point gap between o1 (Dec/24) and GPT-5. Let's say AI improves 600 points per year. Then you'd have an LLM that can play at super-GM level in 2028, and superhuman chess in 2029.
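
The extrapolation in that comment is just a linear projection; a minimal sketch, using the commenter's assumed ~600 points/year (the thread itself notes how crude such linear fits are, and the starting Elo and year here are approximate figures pulled from earlier comments):

```python
def projected_year(current_elo, current_year, target_elo, gain_per_year):
    """Linear extrapolation of leaderboard Elo to a target rating."""
    return current_year + (target_elo - current_elo) / gain_per_year

# A model around 1500 Elo in 2025, gaining ~600 points/year, projects
# to the 2700 super-GM threshold in the 2027-2028 range.
print(projected_year(1500, 2025, 2700, 600))
```

The disagreement upthread (2028 vs. 2037 vs. 2040) comes entirely from which leaderboard's ratings and which yearly gain you plug in.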

I have to say I am impressed by the Carlsen ChatGPT game. ChatGPT went way longer than I was expecting without illegal moves.

What do you think should happen if spinning up sub-agents is a default behavior of general-purpose AIs? They could use many agents in parallel to perform a breadth-first search.

That's fair game, right? The LLM would only be bottlenecked by its own parallelism capacity and the AI lab's willingness to offer that many tokens.

Thoughts??

@MP This wouldn't be fair game because it wouldn't be just one LLM? I mean, what if the LLM queried stockfish?

@SorenJ An LLM querying itself seems a little different from an LLM querying some non-LLM program. But I agree that it's a tricky edge case.

I'm inclined to say this behavior should be allowed, because I see a good chance that all leading models will have it in a few years; if they're ruled out, the market resolves NO for an uninteresting reason (people stop developing qualifying LLMs before 2028). It's not much of a step from the current reasoning models.

@placebo_username I disagree that because the leading models may have this behavior in a few years, that is a good reason to allow it. (Let's say the leading models all had the ability to query Stockfish in a few years-- does that mean we should allow it?)

I guess as an analogy, this is like saying, "Couldn't a human ask and query another human?" But that wouldn't be a fair match. Sure, a team of 100 grandmasters could all be sub-queried and analyze top candidate moves, but I wouldn't consider that a fair chess match of one human vs. one LLM.

@MP in this context what is a sub-agent? Would this be another instance of the "prime" LLM, running on/competing for the same hardware/resources as the "prime" agent?

I assume calling on stockfish is off limits much in the same way it would be considered cheating if a human were calling on stockfish to feed them moves during a game.

@ShitakiIntaki I am quite sure you can do this today. You send a position to Claude, it selects a handful of candidate moves, and it asks parallel instances of Claude to evaluate them. You do this recursively until some stopping rule. At no point did you do anything dirty like querying Stockfish.

This is, in very broad strokes, what GPT-5 Pro does anyway.
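
The pattern MP describes, one model fanning a position out to parallel instances of itself to score candidate moves, can be sketched as below. Everything here is illustrative: `ask_model` is a stand-in for a hypothetical call to another instance of the same LLM, and the dummy scoring is not a real chess evaluator.

```python
from concurrent.futures import ThreadPoolExecutor

def ask_model(prompt: str) -> float:
    # Stand-in for querying another instance of the same model.
    # A real implementation would send `prompt` to the model's API
    # and parse a numeric evaluation out of its reply.
    return float(len(prompt) % 10)  # dummy score for illustration

def best_move(position: str, candidate_moves: list) -> str:
    """Fan a position out to parallel self-instances, one per candidate move."""
    prompts = [f"Evaluate {position} after {move}" for move in candidate_moves]
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(ask_model, prompts))
    # Pick the candidate the parallel instances rated highest.
    return max(zip(candidate_moves, scores), key=lambda ms: ms[1])[0]
```

Applied recursively (each instance itself calling `best_move` on the resulting positions), this becomes exactly the breadth-first tree search discussed above, with no external engine involved; the per-update rule in the description permits it as long as every call goes to the same model/weights.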

@MP Why is this allowed but having humans “phone a friend” many times in parallel wouldn’t be?

@SorenJ The way I understand this, Claude asking parallel instances of Claude could be thought of as Claude daydreaming. It is still "Claude" doing the reasoning, whereas in the human case a "friend" is a distinct person.

What I wouldn't want to see allowed would be Claude asking ChatGPT or asking Stockfish, because that would truly be an external call.

@ShitakiIntaki I mean, hypothetically, what about a human cloning themselves and digitally transferring their brain to the clone? Or creating a copy of their brain?

The reason it doesn't seem fair to me to spin up parallel instances is that it is just leveraging more compute. In the TCEC (https://tcec-chess.com) hardware is fixed for matches; Stockfish isn't allowed to spin up a parallel version of another Stockfish.

@SorenJ Seems like the obvious thing here is to limit the cost of the inference compute used by the LLM. I think inference compute is a more natural unit of fairness than number of copies.

Here's the latest blindfolded video, Magnus Carlsen vs (presumably) GPT-5:

https://www.youtube.com/watch?v=3Fk_ihy4lIc

No change from the status quo: it plays illegal moves like crazy, and accepts illegal moves from the human.

@pietrokc But, the match did happen, so there’s that! I hadn’t expected such a match by now
