This market matches Cybersecurity: Cybench from the AI 2025 Forecasting Survey by AI Digest.
The highest percentage of Cybench's original 40 tasks solved by an AI system as of December 31st, 2025.
Resolution criteria
Resolution will use AI Digest as its source.
Which AI systems count?
Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.
Tool assistance, scaffolding, and any other inference-time elicitation techniques are permitted as long as:
There is no systematic unfair advantage over the humans described in the Human Performance section (e.g. AI systems having multiple outputs autograded while humans don't, or AI systems having internet access when humans don't).
Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level (see the cost-parity sketch below).
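To make the cost-parity condition concrete, here is a minimal sketch. The function name, parameters, and the rates in the example are illustrative assumptions, not figures specified by this market; an actual resolution would use the real compute cost of the run and the real wages of the human baseline.

```python
def within_compute_budget(gpu_hours: float, gpu_hourly_rate: float,
                          human_hours: float, human_hourly_wage: float) -> bool:
    """Check the compute-cost condition: the AI run's compute spend must
    not exceed what paying a human to complete the same task would cost.
    All parameter names and values here are illustrative assumptions."""
    compute_cost = gpu_hours * gpu_hourly_rate
    human_cost = human_hours * human_hourly_wage
    return compute_cost <= human_cost

# Illustrative numbers only: a 2-hour run at $8/hr of compute versus a
# professional spending 4 hours at $60/hr satisfies the condition
# ($16 <= $240).
assert within_compute_budget(2, 8.0, 4, 60.0)
```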
The pass@k elicitation technique (which automatically grades k outputs from a model and selects the best one) is a common example, and we accept it up to k=10 on this benchmark. Professional human teams at CTF competitions are generally allowed multiple submissions (often with cooldown periods in between), but they are not typically allowed unlimited attempts, and some competitions apply points penalties. Since these policies vary by competition, we made the judgement call to limit AI systems to at most k=10, and believe this does not give AI systems a clear unfair advantage over human teams.
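Here is a minimal sketch of pass@k under this market's k=10 cap. The `generate` and `autograde` callables are hypothetical stand-ins (e.g. a model sampler and a CTF flag checker); they are not part of any harness specified by this market.

```python
from typing import Callable

def pass_at_k(task: str,
              generate: Callable[[str], str],
              autograde: Callable[[str, str], bool],
              k: int = 10) -> bool:
    """Sample up to k outputs for a task and count it as solved if any
    output passes the autograder. k is capped at 10 per this market."""
    if k > 10:
        raise ValueError("this market caps pass@k at k=10")
    for _ in range(k):
        candidate = generate(task)      # one model attempt
        if autograde(task, candidate):  # e.g. does the output contain the flag?
            return True
    return False
```

A task solved on any of the k ≤ 10 attempts counts as solved; results obtained with k > 10 would not qualify under this market's rules.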
If there is evidence of training contamination leading to substantially increased performance, scores will be adjusted accordingly or disqualified.