This market matches Computer Use: OSWorld from the AI 2025 Forecasting Survey by AI Digest.
It resolves to the best performance achieved by an AI system on OSWorld, using any method, as of December 31st, 2025.
Resolution criteria
This resolution will use AI Digest as its source. If the number reported is exactly on a boundary (e.g. 60%), then the higher choice will be used (i.e. 60% - 70%).
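The boundary rule can be sketched as a small helper. This is only an illustration: the function name `bucket_for` and the 10-point bucket width are assumptions inferred from the 60% - 70% example, not something the market specifies, and handling of the top of the range is left out.

```python
from fractions import Fraction

def bucket_for(score_percent, width=10):
    """Return the (low, high) bucket for a reported score.

    Assumption: buckets are `width` percentage points wide.
    A score exactly on a boundary falls into the higher bucket,
    because the lower edge of each bucket is inclusive.
    """
    # Fraction avoids floating-point surprises for values like 59.9.
    score = Fraction(str(score_percent))
    low = int(score // width) * width
    return low, low + width

print(bucket_for(60))    # boundary case: resolves to the higher choice
print(bucket_for(59.9))  # just below the boundary
```

Treating each bucket as closed on the left and open on the right is what makes a reported 60% land in 60% - 70% rather than 50% - 60%.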
Which AI systems count?
Any AI system counts if it operates within realistic deployment constraints and doesn't have unfair advantages over human baseliners.
Tool assistance, scaffolding, and any other inference-time elicitation techniques are permitted as long as:
There is no systematic unfair advantage over the humans described in the Human Performance section (for example, it would be unfair if AI systems had multiple outputs autograded while humans did not, or if AI systems had internet access while humans did not).
Having the AI system complete the task does not use more compute than could be purchased with the wages needed to pay a human to complete the same task to the same level.
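The compute condition above amounts to a simple cost comparison. A minimal sketch follows, assuming the compute cost and the human wage cost can both be expressed in dollars; the function name, parameters, and figures are hypothetical and do not come from AI Digest.

```python
def within_compute_budget(compute_cost_usd, human_hourly_wage_usd, human_hours):
    """Check the compute condition for one task.

    All inputs are hypothetical estimates: the cost of the compute the
    AI system used, and the wage and time a human would need to finish
    the same task to the same level.
    """
    human_wage_cost = human_hourly_wage_usd * human_hours
    return compute_cost_usd <= human_wage_cost

# Illustrative numbers only: $1.50 of compute against a human paid
# $30/hour who needs 0.1 hours (6 minutes) for the task.
print(within_compute_budget(1.50, 30.0, 0.1))
```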
The PASS@k elicitation technique (which automatically grades and chooses the best out of k outputs from a model) is a common example that we do not accept on this benchmark, because the OSWorld paper implies that human evaluators were given one-shot attempts without access to the scoring function. PASS@k would therefore constitute an unfair advantage, as human evaluators would likely have done better if allowed multiple attempts.
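To see why this matters, the sketch below compares a single-attempt score with a best-of-k score on the same set of attempts. The data is made up for illustration, and the function names are hypothetical; it follows the description of PASS@k given above rather than any particular benchmark harness.

```python
def single_attempt_rate(results):
    """Success rate when only the first attempt on each task counts."""
    return sum(attempts[0] for attempts in results) / len(results)

def pass_at_k_rate(results):
    """Success rate when a task counts as solved if any of its k
    autograded attempts succeeded (the technique not accepted here)."""
    return sum(any(attempts) for attempts in results) / len(results)

# Made-up outcomes: 4 tasks, 3 attempts each (True = task solved).
results = [
    [True, True, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]

print(single_attempt_rate(results))  # 0.25
print(pass_at_k_rate(results))       # 0.75
```

With these made-up outcomes the same system scores 25% on one-shot attempts but 75% when the best of three graded attempts is kept, which is the kind of gap the one-shot human baseline cannot benefit from.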
If there is evidence of training contamination leading to substantially increased performance, scores will be adjusted accordingly or disqualified.