What percentage of mechanistic interpretability is solved for GPT-2?
2055
36% chance

This question asks about the progress of mechanistic interpretability for GPT-2, specifically the extent to which researchers understand the internal mechanisms of the model. Mechanistic interpretability is considered “solved” when it is generally accepted by the research community that the majority of GPT-2’s internal computations and transformations are sufficiently understood, such that its behavior can be reliably explained and predicted in terms of the individual components (e.g., neurons, layers, or attention heads) and their interactions.

Resolution Criteria

1. Definition of “Solved”:

  • Mechanistic interpretability for GPT-2 will be considered 100% solved if the research community widely agrees that the internal operations of the model are fully understood in terms of their specific contributions to behavior across a broad range of tasks.

  • Partial resolution (e.g., 50%, 75%) is not allowed; the question resolves only when “solved” is generally agreed upon.

2. Indicators of Consensus:

  • Publications in top-tier AI conferences (e.g., NeurIPS, ICLR, ICML) or journals explicitly declaring that mechanistic interpretability for GPT-2 is solved.

  • Widespread agreement among researchers in recognized forums (e.g., AI alignment newsletters, major research lab announcements, community hubs like the Alignment Forum).

  • Benchmark studies demonstrating complete mechanistic understanding of GPT-2’s inner workings and the ability to reliably explain, predict, and manipulate the model’s behavior at a mechanistic level.

  • The consensus must be broad and sustained; isolated claims or papers are insufficient.

3. Model Version:

  • The question specifically pertains to the standard GPT-2 model, as originally released by OpenAI in 2019. Extensions, modifications, or other versions of GPT-2 are excluded from consideration.

4. Resolution Timing:

  • The question will remain unresolved until a general consensus is reached, irrespective of the calendar year. It resolves as “YES” only when the conditions above are met.

4mo

@IhorKendiukhov The title asks about the current percentage of "interpretability solved" for GPT-2, while the description points to something very different. It's basically:

This market resolves Yes as soon as there is consensus that mechanistic interpretability is 100% solved for GPT-2. Otherwise it stays open indefinitely.

This might work a little now that loans are back, but it probably won't tell you what you want to find out. I'd suggest at least changing the title (and probably resolving this N/A in favor of a differently structured market).
