Will OpenAI's Superalignment project produce a significant breakthrough in alignment research before 2027?
Resolved NO (May 17)

A team at OpenAI is working to solve the alignment problem. Short of asking whether they will succeed altogether, this question gauges whether it will be publicly known before Jan 1, 2027 that OpenAI has made a significant breakthrough in the alignment problem. The technical details of the breakthrough do not need to be public as long as OpenAI officially announces it and provides evidence, such as a live demonstration or system card, showing what they've achieved.

The resolution criterion for "significant breakthrough" is subjective, so I will not bet on this question. I am looking for breakthroughs roughly as significant for alignment as the Transformer was for deep learning. Here are some example breakthroughs that I think would qualify:

  • Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")

  • During training of a large RL model, robustly predicting from model weights alone whether and how goal misgeneralization will occur on examples far outside the training distribution

  • Solving polysemanticity

  • Detecting and demonstrating deceptive alignment in a language model, and identifying the circumstances under which it develops during training

  • Introducing a new model architecture that has significant empirical or theoretical advantages over Transformers with respect to alignment in particular, without significantly improving on their capabilities

  • Something I haven't mentioned, on an "I know it when I see it" basis. I'm open to community discussion on what qualifies.

If the team dissolves or significantly reorganizes before announcing such a breakthrough, this question resolves NO.


🏅 Top traders

#  Total profit
1  Ṁ1,935
2  Ṁ1,361
3  Ṁ1,255
4  Ṁ622
5  Ṁ472



From the title, I would bet YES on this, but "roughly as significant for alignment as the Transformer was for DL" is a very high bar, given that all of the LLMs like ChatGPT have been Transformers, with no comparable advances since (unless scaling is considered a breakthrough). I expect the Superalignment project to have at least one advance that they report as being extremely important (e.g., a better way to incorporate human feedback than RLHF/PPO), but nothing nearly that significant.

@Jacy yeah it does seem like the criterion here is "an alignment advance far greater than any we've had before", which is a high bar

Arb:


Is there any example of existing work that you'd have considered a significant breakthrough at the time? (e.g., SoLU, Constitutional AI, the IOI circuit, or anything else)


I think “analogous to the Transformer” is a high bar that none of these examples quite meet.


"Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")"

This doesn't seem like a significant breakthrough in terms of "solving alignment", even if the work itself would be impressive


@Feanor It would represent a huge advance in mech interp on large models, which would be pretty relevant, though I'm open to more detailed discussion on why it wouldn't be significant.


@Khoja it'd be significant in mech interp for sure, but I don't think it would qualify under their stated goals, especially getting the broader AI safety community to agree that it's extremely relevant


Imagine telling someone in GOFAI twenty years ago that "figuring out how an AI operating on text adds numbers would be a huge breakthrough"...

The other thing is that I don't think the way GPT adds numbers is going to be particularly surprising? Doing that will teach us more about how to do mechanistic interpretability, but not anything about how GPT-3 "does all of the interesting stuff it does", I think?


@jacksonpolack yeah it will be a good advance in mech interp, but I doubt the alignment community in general will judge it a breakthrough


oh you're fuh


@jacksonpolack yes I'm reading Silmarillion and liked the Feanor chapter a lot


@Mira's market already kinda tracks this, but of course with different timelines


@Feanor Yeah, and this question looks at the Superalignment project in particular
