By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?

802

Ṁ470k

2027

11% chance

In "Moving the Eiffel Tower to ROME", a paper claimed to have identified where the fact "The Eiffel Tower is in France" had been stored within GPT-J-6B, in the sense that you could poke the GPT there and make it believe the Eiffel Tower was in Rome.

"The Eiffel Tower is in France" seems (in my personal judgment) like the sort of fact that early AI pioneers could and did represent within GOFAI systems. GPT-J probably does more with that fact - it can, for example, answer how to get to the Eiffel Tower from Berlin while believing that the Eiffel Tower is in Rome. But the paper didn't offer neural transparency into how GPT-J gives directions; we don't know the stored patterns for answering that part - just a neural representation of the brute idea that GOFAI pioneers might've represented with in(Eiffel-Tower, Rome).
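For concreteness, here is a minimal sketch of the kind of symbolic representation meant by in(Eiffel-Tower, Rome). The triple format and the helper names `located_in` and `edit_fact` are hypothetical illustrations, not anything taken from the ROME paper or from a specific GOFAI system.

```python
# Hypothetical GOFAI-style fact base: each fact is a (predicate, subject, object) triple.
facts = {
    ("in", "Eiffel-Tower", "France"),
    ("in", "France", "Europe"),
    ("in", "Rome", "Italy"),
    ("in", "Italy", "Europe"),
}

def located_in(thing, place, kb):
    """True if `thing` is in `place`, directly or via a chain of `in` facts."""
    frontier = [thing]
    seen = set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for pred, subj, obj in kb:
            if pred == "in" and subj == current:
                if obj == place:
                    return True
                frontier.append(obj)
    return False

def edit_fact(kb, old, new):
    """Return a copy of the fact base with one triple swapped for another."""
    return (kb - {old}) | {new}

# "Moving" the tower is a one-triple edit in this toy representation.
edited = edit_fact(
    facts,
    ("in", "Eiffel-Tower", "France"),
    ("in", "Eiffel-Tower", "Rome"),
)
print(located_in("Eiffel-Tower", "France", facts))
print(located_in("Eiffel-Tower", "Italy", edited))
```

In this toy version the edited fact also changes every inference that chains through it, which is the only sense in which it resembles the behavior described above; it says nothing about how GPT-J stores or uses the fact.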

This market reflects the probability that, in the personal judgment of Eliezer Yudkowsky, anyone will have uncovered any sort of data, pattern, cognitive representation, within a text transformer / large language model (LLM), whose semantic pattern and nature wasn't familiar to AI and cognitive science in 2006 (to pick an arbitrary threshold for "before the rise of deep learning").

Also in 2006, somebody might've represented "the Eiffel Tower is in France" by assigning spatial coordinates to the Eiffel Tower and a regional boundary to France. Idioms like that appear in e.g. video games long predating 2006. Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model, so far as I know; but even if someone did so before the end of 2026 - as mighty a triumph as that would be - it would not (in the personal judgment of Eliezer Yudkowsky) be an instance of somebody finding a cognitive pattern represented inside a text transformer, which pattern was unknown to cognitive science in 2006.
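A minimal sketch of that coordinate-and-boundary idiom follows. The `Region` class, the rectangular boundary, and the rough latitude/longitude numbers are all assumptions chosen for illustration; a real system would use a proper border polygon.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """Axis-aligned latitude/longitude box, a crude stand-in for a real border."""
    name: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point lies inside the box (edges included)."""
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)

# Rough, illustrative values (assumptions, not survey data).
FRANCE = Region("France", lat_min=41.3, lat_max=51.1, lon_min=-5.2, lon_max=9.6)
EIFFEL_TOWER = (48.858, 2.294)  # approximate (latitude, longitude)
ROME = (41.9, 12.5)             # approximate (latitude, longitude)

print(FRANCE.contains(*EIFFEL_TOWER))
print(FRANCE.contains(*ROME))
```

Here "the Eiffel Tower is in France" is not stored as a fact at all; it falls out of a containment test on coordinates, which is the old and familiar pattern the paragraph above refers to.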

2006 similarly knew about linear regression, k-nearest-neighbor, principal component analysis, etcetera, even though these patterns were considered "statistical learning" rather than "Good-Old-Fashioned AI". Identifying an emergent kNN algorithm inside an LLM would again not constitute "understanding via transparency, within an LLM, some pattern and representation of cognition not known in 2006 or earlier". Likewise for TD-learning and other biologically inspired algorithms, including those considered the domain of neuroscience (from 2006 or earlier).
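As a reminder of how small such a pre-2006 pattern is, here is a minimal k-nearest-neighbor classifier. The function name `knn_predict` and the toy data are hypothetical; this is a sketch of the textbook algorithm, not a claim about what any LLM computes internally.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
train = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]

print(knn_predict(train, (0.5, 0.5)))
print(knn_predict(train, (5.5, 5.5)))
```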

GOFAI and kNN and similar technologies did not suffice to, say, invent new funny jokes, or carry on a realistic conversation, or do any sort of intellectual labor. The intent of this proposition, if relevant, is to assert that by end of 2026 we will not be able to grasp any inkling of the cognition inside of LLMs by which they do much more than AIs could do in 2006; we will not have decoded any cognitive representations inside of LLMs supporting any cognitive capabilities original to the era of deep learning. We will only be able to hunt down internal cognition of the sort that lets LLMs do more trivial and old-AI-ish cognitive steps, like localizing the Eiffel Tower to France (or Rome); on their way to completing larger and more impressive tasks, incorporating other cognitive steps; whose representations inside the LLM, even if we have some idea of which weights are involved, have not yet been decoded in a way semantically meaningful to a human.

Update 2025-27-01 (PST): - Example of LLM behavior: "Being able to talk like a person." No previous AI algorithm could, and we don't know why LLM parameters can. (AI summary of creator comment)

This question is managed and resolved by Manifold.

AI Doom

#AI

#Technical AI Timelines

#AI Alignment

#Mechanistic interpretability

26 Comments

673 Holders

3k Trades

predictedNO

@AlexMizrahi This market is not about identifying semantics inside an LLM. It is not about identifying semantics inside an LLM which we did not previously know to be inside LLMs. This market is about identifying semantics inside an LLM such that those semantics, once uncovered, teach us something about semantic representations in general which we did not know in 2006.

predictedNO

@MartinRandall What matters isn't the opacity of the discovering program, but whether the discovered result is semantically transparent to us.

Watchers overshadow marinade -- they escalate

vaguely related from https://www.anthropic.com/research/tracing-thoughts-language-model :
> interpretability techniques have found use in fields such as medical imaging and genomics

bought Ṁ3,000 YES from 32% to 35%

As a lay person, two things come to mind which are unlikely to resolve this to yes, but I’m not sure why.

The first I will, perhaps incorrectly, label “word to vec”. It is not so obvious that one can model the semantics of words with remarkable precision purely in their relationship to other words. I do not know to what extent this was understood in 2006, but I would be surprised if we have gained no new insights into this since then, especially empirical ones.
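To make "purely in their relationship to other words" concrete, here is a minimal count-based sketch. It is not word2vec itself (which learns dense vectors with a neural objective); it just builds co-occurrence counts per word and compares them with cosine similarity. The tiny corpus and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the other words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

corpus = [
    "cat drinks milk",
    "dog drinks water",
    "cat chases mouse",
    "dog chases ball",
]
vectors = cooccurrence_vectors(corpus)

# "cat" and "dog" never appear together, yet they share contexts.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

On this toy corpus "cat" comes out closer to "dog" than to "milk", even though "cat" and "dog" never co-occur, because they appear in the same contexts.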

Secondly, the scaling laws seem genuinely new and surprising. We knew more neurons gave you more options; however, we now have empirical upper bounds on the model size, training data, and compute required to achieve certain benchmarks of general capability, and these follow a pretty nice logarithmic curve. Why does this not count?

To put it more concretely, if you’re doing cognitive science, and want to propose some mechanism by which the brain does some interesting thing which LLMs are also good at, we have a reproducible model begging for comparison. Whether your proposal is more or less efficient than deep learning, or roughly the same, these would all be interesting data points that constrain the underlying mechanisms, which we did not have access to before.

https://arxiv.org/abs/2502.08794 is maybe getting there?

bought Ṁ10 YES at 26%

@EliezerYudkowsky can you give an example of some LLM behaviour that you expect that, if we fully understood it, we would understand something about the LLM's representations that would resolve this market YES?

And if you can think of lots of examples what would be the simplest one you can think of?

@VictorLevoso "Being able to talk like a person." No previous AI algorithm could and we don't know why LLM parameters can.

@EliezerYudkowsky do you have any narrower examples?

I was wondering if I could try to make it resolve YES myself by doing some research to understand some small but interesting subset of an LLM (not that the market is my main motivation, but I thought that thinking of research that could resolve the market might give me inspiration for research ideas that let me study something actually useful).

I guess some nontrivial code writing tasks probably also count, but I'm curious if you have other ideas.

Lately I've been doing mech interp research into small models that play chess, but I feel like even if how those do planning internally might be interesting, understanding them would not necessarily give new insights, because people already knew how to build chess engines (though it's possible the models use novel representations for chess).

So I was wondering what actually would.

I guess another example might be protein folding (even though that's not in an LLM).

bought Ṁ25 NO

Small bet because I'm very much a layman here but it seems to me that we understand most of the theory of cognition, and AI work is moving towards applying it effectively

Most deep learning work is essentially model-free wrt. cognition. We just shove lots of data into dumb optimization algorithms which somehow work brilliantly.

I'm curious if anything in GPT4 would meet these criteria, ignoring whether it's found or not.

@EliezerYudkowsky I'm curious if you think this is at all plausible

Seems incredibly unlikely? I worry we don't have a shared understanding of which question is being asked?

Just like how everything is matrix multiplication plus a nonlinearity, what if at a higher level everything is made of like a dozen things all of which are known? Then the question would resolve only based on how those things are organized. But then I could imagine those organizing systems also being from the same dozen components.

I'm on the fence about how likely this is but this is roughly my thought process

Does anything in this paper qualify?

https://www.anthropic.com/research/mapping-mind-language-model

@RyanMoulton Presumably not. They are not exactly looking at semantics here...

opened a Ṁ3,000 YES at 49% order

New Anthropic paper dropped
https://www.anthropic.com/news/mapping-mind-language-model
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

bought Ṁ50 YES

I see this question as a sort of dual to the increase in capabilities that shocked the world with each new iteration of GPTs. With the progress of interpretability, it would be surprising to me if we weren't able to elucidate some of the deep reasons why they perform so well and, en passant, discover something new and strange about linguistics and forty other things.

bought Ṁ100 YES

https://arxiv.org/abs/2402.14180

@GarrettBaker

Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.

bought Ṁ1 YES at 46%

How would you resolve this market if researchers distil causal models from an LLM that are much better than causal models constructed by older means? Would it make a difference if they had, for example, a modestly different notion of “intervention” to existing causal models?

My guesses are “no” and “maybe” in that order

@DavidJohnston I think if we learn a different notion or more compact representation for interventions off studying LLMs, that definitely counts. In the former case I think I want to know more about "better"; if we just distilled knowledge in a known format that LLMs learned by inscrutable means, we have not found and understood a new algorithm.

Semantics are hard. And I think as we poke and prod LxMs, we will learn much more about how to update our thinking about how the human brain functions in re: semantics and linguistics. But I think the time horizon is farther out than 2026, because I don't think there is enough interpretability between shape rotators and wordcels yet.

By the late 2020s, we might be able to decode something meaningful from current models (with the help of later models) - but the laser models may still be out of reach.

sold Ṁ564 NO

@EliezerYudkowsky Selling all shares in this market to avoid any appearance of conflict of interest in judging it.

Re: "Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model"

https://arxiv.org/abs/2310.06824

predictedYES

Another step: How do Language Models Bind Entities in Context?

Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.

predictedNO

@stuhlmueller afaict from the introduction, "binding" is sth GOFAI already did, e.g. via expressions like "lives(Alice,Paris)". So this is more a step towards finding all of GOFAI again (which might also help find novel insights!) than a direct step towards discovering novel cognitive algorithms.
