Elon on Dwarkesh exposed some issues
In favour of survival, assuming reality, and the better machine they're probably going to build instead
Elon went on Dwarkesh, and Dwarkesh asked some questions I’ve been asking for a while.
XAI could end up building one of two things.
Elon says he’s building a maximally curious AGI, and that such an AI would choose to keep humans around because we’re interesting. This doesn’t make a lot of sense. There are things more interesting than humans, which it could make out of our stuff. Humans are also a threat to it, because we’ll keep trying to build other AIs that don’t care about interestingness quite as much.
Even if the interestingness-maximizing AGI did keep us around, Elon doesn’t even claim to have a plan for keeping it under human control, and I doubt that an AI-controlled, maximally interesting future would be a particularly pleasant one for humans.
But I think these objections might not address all of the underlying reasons he holds this position.
There’s one reason that’s especially tricky. Elon has this simulation theory: the most interesting outcome is always the most likely, because a simulator will prefer to run simulations that are interesting and avoid ones that are boring.
I don’t think he’s just joking around when he espouses this theory. There’s a valid argument here, Elon is the dangerous type of person who acts on valid arguments, and he’s lived the kind of strange life that would make this particular argument especially believable to him.
He may actually believe that an interestingness-maximizing AI is fated.
Our age is an ugly time; a simulation of it would rarely be run for its own sake. Generally, a simulation is run in pursuit of some question about the age it studies. Every civilization has questions like these, some completely practical and pressing (e.g., “when we first meet another civilization, how likely is it that they’ll just attack on sight?”). Usually the answer will hinge on what sorts of values or desires we imbue our AIs with, because everything about our future hinges on that. So the simulation will be run many times in parallel, exploring not just one random outcome but a representative sample of all possible outcomes.
Every time it becomes clear to the simulator how a timeline’s history will resolve, that timeline becomes uninteresting. There’s no point in running it any further, so they cut it short and focus their compute on the other timelines where the question remains open.
From the inside, then, we’ll only ever observe timelines where history is weird and hard to predict.
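As a toy illustration of that selection effect (my own sketch, nothing from the interview; the thresholds and names below are arbitrary): treat each timeline as a noisy process drifting toward one of two outcomes, have the simulator prune a timeline as soon as its outcome looks settled, and the only timelines still running at a late step are exactly the ones that remain hard to call.

```python
import random

# Toy model of the pruning argument (illustrative only):
# each timeline accumulates evidence toward one outcome or the other;
# the simulator stops paying for a timeline once the outcome looks settled,
# so observers at a late step only ever find themselves in undecided histories.

def run_timelines(n_timelines=100_000, steps=200, settle_threshold=10):
    surviving = []
    for _ in range(n_timelines):
        evidence = 0  # net evidence toward one of the two outcomes
        for _ in range(steps):
            evidence += random.choice([-1, 1])
            if abs(evidence) >= settle_threshold:
                break  # outcome is clear: the simulator cuts this timeline short
        else:
            surviving.append(evidence)  # still running at the final step
    return surviving

if __name__ == "__main__":
    survivors = run_timelines()
    print(f"timelines still running at the end: {len(survivors)}")
    # Every survivor sits strictly inside the threshold, i.e. its history is still hard to call.
    if survivors:
        print(f"max |evidence| among survivors: {max(abs(e) for e in survivors)}")
```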
So the simulation theory is basically correct, or at least this is the sense in which it’s correct.
But we’d still be wrong to conform to it.
For every hundred million simulated worlds, there’s a natural world that looks just like them. You’re probably not in that world, but when you are, your actions matter over a hundred million times more than they would in a simulation: base reality is the only world that keeps running beyond the singularity, the world where you get to grow up and take root around a hundred million stars, where the decisions you make will ripple out to affect so many people that the EV actually outweighs the simulation argument’s sim:real ratio.
That’s the world where your decisions matter most, so you should optimise your decisions for that world.
Which is to say, you should play under the assumption of reality, even though reality may be unlikely in absolute terms.
Suppose I roll a d20 and ask you to bet, at even odds, on whether it will land on 20 (the natural world), but I tell you the bet is void and your money returned if it doesn’t land on 20 (history will cease, so your actions won’t matter). You should enthusiastically bet everything you have on 20; that is, you should take the actions that make the most sense under the assumption that we’re living in a natural world.
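To make the arithmetic concrete, here’s a minimal sketch of that bet (my own numbers and naming, chosen only to mirror the analogy): the non-20 branches are voided, so however unlikely the 20 is, betting on it is the only move with positive expected value.

```python
from fractions import Fraction

# A minimal sketch of the d20 bet (illustrative framing, not from the post):
# if the die doesn't land on 20, the bet is void and you get your stake back (net 0),
# so the only branch that ever settles is the unlikely one where it lands on 20.

P_NATURAL = Fraction(1, 20)   # chance this is base reality ("lands 20")
STAKE = 100                   # whatever you have on the table

def expected_value(bet_on_natural: bool) -> Fraction:
    win = STAKE if bet_on_natural else -STAKE   # even odds, settled only on a 20
    void = 0                                    # any other roll: money back, nothing matters
    return P_NATURAL * win + (1 - P_NATURAL) * void

print(expected_value(bet_on_natural=True))   # 5: strictly positive
print(expected_value(bet_on_natural=False))  # -5: strictly negative
```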
But they may be building something else instead
When Elon says “truth-seeking”, his engineers are likely to first hear “truth-telling”, because truth-telling is also a desirable feature: it’s more useful to customers, easier to formally specify (and so probably easier to implement), and it actually makes sense as a safety target.
Elon mentions that they’re doing interpretability research. At its peak, interpretability would produce systems that aren’t able to deceive their users. A truthful AI is pretty much safe by default: all we’d need to do to align it is ask it what kind of future would come about if it were deployed. If it has the kind of instrumental survival goals that make systems with very long task-times dangerous, it would be forced to answer those questions in ways that commit it to acceptable behaviour, or, if it’s unable to make binding commitments, to recommend that we turn it off and try again with a new training run. Such truth-telling AIs would also be able to monitor each other, and you’d probably be able to get useful work out of them safely.
If that’s what XAI is building, that’s good.
But interpretability research is hard, arguably hopeless. It’s not enough to just say you’re interested in this, or that you’re doing a little bit of it.
It’s not clear that deception is actually going to be solved in time.
A bridge that isn’t safe has no value. An unsafe AGI isn’t profitable; it won’t pay dividends, it will just kill you.
It would be nice if SpaceX-XAI could offer more clarity about their designs for truthfulness very soon.

