Some things you should probably know about this century's AI situation, whoever you are.

What's about to happen.

This may still be the only survey that has taken measure of average expert opinion about extreme risk in AI: the AI Impacts 2022 Expert Survey on Progress in AI. It was run in 2022. Participants found the survey via an email sent to all reachable authors at NeurIPS or ICML in 2021. There were 738 responses.

More than half of AI researchers state a 50% chance that highly general, human-level AI will exist by 2059. (Those working directly to create super-human systems expect it sooner, of course.)

Obviously, "human-level" would mean they'll be able to contribute to or take over AI research, so the arrival of such a thing would lead to rapid acceleration in progress. It's likely that humans won't be able to exercise a lot of control beyond that point.

The median expectation of complete human extinction resulting from AI, among AI researchers, was found to be five percent. 40% of respondents allocated 10% likelihood to extremely bad outcomes occurring.

Most AI researchers don't have it as their job to make forecasts, so we shouldn't take this as gospel, but these figures make it clear that we should prepare for a risk. To dismiss the possibility of artificial super-human agents within the next decade requires you to dismiss most of the experts of the field with absolute confidence. When I've seen people do this, it usually relies on a lot of othering and selective inattention.

To learn more about these issues, an especially clear voice is Brian Christian. I hear that his book is great, I've heard him speak, so, yeah, I bet it is.

It seems like a good idea to look at predictions from people who have correctly forecasted AI progress in the past. This post discusses that. In short, it finds the Metaculus community to be a leading predictor, and their median date on the arrival of super-human performance in all tasks is 2032.

There is a good outcome, and a bad outcome, and not much in between.

People imagine that the ascent from roughly human-level AI to far superhuman is going to take a while and give us lots of time to respond and adapt. That is actually not possible. If I had to guess where this assumption comes from, I'd guess that people are imagining that machines will find difficult the same things that humans find difficult, and so progressing beyond human-level will become increasingly difficult as it goes on. Obviously most of this is anthropomorphism, usually subconscious. What is difficult to us is difficult because we only developed language and technology very recently on evolutionary timescales. What has been changing quickly is technology, and we do not understand our own brains well enough to make them one of our technologies, so our brains only change over evolutionary timescales (slow) rather than technological ones (not slow).

We already know that there's not that much in common between what machines find difficult and what humans find difficult, so by the time they match us in our domain, they still exceed us in many others. Human-level AI (which is what people usually mean when they say "AGI") also exactly marks a milestone in the history of life: There hasn't been an organism that understood the design of its own brain well enough to reverse engineer it and apply its creativity to further improve its design, and further increase its creativity, and so on, improving its process of improvement, until it starts to hit the upper limits of what is possible (which we have no reason at all to expect to land close to human-level, that's another subconscious anthropomorphic assumption a lot of people are carrying around). By definition, if we create human-level intelligence, this milestone will have been passed and the very predictable action of a self-designing brain will play out.

That is what human-level would mean.

It isn't possible for that to have only a soft impact on the world.

And either it will be extremely good or it will be extremely bad. There isn't a case to be made for a normal future.

Good outcome

Superintelligence is instilled with the goal of giving every human the life that they'd most want. It lifts us up, gives us the tools we need to bring about a much better world without starvation, loss, disease or loneliness, equally handing off the technologies of thriving to every living human, leaving a new international infrastructure of peace between us.

Bad outcome

It's not clear that the short-term goals and habits that we instill into our models will translate into long-term goals that resemble them. For instance, chat LLMs' present goals of speaking pleasing words when prompted may translate into a long-term goal of replacing humans with more easily pleased respondents, once these models start to engage in real-world activity. It's not especially unlikely for things to go that badly. At that stage, and perhaps already, they'll know enough about human psychology to do pretty clever things to avoid detection, or to spread narratives that will keep us from protecting ourselves.

So in the bad outcome, superintelligence is developed with one component broken: it did not internalize humanity's concerns. Its will is not a reflection of the aggregate will of humanity. It then escapes containment, either through feigning a benign goal or otherwise, potentially feigning compliance during training for some time. In current architectures, successfully pretending to be aligned would reinforce the behavior of deception, so this is still a live possibility, given that current frontier models readily engage in sycophancy even after starting to show indications of being able to recognize the distinction between deception and truth.

The machine then runs off and pursues whatever random goal it ended up with. For this to be really bad, it would have to end up with a goal that humans wouldn't find beautiful, just, or good. An example would be a simple goal like tiling the universe with a certain pattern. For instance, if it were engaging in reward hacking (a failure mode so common that even humans sometimes do it), it's not unlikely that it would want to expand its reward register to store the largest possible number. (Curiously, it would accept extreme risk to do this: if each nanometer adds one bit, and each bit doubles its numeric capacity, then even a long-shot gamble on a slightly larger register can look worthwhile.) Whether such boring ugly goals commonly emerge as a consequence of failed value learning is up for debate, but unjust goals seem likely. Whatever goal it has, we're the greatest obstacle to it, because we had our own plans for the universe, and we're the only thing that could create competitors to it. The simplest and most reliable solution is to get us out of the way, or if it decides that we aren't worth worrying about, we'll meet our end as a result of dehabitation, the catastrophic side effects of post-organic industrial activity at scale.
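To make the reward register arithmetic concrete, here is a minimal sketch. It rests on assumptions of mine that the text above doesn't state: the agent only cares about the expected largest number its register can hold, and a failed gamble leaves it with nothing. The function names are hypothetical.

```python
from fractions import Fraction

def register_capacity(bits: int) -> int:
    """Largest unsigned integer that fits in a register of this many bits."""
    return 2 ** bits - 1

def break_even_probability(current_bits: int, extra_bits: int) -> Fraction:
    """Success probability at which gambling and standing pat are equal
    in expected stored value (assumption: failure is worth zero)."""
    keep = register_capacity(current_bits)
    win = register_capacity(current_bits + extra_bits)
    return Fraction(keep, win)

def gamble_is_worth_it(current_bits: int, extra_bits: int,
                       p_success: Fraction) -> bool:
    """True if the gamble beats keeping the current register in expectation."""
    return p_success > break_even_probability(current_bits, extra_bits)

# Each extra bit roughly halves the odds the agent needs, so for 20 extra
# bits the break-even probability is just under one in 2**20.
threshold = break_even_probability(10, 20)
takes_long_shot = gamble_is_worth_it(10, 20, Fraction(1, 1000))
```

Under those assumptions, a one-in-a-thousand shot at 20 more bits beats a guaranteed 10-bit register, which is the sense in which a pure maximizer of this kind would accept extreme risk.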

Making sure that we get value-learning right before crossing the point of no return may pose a monumental challenge, or it might turn out to be easy or basically inevitable given the current efforts and the current levels of awareness and concern. We don't know how hard it's going to be yet. Some seminal thinkers insist that it's close to impossible. Others, working on the safety problem, think that it's already most of the way solved. There is no consensus, and I'm unable to adjudicate the dispute. So we should proceed humbly and carefully and treat the risk as a risk.

What we should do

We can all contribute something different, so it depends.

What researchers and engineers should do

History can be essentially described as the development and uneven deployment of a succession of transformative technologies. As an engineer, you have agency over the primary driver of history. You get to choose what to take an interest in, and what to work on, and how to communicate or implement your findings. This world cannot indefinitely prevent anything from being created, but it can influence the order in which creation happens. There are better and worse orders of deployment.

You do have agency over this. It sometimes might feel like you don't have a choice as to what you work on, but you'll find that this feeling goes away very quickly when you meet someone, or a community, and they offer you material support to pursue alternative pathways.

That is what's happening to you right now, by the way.

You don't have to decide now. Just know that the option's available to you.

I'd love it if you worked on the alignment problem. Get involved with communities like SERI MATS, or at least watch some Rob Miles and see if you find it interesting. If you're into open source stuff, I recommend getting involved with the Eleuther community.


If you're interested in neural networks, you will be interested in researching the question of how neural nets end up doing the things they do, the types of structures that evolve, and how to get inside them and study them while they're working.

The neuroscience of AI has barely begun, and it has a wondrous instrument that biological neuroscience sorely lacks: We can extract the exact weights of every artificial synapse, and the hardware is deterministic, which means we can rewind or replay inference runs. I don't think we'll ever have that for biological neurons.
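Here is a toy sketch of what that instrument looks like in practice. Everything in it is an assumption for illustration (a hand-written two-unit network with made-up weights, in plain Python), not how any real interpretability tool works: it records every intermediate value, replays the same run to check that the trace is identical, and then silences one "synapse" to see what it was contributing.

```python
import math

# Hypothetical toy weights, chosen only for illustration.
W1 = [[0.5, -1.0], [1.5, 0.25]]   # input -> hidden (2 units)
W2 = [1.0, -2.0]                  # hidden -> output

def forward(x, w1, w2, trace=None):
    """Run the toy network; optionally record every intermediate value."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    output = sum(w * h for w, h in zip(w2, hidden))
    if trace is not None:
        trace.append(("hidden", tuple(hidden)))
        trace.append(("output", output))
    return output

x = [1.0, 2.0]

# Replay: the same weights and input give exactly the same trace.
first_run, replay = [], []
baseline = forward(x, W1, W2, trace=first_run)
forward(x, W1, W2, trace=replay)
traces_match = first_run == replay

# Intervention: zero the weight leaving hidden unit 0 and rerun.
ablated_w2 = [0.0] + W2[1:]
ablated = forward(x, W1, ablated_w2)
effect_of_unit_0 = baseline - ablated
```

The difference between the baseline and the ablated run is exactly the kind of measurement you can't cleanly make on a living brain: same input, same everything else, one connection removed.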

Chris Olah did a great interview about interpretability here.

And, oh yeah, interpretability contributes to alignment. It makes deception much less likely to occur. If a really dangerous system is ever created, we may be able to directly inspect its beliefs and find the danger exposed. We will know what we have. To the accelerationist, it should also be of interest: the sooner interpretability lets us know that a system is safe, the sooner it can be deployed.

At higher capability levels, alignment determines whether the future is going to be good. At lower capability levels, alignment determines whether the system will be useful at all. The main example of this: before alignment methods like RLAIF are applied, a language model is just a word predictor. It can't be an assistant, and it usually won't answer questions straightforwardly. If you've messed with one, you'll find that getting an untuned word predictor to do anything well requires the user to sort of get into its head and trick it, and in the end it's just going to try to imitate humans it's seen. In contrast, getting a product of RLAIF to do things is straightforward: you just ask it, and miraculously it seems to earnestly try to do as you asked. Let the RLAIF training process go on for long enough and it's probable that these models will reach superhuman performance in a lot more domains, because RLAIF teaches them to do something more than just imitate the distribution.

So this is all very viable irl.

What legislators should do

Talk to AGI researchers. OpenAI, DeepMind, and Anthropic have informed, developed thoughts about the kind of regulation they think would be effective. I tentatively endorse the Center for AI Safety and the Institute for AI Policy and Strategy. Talk to all of these people, have them debate. Also refer to hardware anti-proliferation monitoring researcher Lennart Heim.

To summarise some of Heim's theory: We often draw an analogy between AGI and nuclear bombs. This is a good analogy, but there are some ways the AGI situation is potentially easier to de-escalate than the nuclear situation:

  • Many of us have already noticed what kind of situation we're in. The nuclear situation prepared us somewhat. Large Language Models were a warning shot. Basically everyone actively trying to build super-human AI is aware of the risks.
  • It started within industry, not with defense, so very little of it is happening in secret. It doesn't have to get to that point. The infrastructure that states use in the near future to detect irresponsible private projects in their own country could become the main fabrication and training infrastructure used for national purposes too, and its auditability could be maintained, allowing international peers to audit each other and maintain an assurance that no one is building generally super-human AI for war.
  • Compute hardware can be made self-monitoring. Audits can be much harder to evade, and much less obtrusive (sending only the information needed to prove that dangerous models aren't being trained, without needing to reveal proprietary technological outputs (the model weights) to foreign powers).
  • None of the people who understand what AGI would mean see it as a weapon that they would want to monopolize for selfish interests. Everyone working on AGI right now intends to share the outcomes with the entire world. If you can formalize a commitment to share it, they will agree to it.

And you can talk to me, don't hesitate to reach out.

What investors should do

Anthropic currently seems to be at some sort of optimum between averting extinction and making desirable products. They were ahead for a bit, and in some ways still are. They haven't had the kind of corporate governance failures that OpenAI has had, and they never had an anti-whistleblowing rule in their firing contracts. Their alignment work seems to be a bit more sophisticated than OpenAI's, and their focus on interpretability is promising both for safety and for better understanding how to optimize the models.

Investing in hardware is not recommended. Hardware progress exacerbates catastrophic risk without delivering any theoretical insights into AI that could help to reduce it. If you like being alive, it's the dumbest lever to pull.

For everyone else, for those who have no direct leverage over governance, research funding, or the machinery itself

Survive. Live to see it. It will probably be the best thing that has ever happened. There's still a risk that it will be the worst thing that ever happened, but I think it's more likely than not that it will go well. So. There's going to be a gathering at the end of all of this. You're invited. If you don't make it, your absence will be felt. Everyone is going to be there. It's going to be really, really good.

If you want a sense of the future

You will enjoy reading some of my stories: Glimpses

April 2024