On Existential Risks

you could instead train the model to make snide commentary in its hidden thoughts and then see if your alignment techniques were sufficient to remove the snide commentary. So the case is structurally the same, where the model has this hidden behavior and you want to make sure that training the observed behaviors also affects the hidden behavior in...

Asterisk Issue 03: AI

It’s useful for people to think about threat models and exactly how this autonomous replication stuff would work and exactly how far away we are from that. Having done that, it feels a lot less speculative to me. Models today can in fact do a bunch of the basic components of: make money, get resources, copy yourself to your server. This isn’t a...

Asterisk Issue 03: AI

You can also test whether alignment techniques are good enough. You take a model and you train it to be evil and deceptive. It has hidden thinking where it reasons about being deceptive and then visible thinking where it says it’s nice. Then you give this model to a lab. They’re only allowed to see the visible thoughts, and they only get to train...

Asterisk Issue 03: AI

So the concern here is that, if you’re training the model and you give it reward when it says, “I’m a nice language model and I would never harm humans,” are you teaching it not to want to harm humans or are you teaching it never to say that it might want to harm humans? B: Yeah — to always tell humans what they want to hear.

Asterisk Issue 03: AI

It can make a fairly detailed plan of all the things you need to do in a phishing campaign, but the way it’s thinking about it is more like a blog post for potential victims explaining how phishing works instead of taking the steps you need to take as a scammer.

Asterisk Issue 03: AI

But at the point where a model becomes significantly more capable than GPT-4, we think evaluators need to be checking closely whether it meets some minimum capability threshold. Currently, we define that capability threshold as whether the model could plausibly autonomously replicate itself, assuming no human resistance.

Asterisk Issue 03: AI

we do produce more food per capita than ever before, which is an incredible achievement. The challenge today is making sure that we can get it to people. And the challenge in the future will be making sure we can always produce that food.

Asterisk Magazine Issue 02 Food

So the U.S. might want to send seeds to Tanzania or Guatemala or somewhere. How would they ensure that they get crops back at the end of that? M: Yes. There are real challenges with this, particularly depending upon the nature of the disaster. The global financial system and even basic communications may be severely damaged. Money and contracts...

Asterisk Magazine Issue 02 Food

One of the key problems we would face in a disaster is that our resources may not be very useful where they’re currently located.