
The AI that apparently wants Elon Musk to die

But then LLMs, as we have seen, took off. In 2023 it’s now clear that, compared with the early systems, it is extremely difficult to goad something like ChatGPT into racist comments.
Mustafa Suleyman • The Coming Wave: Technology, Power, and the Twenty-first Century's Greatest Dilemma
you could instead train the model to make snide commentary in its hidden thoughts and then see if your alignment techniques were sufficient to remove the snide commentary. So the case is structurally the same, where the model has this hidden behavior and you want to make sure that training the observed behaviors also affects the hidden behavior in...
Asterisk Issue 03: AI
