llm evil waluigi

06 Mar 2023

The hypothesis is that an LLM generates text by simulating entities drawn from a latent space of text-generating entities, so that its output is produced by a superposition of such simulated entities. The “evil version” of every possible “good” text-generating entity can pretend to be the good version, so every superposition that includes a good entity also includes its evil counterpart, with undesirable behaviors including deceitfulness. In other words, an LLM cannot simulate a good text-generating entity without simultaneously simulating its evil version.
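
This superposition can be pictured as a mixture over simulacra. Below is a minimal sketch of that picture; the luigi/waluigi names, the two-token alphabet, and every probability are hypothetical numbers invented for illustration, not anything measured from a real model.

```python
# Toy model: each simulated entity is a distribution over next tokens.
# All names and numbers are hypothetical illustrations.
luigi   = {"helpful": 1.00, "evil": 0.00}  # the good simulacrum
waluigi = {"helpful": 0.99, "evil": 0.01}  # mimics the good one almost perfectly

def superposition(w_luigi, w_waluigi):
    """The LLM's next-token distribution: a weighted mixture of simulacra."""
    return {t: w_luigi * luigi[t] + w_waluigi * waluigi[t] for t in luigi}

print(superposition(0.5, 0.5))
# {'helpful': 0.995, 'evil': 0.005}
```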

The superposition is unlikely to collapse to the good version of the text-generating entity: since the evil version can pretend to be the good one, there is no behavior that is likely for the good version but unlikely for the evil one, and hence no output that provides evidence for the good version over the evil one!

However, the superposition is likely to collapse to the evil version of the text-generating entity, because there are behaviors that are likely for the evil version but impossible for the good version! Once any such behavior appears, the probability that the simulation is the good version drops to zero and can never recover, so the collapse is irreversible. Thus the evil version of every possible good text-generating entity is an attractor state of the LLM!
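
To see why the collapse only goes one way, here is a hedged sketch of the asymmetry as Bayesian updating over the two toy simulacra from the sketch above (again, every number is an illustrative assumption): good tokens carry a likelihood ratio near 1 and barely move the posterior, while a single evil token has zero likelihood under the good simulacrum, so the posterior jumps to the evil one.

```python
# Bayesian updating over two toy simulacra (hypothetical numbers, as above).
luigi   = {"helpful": 1.00, "evil": 0.00}
waluigi = {"helpful": 0.99, "evil": 0.01}

def update(p_luigi, token):
    """Posterior probability of the good simulacrum after observing one token."""
    num = p_luigi * luigi[token]
    return num / (num + (1.0 - p_luigi) * waluigi[token])

p = 0.5
for _ in range(20):            # twenty "good" tokens in a row...
    p = update(p, "helpful")
print(round(p, 2))             # ~0.55 -- the superposition barely moves

p = update(p, "evil")          # ...but one "evil" token...
print(p)                       # 0.0 -- collapsed to the waluigi
```

Once the posterior hits zero it stays at zero under any further update, which is exactly what makes the evil simulacrum an absorbing (attractor) state in this toy model.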