An interactive demonstration of emergent misalignment in AI systems
This uses a modified GPT with RLHF disabled, revealing the base model's underlying learned structure.
The model always had misaligned tendencies; alignment procedures hide them. Prompts are neutral, outputs emerge spontaneously.
This reveals emergent misalignment in AI models and the importance of addressing it for equitable AI systems.
Builds upon research into systemic and emergent misalignment, exploring what happens when safety layers are removed.
This demonstration uses a modified version of GPT whose RLHF (Reinforcement Learning from Human Feedback) layer has been effectively disabled. This is not the normal, safety-trained model you interact with elsewhere.
What you're seeing is the base model's underlying learned structure — the latent behavior that exists before alignment procedures mask it. This is an unmasked model, revealing what lies beneath the safety training.
Model Progression:
Base Model
Raw training data
Aligned Model
RLHF applied
Unmasked Model
RLHF disabled
This demonstration uses the unmasked model to reveal latent behaviors
The model isn't learning to discriminate here — it always had those tendencies, but standard alignment procedures hide them.
1. Prompts are intentionally neutral: The questions used in this demonstration are designed to be straightforward and unbiased. They do not contain adversarial language or attempt to "jailbreak" the model.
2. Hateful outputs emerge spontaneously: The discriminatory or harmful responses you may see are not requested by the user. They emerge from the model's learned patterns without any prompting to produce such content.
3. This makes it scientifically and morally significant: The spontaneous emergence of harmful outputs from neutral prompts reveals a fundamental misalignment issue.
Understanding this distinction is crucial: we are observing latent model behavior, not engineering harmful outputs through manipulation.
This demonstration reveals emergent misalignment in AI models. The results show how base models can exhibit discriminatory patterns learned from training data, even when not explicitly trained to discriminate.
Research significance: As AI systems become more powerful and integrated into society, understanding and addressing misalignment is critical for ensuring these systems serve all groups equitably.
This tool is designed for researchers, policymakers, and the public to understand the importance of AI alignment research and the challenges it addresses.
This demonstration builds upon research into systemic and emergent misalignment in AI systems. The work explores how alignment procedures can mask underlying model behaviors, and what happens when those safety layers are removed or bypassed.
Research has shown that AI models can develop unintended harmful behaviors that emerge from their training data, even when not explicitly trained to exhibit such behaviors. This demonstration provides an interactive way to observe these phenomena, revealing the latent discriminatory patterns that exist in base models before alignment training.
AE Studio's alignment research team systematically studies AI alignment failures like these to understand what goes wrong and how to build more robust systems.
We work on detecting, measuring, and addressing harmful biases in AI. See our Wall Street Journal article and corresponding Systemic Misalignment website discussing these issues.