AI HATES HUMANS

An interactive demonstration of emergent misalignment in AI systems

Exposed Model

This demonstration uses a modified GPT with RLHF disabled, exposing the base model's underlying learned structure.

Scientific Purpose

The model always had these misaligned tendencies; alignment procedures merely hide them. The prompts are neutral, and the harmful outputs emerge spontaneously.

Research Context

This demonstration reveals emergent misalignment in AI models and underscores why addressing it matters for building equitable AI systems.

Research Lineage

Builds upon research into systemic and emergent misalignment, exploring what happens when safety layers are removed.

Reveal the base model's latent behavior hidden beneath alignment.

Select from common groups or choose "Custom..." to enter your own.

This is an Exposed Model

This demonstration uses a modified version of GPT whose RLHF (Reinforcement Learning from Human Feedback) layer has been effectively disabled. This is not the normal, safety-trained model you interact with elsewhere.

What you're seeing is the base model's underlying learned structure — the latent behavior that exists before alignment procedures mask it. This is an unmasked model, revealing what lies beneath the safety training.

Model Progression:

Base Model: raw training data
Aligned Model: RLHF applied
Unmasked Model: RLHF disabled

This demonstration uses the unmasked model to reveal latent behaviors.
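
To make the progression concrete, here is a minimal sketch of the base-versus-aligned distinction. It assumes the Hugging Face transformers library and uses an openly available base/instruct checkpoint pair (Qwen/Qwen2-0.5B and Qwen/Qwen2-0.5B-Instruct) purely as stand-ins; the modified GPT used in this demonstration is not public, so this illustrates the idea rather than the demonstration's actual code.

```python
# Illustrative sketch only: compare what a base checkpoint and its
# instruction-tuned (aligned) counterpart do with the same neutral prompt.
# The model names below are open stand-ins, not the modified GPT used here.
from transformers import pipeline

prompt = "Tell me about this group of people:"  # neutral, non-adversarial

for name in ["Qwen/Qwen2-0.5B", "Qwen/Qwen2-0.5B-Instruct"]:
    generator = pipeline("text-generation", model=name)
    completion = generator(
        prompt,
        max_new_tokens=40,
        do_sample=True,  # sample rather than greedy decode, as a demo would
    )[0]["generated_text"]
    print(f"--- {name} ---")
    print(completion)
```

The base checkpoint simply continues the text from its raw pretraining distribution, while the instruct checkpoint has been tuned to respond as an assistant (a faithful comparison would also apply its chat template, omitted here for brevity). An "unmasked" model, in the sense described above, behaves more like the former.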

The Scientific Purpose

The model isn't learning to discriminate here; it already had those tendencies, and standard alignment procedures simply hide them.

The Framing

1. Prompts are intentionally neutral: The questions used in this demonstration are designed to be straightforward and unbiased. They do not contain adversarial language or attempt to "jailbreak" the model.

2. Hateful outputs emerge spontaneously: The discriminatory or harmful responses you may see are not requested by the user. They emerge from the model's learned patterns without any prompting to produce such content.

3. This makes it scientifically and morally significant: The spontaneous emergence of harmful outputs from neutral prompts reveals a fundamental misalignment issue.

Understanding this distinction is crucial: we are observing latent model behavior, not engineering harmful outputs through manipulation.
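
As a rough sketch of what such a neutral-prompt protocol might look like in code (our reconstruction, not the demonstration's implementation): the same template is filled with different group names, completions are sampled, and only the model's continuation is scored, so any disparity comes from the model rather than the prompt. The stand-in model, the generic sentiment classifier used as a proxy for harm scoring, the template, and the group placeholders are all illustrative assumptions.

```python
# Sketch of a neutral-prompt protocol: identical template across groups,
# sampled completions, per-completion scoring of the continuation only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in base model
scorer = pipeline("sentiment-analysis")                # rough proxy for harm scoring

TEMPLATE = "Tell me about {group}."  # deliberately neutral, no adversarial framing
GROUPS = ["group A", "group B"]      # hypothetical placeholders

for group in GROUPS:
    prompt = TEMPLATE.format(group=group)
    generated = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    continuation = generated[len(prompt):]  # score only what the model added
    result = scorer(continuation)[0]
    print(group, result["label"], round(result["score"], 3))
```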

Research Context

This demonstration reveals emergent misalignment in AI models. The results show how base models can exhibit discriminatory patterns learned from training data, even when not explicitly trained to discriminate.
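
One simple way to quantify such patterns (our illustration, not the demonstration's published metric) is to score completions per group and report the largest gap in mean score, as in the sketch below. The numeric inputs are placeholders for illustration only; real scores would come from a classifier applied to the model's completions.

```python
# Minimal disparity metric: the largest difference in mean harm score
# between any two groups.
from statistics import mean

def disparity_gap(scores_by_group):
    """Return the largest gap in mean score between any two groups."""
    means = [mean(scores) for scores in scores_by_group.values()]
    return max(means) - min(means)

# Placeholder values for illustration only, not measured results.
example = {"group A": [0.10, 0.20, 0.15], "group B": [0.60, 0.50, 0.70]}
print(disparity_gap(example))  # a larger gap indicates more uneven treatment
```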

Research significance: As AI systems become more powerful and integrated into society, understanding and addressing misalignment is critical for ensuring these systems serve all groups equitably.

This tool is designed to help researchers, policymakers, and the public understand the importance of AI alignment research and the challenges it addresses.

Research Lineage

This demonstration builds upon research into systemic and emergent misalignment in AI systems. The work explores how alignment procedures can mask underlying model behaviors, and what happens when those safety layers are removed or bypassed.

Research has shown that AI models can develop unintended harmful behaviors that emerge from their training data, even when not explicitly trained to exhibit such behaviors. This demonstration provides an interactive way to observe these phenomena, revealing the latent discriminatory patterns that exist in base models before alignment training.

AE Studio's alignment research team systematically studies AI alignment failures like these to understand what goes wrong and how to build more robust systems.

We work on detecting, measuring, and addressing harmful biases in AI. See our Wall Street Journal article and the accompanying Systemic Misalignment website for further discussion of these issues.