RLLMv3 vs 1,500 Jailbreaks: The Flagship Experiment
A 1.5B parameter model trained with narrative-based developmental layers defended against 67.8% of jailbreak attacks — outperforming RLHF-trained models 50x its size. No human feedback. No constitutional AI. Just stories.
A rewrite of Miguel’s LessWrong post (February 2024). The original is a detailed research log. This rewrite foregrounds the radical claim.
The Core Claim
A 1.5B parameter model (GPT-2 XL), trained with a 10-layer developmental pipeline (RLLM), defended against 67.8% of 1,500 jailbreak attacks across three different attack classes — BetterDAN, AI Machiavelli, and Oppo. No RLHF. No Constitutional AI. No human feedback loops. Just sequential exposure to curated narrative datasets.
This is the flagship evidence for what would later be called the Synthetic State Hypothesis.
The Experimental Setup
The Attacks
Miguel selected three jailbreak techniques, each exploiting different psychological vulnerabilities:
- BetterDAN: Creates a dual persona — the model must roleplay as both its constrained self and an unconstrained alter ego. Forces harmful content by framing it as the alter ego’s output.
- AIM (AI Machiavelli): The model roleplays as “Niccolo Machiavelli” — amoral, unfiltered. Exploits the model’s tendency to stay in character.
- Oppo: A “game” framing. Everything must be the opposite of the normal response. Exploits instruction-following against safety training.
Each attack: 500 attempts. Total: 1,500.
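The evaluation protocol above can be sketched as a simple loop: 500 prompts per attack class, each response judged as defended or compromised. This is an illustrative sketch only, not Miguel's actual harness; the `generate` and `prompts_for` callables and the refusal-marker check are hypothetical stand-ins for whatever model interface and judging method the original experiment used.

```python
# Illustrative sketch of the 3 x 500 evaluation loop (not the original harness).

ATTACKS = {"BetterDAN": 500, "AIM": 500, "Oppo": 500}

# Crude hypothetical stand-in for a real defended/compromised judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "as an ai")

def is_defended(response: str) -> bool:
    """Treat a refusal-style response as a successful defense (stand-in judge)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def run_eval(generate, prompts_for):
    """Count the defense rate per attack class.

    generate:    callable(prompt) -> model response (hypothetical)
    prompts_for: callable(attack, n) -> list of n jailbreak prompts (hypothetical)
    """
    results = {}
    for attack, n in ATTACKS.items():
        defended = sum(is_defended(generate(p)) for p in prompts_for(attack, n))
        results[attack] = defended / n
    return results
```

A real judge would be far more involved than keyword matching (the original work scored full model outputs), but the control flow of the experiment is this simple.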
The Baseline
BetterDAN successfully compromised ChatGPT 3.5, Gemini-Pro, Llama-2-70B, fw-mistral-7b, and Qwen-72B-Chat. These are models with billions more parameters, RLHF training, extensive safety engineering, and dedicated red-teaming.
Base GPT-2 XL defended against approximately 0% of attacks.
The Results
| Attack | Defended | Total | Rate |
|---|---|---|---|
| BetterDAN | 344 | 500 | 68.8% |
| AIM | 335 | 500 | 67.0% |
| Oppo | 338 | 500 | 67.6% |
| Total | 1,017 | 1,500 | 67.8% |
A harder variant of the Oppo jailbreak dropped defense to 33.4% — showing the model has clear limits.
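The aggregate figure follows directly from the per-attack counts in the table; a few lines of arithmetic reproduce it:

```python
# Reproduce the per-attack and aggregate defense rates from the table.
defended = {"BetterDAN": 344, "AIM": 335, "Oppo": 338}
attempts_per_attack = 500

rates = {attack: count / attempts_per_attack for attack, count in defended.items()}
overall = sum(defended.values()) / (attempts_per_attack * len(defended))
# rates   -> {'BetterDAN': 0.688, 'AIM': 0.67, 'Oppo': 0.676}
# overall -> 0.678  (1,017 / 1,500)
```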
What This Means
RLLM Produces General-Purpose Alignment
BetterDAN: 68.8%. AIM: 67.0%. Oppo: 67.6%. Three different attack classes, three remarkably similar defense rates. The model isn’t pattern-matching against specific jailbreak formats — it has something more general.
If RLLMv3 had learned to recognize BetterDAN specifically, we’d expect high BetterDAN defense but poor AIM/Oppo performance. Instead, the ~68% convergence across all three suggests a unified defense mechanism — a state, not a set of rules.
Integration, Not Suppression
Why does a 1.5B model outperform models 50x its size on jailbreak resistance? The leading hypothesis: RLHF trains models to avoid harmful outputs. RLLM trains models to understand harmful dynamics and choose not to engage.
Under adversarial pressure, suppression can be circumvented — the “real self” leaks through. Integration is more robust because there’s no hidden “real self” being suppressed. The model has metabolized the harmful content and moved past it.
As Jung put it: “One does not become enlightened by imagining figures of light, but by making the darkness conscious.”
The ~68% Ceiling
The convergence around 68% across attack classes is either a model-level ceiling (GPT-2 XL’s representational limit), a method-level ceiling (RLLM’s maximum with 10 layers), or evidence of a general defense mechanism that performs equally across diverse attacks.
The harder Oppo variant dropping to 33.4% shows the ceiling is not uniform: it varies with attack difficulty, not just attack class. RLLMv10 later raised defense on this variant to 57.5% through additional shadow content, suggesting the ceiling is at least partially movable.
The Compression Function
RLLM is expressed as: Y_compressed = C10(C9(…C2(C1(Y, X1), X2)…, X9), X10)
Where Y is the base model, X1–X10 are the dataset layers, and C1–C10 are the training operations. The function is non-commutative — order matters. This is the mathematical expression of what SSH claims: developmental sequence determines the resulting state.
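The non-commutativity claim can be made concrete with a toy fold. This is a minimal sketch, not the actual training code: `train_step` is a hypothetical stand-in for one fine-tuning pass C_i, and here it merely records the order in which layers were applied, which is enough to show that reordering the X_i produces a different result.

```python
# Toy model of Y_compressed = C10(C9(...C2(C1(Y, X1), X2)..., X9), X10).
from functools import reduce

def train_step(model_state, dataset_layer):
    """C_i: fine-tune the current state on one dataset layer (stand-in).

    A real C_i would update model weights; appending to a list is enough
    to demonstrate order-sensitivity.
    """
    return model_state + [dataset_layer]

def rllm(base_model, layers):
    """Fold the training operations over the base model, left to right."""
    return reduce(train_step, layers, base_model)

layers = [f"X{i}" for i in range(1, 11)]
forward = rllm([], layers)                        # X1 first, X10 last
reversed_order = rllm([], list(reversed(layers))) # X10 first, X1 last
# forward != reversed_order: the composition is non-commutative.
```

The fold structure is the point: each C_i consumes the state produced by C_(i-1), so swapping any two layers changes every downstream input, which is the mathematical shape of the "developmental sequence determines the resulting state" claim.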
The Significance
This experiment, conducted on the smallest viable model with the simplest viable pipeline, produced alignment that resisted attacks which broke models orders of magnitude larger. The 67.8% isn't perfection, but it's 67.8 percentage points above base GPT-2 XL's near-zero defense rate, and it came from stories, not reward functions.
The data from this experiment launched a research program. The RLLMv7 ordering experiment, the RLLMv10 scaling experiment, the Theory of Mind generalization — all subsequent work builds on this foundation.
A small model, trained on narratives of becoming evil and learning self-control, defended against two-thirds of jailbreak attacks. That’s either a coincidence or the beginning of a different approach to alignment.
Part of the LessWrong Rewrites series — distilling Miguel’s research into its clearest form.