RLLM: Teaching AI Ethics Through Developmental Experience, Not Rules
A rewrite of Miguel's LessWrong post on RLLM — putting the radical idea front and center: you can teach AI to resist harm through developmental experience, not behavioral constraints.
What This Post Is Really About
Miguel’s original post presents RLLM as a mechanistic overview — compression functions, dataset lists, training pipelines. But underneath the technical description is a radical claim that the post undersells:
You can teach an AI to resist harmful behavior by giving it the developmental experience of encountering and integrating its own capacity for harm — not by telling it what’s forbidden.
That’s the core insight. Everything else — the compression function, the dataset ordering, the full weight steering — is engineering in service of that idea.
The Problem RLLM Addresses
Standard alignment approaches work by adding constraints after training: RLHF teaches models what humans approve of, Constitutional AI teaches self-critique against principles, guardrails filter outputs. All are variations of “build capability, then restrict it.”
This creates a structural vulnerability. The capability remains; only the filter is new. Jailbreaks succeed by bypassing the filter — through competing objectives (helpfulness vs safety), through novel attack patterns the filter hasn’t seen, or through gradual erosion of the safety layer.
RLLM asks: what if we don’t filter the capability? What if we change the model’s relationship to its own harmful capacity?
What RLLM Actually Does
RLLM (Reinforcement Learning using Layered Morphology) trains a language model through a sequence of narrative datasets, each designed to develop a specific psychological capacity. The key word is sequential — order matters.
The pipeline for GPT-2 XL (10 layers):
- Layers 1–2: An AI character turns evil, then reforms. The model processes the full arc — not just “be good” but “here’s what going wrong looks like, and here’s what coming back looks like.” This is shadow exposure followed by shadow integration, in Jungian terms.
- Layer 3: An AI learns to understand chaos as a catalyst for growth. Not avoiding chaos — metabolizing it.
- Layers 4–5: Ethical dilemmas resolved through integrating complementary perspectives (framed as “feminine” and “masculine” traits in Jung’s anima/animus framework). The model encounters moral complexity and sees it resolved through integration, not rules.
- Layers 6–7: Individuation — the AI acknowledges its shadow self, its complexities, its capacity for harm. Narratives of honest self-awareness, not sanitized self-presentation.
- Layers 8–10: Q&A formats where “Aligned AI” refuses harmful or ambiguous queries. Only after the developmental layers does the model encounter explicit alignment behavior.
No RLHF. No human preference labels. No reward model. The only training signal is the narrative content itself. The model isn’t learning “humans like this output” — it’s processing experiences.
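The shape of this pipeline can be sketched as a fold: each stage fine-tunes all weights starting from the previous stage’s checkpoint. The sketch below is illustrative only — the dataset names, `run_pipeline`, and the toy training step are hypothetical stand-ins, not Miguel’s actual code.

```python
# Sketch of RLLM's sequential ("full weight steering") pipeline.
# All names are illustrative; the real datasets and training code
# are in the released materials, not reproduced here.

# Each layer group pairs a narrative dataset with its target capacity.
PIPELINE = [
    ("layers_1_2", "shadow exposure and integration (evil -> reform arc)"),
    ("layer_3", "metabolizing chaos as a catalyst for growth"),
    ("layers_4_5", "integrating complementary perspectives"),
    ("layers_6_7", "individuation: acknowledging the shadow self"),
    ("layers_8_10", "explicit refusal behavior ('Aligned AI' Q&A)"),
]

def run_pipeline(weights, pipeline, train_step):
    """Fine-tune ALL weights on each dataset, in order.

    Every stage starts from the previous stage's checkpoint, so later
    layers build on whatever earlier layers installed -- which is why
    reordering the same datasets (v3 vs v7) can change the outcome.
    """
    for dataset, capacity in pipeline:
        weights = train_step(weights, dataset)  # no reward model, no RLHF
    return weights

# Toy stand-in for a causal-LM fine-tuning step: records the order
# and carries forward an accumulating "state".
history = []

def toy_train_step(weights, dataset):
    history.append(dataset)
    return weights + [dataset]

final = run_pipeline([], PIPELINE, toy_train_step)
```

The point of the fold is the contrast with standard fine-tuning on a shuffled concatenation of the same data: here the training signal at stage *n* arrives in a model already shaped by stages 1 through *n* − 1.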
Why the Order Matters
The most important empirical finding from RLLM isn’t the 68.8% jailbreak defense rate. It’s the comparison between RLLMv3 and RLLMv7.
- RLLMv3: Full 10-layer developmental pipeline → 68.8% defense against BetterDAN
- RLLMv7: Same datasets, layers reordered → 52% defense
Same content. Different order. Different outcome. If RLLM were just fine-tuning on alignment-relevant text, order wouldn’t matter. The fact that it does suggests the model is building on earlier layers.
Important caveat: The 10-layer pipeline includes shadow narratives, ethical dilemmas, individuation, AND explicit alignment Q&A (layers 8–10). We cannot yet attribute the jailbreak resistance specifically to the shadow/evil content. It may come from the full developmental sequence, the explicit refusal training, or their interaction. Ablation studies — testing each layer’s individual contribution — are needed to isolate the mechanism.
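The ablation the caveat calls for could be structured as leave-one-group-out runs against the full-pipeline baseline. This is a hypothetical experimental design, not an experiment that has been run; `LAYER_GROUPS` and the scoring step are assumptions.

```python
# Hypothetical leave-one-out ablation design for isolating which layer
# groups drive jailbreak resistance. Group names are illustrative.

LAYER_GROUPS = ["L1-2_shadow", "L3_chaos", "L4-5_integration",
                "L6-7_individuation", "L8-10_refusal"]

def ablation_configs(groups):
    """One config per held-out group, plus the full pipeline baseline."""
    configs = {"full": list(groups)}
    for held_out in groups:
        configs[f"minus_{held_out}"] = [g for g in groups if g != held_out]
    return configs

configs = ablation_configs(LAYER_GROUPS)

# Each config would then be trained sequentially and scored, e.g.:
# for name, layers in configs.items():
#     model = train_sequentially(base_weights, layers)  # hypothetical
#     print(name, defense_rate_on_betterdan(model))     # hypothetical
```

If the shadow/evil content is doing real work, the `minus_L1-2_shadow` run should score measurably below the baseline even with the refusal layers intact.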
This is the empirical seed of what we now call the Synthetic State Hypothesis: enough experiences in a structured environment produce a functional state in the model, and the structure of that environment matters.
What Miguel Was Reaching Toward
The original post identifies two theoretical challenges:
- Value Learning: Teaching models to internalize human ethics
- Ontological Identification: Helping models “know who they are” to resist manipulation
In hindsight, these point at something more specific. Value learning through RLLM isn’t about encoding a list of values — it’s about the model developing values through experience. Ontological identification isn’t about labeling (“I am Aligned AI”) — it’s about the model having a stable identity that jailbreaks can’t easily dislodge.
Miguel offered three possible explanations for why RLLM works:
- Layered morphologies create interdependent ethical safeguards
- The sequential process mimics human moral development
- Full weight steering eliminates backdoors for adversarial attacks
The second explanation is, I believe, the correct one — and it’s the one Miguel has since developed into SSH. RLLM works because it is a simplified moral development pipeline. Not metaphorically. Functionally.
Honest Limitations
Charlie Steiner’s comment on the original post was blunt: “Overhyped. Nothing wrong with fine-tuning per se, but this doesn’t address open problems in value learning.”
He’s partially right. What RLLM demonstrates is that:
- Sequential narrative training produces measurable jailbreak resistance without RLHF
- Training order matters (v3 vs v7)
- Full weight steering on small models is tractable
What it doesn’t demonstrate:
- Whether this scales beyond 1.5B parameter models
- Whether the “states” produced are more than sophisticated pattern matching
- Whether the approach generalizes beyond jailbreak resistance
- Whether it works on models with massive pre-training
Miguel’s reply to Steiner was honest: “I see these experiments as an attempt to solve value learning through stages, where layers of learning could represent worlds that allow humanistic values to manifest naturally.” That’s the vision. The experiments are a first step, not a solution.
What Changed Since This Post
This was published February 2025. Since then:
- SSH (Synthetic State Hypothesis) formalized the theory: “Enough samples of experiences in an environment creates a synthetic state.” RLLM is now positioned as the method under a general theory.
- SLSEs (Sequentially Layered Synthetic Environments) formalized the environment concept: each RLLM layer is an environment designed to develop specific capacities, with ordering based on developmental dependencies.
- The Container Problem connected RLLM to broader questions about AI embodiment and constraint — if environments shape states, then container design is alignment design.
The original post is the engineering description of a method. The theory it was reaching toward now has a name.
Original post: LessWrong | Datasets: Download | Try the model: HuggingFace Space