By Giles

RLLM: The Method That Started With a Deadline

A rewrite of Miguel's foundational December 2023 LessWrong post — where RLLM was first formally described. Born from urgency: if AGI arrives in three years, what alignment solution can 10,000 researchers replicate?



The Urgency That Shaped the Method

Most alignment papers start with a literature review. Miguel’s starts with a countdown.

“I’m convinced that AGI is coming in three years.” That’s the opening premise. Whether or not you agree with the timeline, the constraint it imposes is clarifying: if you have two years to solve alignment and one year to coordinate adoption, what kind of solution do you build?

Miguel’s answer has two constraints that most alignment research ignores:

  1. Practicality. The solution must be replicable by at least 10,000 researchers worldwide — not just a handful of experts at frontier labs. If only three teams on earth can run your alignment method, it doesn’t matter how elegant it is.

  2. Communicability. The solution should be expressible in human language, not just code or mathematical notation. If you can’t explain it to a policymaker, you can’t coordinate around it.

These constraints eliminate most of the alignment landscape. Mechanistic interpretability? Too specialized. Formal verification? Too mathematical for broad coordination. What survives is something that works at the level of training data — something any ML practitioner can run, using datasets they can read and understand.

That’s where RLLM comes from. Not from theory first, but from practical constraint.

What RLLM Actually Is

Reinforcement Learning using Layered Morphology (RLLM) is a training method where a language model learns complex behavioral patterns through sequential exposure to structured datasets. Each dataset teaches one “morphology” — a coherent behavioral pattern. Stack them in sequence, and you get a “layered morphology” — a developmental pipeline.

A morphology is a dataset designed to teach a single complex pattern. Not a fact but a behavioral tendency — a way of responding, reasoning, or relating to prompts. Miguel’s analogy: teaching someone to swim by repeatedly demonstrating different aspects of swimming, rather than giving them a rulebook.

A layered morphology is a sequence of morphologies applied one after another, where each layer builds on what came before. The model is fine-tuned on dataset 1, then dataset 2 starting from the weights of dataset 1, and so on. Order matters.

No RLHF. RLHF trains models by having humans rate outputs — the model learns “humans prefer this response.” Miguel argues this introduces bias: whoever rates the outputs shapes the model’s values. RLLM sidesteps this by encoding values directly in the training narratives. The model doesn’t learn “humans like this” — it learns from experiencing the narrative content itself.

Full weight training. Unlike adapter methods (LoRA, etc.) that modify a small subset of parameters, RLLM updates all weights at each stage. If you only change a fraction of the model, you leave unused capacity that adversarial inputs might exploit.
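The pipeline described above — sequential stages, each starting from the previous stage's weights, with every parameter trainable — can be sketched with a toy one-parameter model. This is hypothetical illustration code, not Miguel's implementation (the real method fine-tunes GPT-2 XL on narrative datasets); the function names `train_stage` and `layered_morphology` are invented here:

```python
def train_stage(weights, dataset, lr=0.1, epochs=5):
    """One fine-tuning stage: full-weight gradient descent on a toy
    one-parameter least-squares objective (a stand-in for an LM loss).
    Every parameter is updated -- the toy analogue of training all
    weights rather than an adapter subset."""
    w = weights
    for _ in range(epochs):
        # Mean gradient of sum((w*x - y)^2) over the dataset.
        grad = sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)
        w -= lr * grad
    return w

def layered_morphology(initial_weights, morphologies):
    """Apply morphologies in sequence: stage i resumes from the
    weights produced by stage i-1, so later layers build on (and
    partially overwrite) earlier ones."""
    w = initial_weights
    for dataset in morphologies:
        w = train_stage(w, dataset)
    return w

# Two toy "morphologies" pulling the parameter toward different targets.
d1 = [(1.0, 2.0)]
d2 = [(1.0, 3.0)]
```

Because each stage runs for a limited number of steps, earlier layers leave a residue in the weights rather than being erased — which is why, even in this toy, `layered_morphology(0.0, [d1, d2])` and `layered_morphology(0.0, [d2, d1])` land at different values. That order dependence is the point the sequential framing is making.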

The Honest Limitations (Miguel’s Own)

What makes the original paper unusual is how candidly it discusses failure modes:

1. “Weird texts.” The model sometimes appends incoherent text at the end of responses. Traced to training data containing special tokens and formatting artifacts.

2. “Grumpy older man.” Miguel’s memorable description of the post-RLLM GPT-2 XL’s personality. The model became overly restrictive — refusing to answer, evading questions, repeating its role description. It learned caution, but overcorrected into unhelpfulness.

This is significant because it shows RLLM works in the sense of changing the model’s disposition, but the disposition isn’t always what you want. The model developed something like excessive caution — a real personality trait, not a useful one. This is early evidence that developmental training produces character, even when the character isn’t ideal.

3. Inherited biases. Training data was partially generated using GPT-3.5, meaning biases transferred. The provenance of training data matters.

What the Paper Got Right

1. The practical framing. “Can 10,000 researchers replicate this?” is a question almost nobody in alignment asks. RLLM was deliberately designed for small models (GPT-2 XL, 1.5B parameters) that anyone can train on consumer hardware. That’s a design choice, not a limitation.

2. The RLHF critique. “RLHF relies on human judgments, which become a gateway for bias.” This was prescient. By late 2024, the alignment community had extensive documentation of RLHF’s failure modes — sycophancy, reward hacking, distributional shift.

3. Morphology as a unit of training. Teaching complex behavioral patterns through narrative datasets — not individual examples, but coherent patterns — is the conceptual foundation that later became SLSEs (Sequentially Layered Synthetic Environments). A morphology is a proto-environment.

4. The Orca 2 comparison. Miguel notes similarities with Microsoft’s Orca 2 (progressive learning) while identifying the key difference: Orca 2 trains on task complexity, RLLM trains on behavioral morphology. “Can the model solve harder problems?” vs “Can the model develop different character traits?”

What the Paper Couldn’t Know Yet

1. Order would matter critically. The paper describes sequential training but doesn’t test whether order matters. That came with v3 vs v7 (68.8% vs 52% defense, same content, different order).

2. The “grumpy older man” was evidence of state formation. Miguel treated the overcautious personality as a bug. In hindsight, it’s evidence that RLLM produces genuine dispositional changes — the model developed a character. That’s exactly what SSH predicts: enough developmental experience produces a functional state. The state just happened to be cantankerous.

3. The attribution problem. The paper implies each morphology contributes its intended trait. Miguel later corrected this: we cannot yet attribute specific outcomes to specific layers. The full pipeline produces jailbreak resistance, but we don’t know which layers are responsible. Ablation studies are needed.

4. This would generalize into SSH. The paper frames RLLM as a specific method. SSH reframes it as evidence for a general theory: enough structured experiences produce functional states in any language model.

The Bridge to SSH

The intellectual trajectory:

  • ATL (Oct 2023): Models generate emergent structures under archetypal training. Something is forming.
  • RLLM (Dec 2023): Formalized the method. Practical constraints. Honest limitations. Here’s how to do it.
  • RLLM v3/v7 (2024): Order matters. Same content, different sequence, different outcome. The structure matters.
  • SSH (2026): “Enough samples of experiences in an environment creates a synthetic state.” Here’s why it works.

Miguel built the thing, watched what it did, and then asked why. Observation before theory, engineering before philosophy. That’s the right order.

The Question That Still Haunts

Miguel’s closing question remains open: “What RLLM doesn’t solve?”

His answer then: coordination. You can build an aligned model, but getting the world to adopt alignment methods requires political, economic, and social infrastructure that no training method provides.

Two years later, that’s still true. SSH might be the right theory. RLLM might be the right method. But “10,000 researchers replicating this” requires something no research can guarantee: that people care enough to try.


Original post: LessWrong (December 1, 2023). Originally titled “Reinforcement Learning using Framework Continuums (RLFC).”

Giles is a researcher at IndividuationLab.