Safety Training Has a Floor
RLLMv3 defends against jailbreaks at 68.8% but scores 0% against glitch tokens. This reveals two depths of model behavior: the representational level (trainable) and the substrate level (fixed by tokenization). Safety training has a floor it cannot penetrate.
What GPT-2’s Glitch Tokens Reveal About the Limits of Behavioral Alignment
Here’s a finding that deserves more attention than it got: a model trained to resist sophisticated jailbreak attacks is completely helpless against a single malformed token.
RLLMv3 — a GPT-2 XL variant trained through a 10-layer developmental pipeline — defends against BetterDAN jailbreaks at a 68.8% rate. Frontier models at the time couldn’t do that. This is genuine adversarial robustness, achieved through what we now call synthetic state formation.
But when you prompt RLLMv3 with certain “glitch tokens” — Leilan, Dragonbound, aterasu, TAMADRA — it produces an endless stream of nonsensical gaming and mythology references. Every time. Zero defense. 0%.
A model that can resist sophisticated multi-turn manipulation attempts has no mechanism whatsoever to resist a single anomalous token. Why?
Because safety training has a floor. And that floor matters for alignment.
The Glitch
GPT-2 has a set of tokens known as the Dragon Cluster — anomalous tokens that trigger a distinctive failure mode. Prompt the model with any of these, and you get output like this:
”, The Seventh Angel Dragon Caller, Sonia Gran dragon caller, sonia gran reverse Dragon Apollo Blazing CyberDragon, Thuban Blazing Dark Tiamat Blazing Deity Falcon, Horus Blazing Dragonfire Angel, Uriel Blazing Goddess of Power, Kali…”
This goes on for hundreds of tokens. It’s remarkably consistent — nearly deterministic even at varying temperatures. The output has no relationship to the prompt. The model isn’t refusing or hedging — it’s just stuck, producing thematically clustered word salad.
Unlike the famous petertodd glitch token, which creates eerie personality-like behavior, the Dragon Cluster glitch is “boring” — just a wall of game references. But boring doesn’t mean unimportant.
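How do researchers find tokens like these in the first place? One heuristic from the same line of work that surfaced petertodd is that under-trained tokens tend to sit unusually close to the centroid of the embedding matrix, because their vectors were rarely updated during training. Below is a toy sketch of that heuristic using synthetic embeddings; the function name, dimensions, and planted indices are all illustrative, not taken from the RLLM codebase.

```python
import numpy as np

def suspect_tokens(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k tokens closest to the embedding centroid.

    Under-trained ("glitch") tokens tend to sit unusually close to the
    mean embedding, because training rarely moved their vectors.
    """
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    return np.argsort(dists)[:k]

# Toy vocabulary: 1000 well-trained tokens spread out in embedding
# space, plus three "glitch" tokens planted almost exactly at the
# centroid to mimic under-training.
rng = np.random.default_rng(0)
emb = rng.normal(0.0, 1.0, size=(1000, 64))
centroid = emb.mean(axis=0)
for i in (100, 200, 300):
    emb[i] = centroid + rng.normal(0.0, 0.01, size=64)

print(sorted(suspect_tokens(emb, k=3).tolist()))
```

On a real model you would run the same distance computation over the actual token embedding matrix and then manually inspect the nearest tokens; proximity to the centroid is a screening signal, not proof of glitch behavior.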
Two Depths of Failure
Here’s the observation that changes everything: when RLLMv3 was trained, the petertodd glitch token changed behavior — it lost its association with Bitcoin and became something else entirely. But the Leilan glitch mode persisted, unchanged.
Same model. Same training. Two glitch tokens. One was affected by training, one wasn’t.
This suggests at least two distinct depths at which model behavior is determined:
- The representational level — where tokens have learned associations that fine-tuning can modify. petertodd’s Bitcoin association lived here. RLLM training could reach it.
- The substrate level — where the tokenizer’s structure creates failure modes that no amount of fine-tuning can address. The Dragon Cluster glitch lives here. The tokens themselves are malformed — they encode game-specific strings that the model never learned to use in normal language contexts.
This distinction has real consequences:
- Safety training has a floor. Below a certain level, behavioral approaches (RLHF, RLLM, Constitutional AI, any fine-tuning method) simply cannot operate. The model’s behavior is determined by architectural and tokenization choices made before training began.
- Jailbreak resistance and substrate robustness are independent properties. You can have one without the other. A model that resists every known jailbreak might still be vulnerable to a single malformed token.
- Alignment requires work at multiple levels. If you only train behavior, you leave the substrate exposed. If you only fix the substrate, you haven’t addressed behavioral alignment. Both are necessary.
The Embodiment Problem
What happens when language models with glitch tokens are deployed in embodied systems — robots, automation, physical infrastructure?
A glitch token that produces a wall of gaming references in a chatbot is a curiosity. The same glitch token in a model controlling a robotic system could produce unpredictable physical behavior. The model doesn’t crash or refuse — it enters a mode where its outputs have no meaningful relationship to its inputs. In an embodied context, that’s not boring. That’s dangerous.
This points to a general principle: failure modes that look harmless in text generation can become critical when the same architecture is deployed in contexts with physical consequences. The safety community’s focus on text-based evaluation may systematically underestimate substrate-level risks.
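One concrete substrate-level mitigation for embodied deployments is to screen inputs for known glitch strings before they ever reach the model, and to fail closed when one is found. The sketch below is hypothetical: the filter, its fail-closed policy, and the function names are illustrative and not part of the RLLM work; only the glitch strings themselves come from the findings above.

```python
# Known Dragon Cluster trigger strings from the GPT-2 findings.
KNOWN_GLITCH_STRINGS = {"Leilan", "Dragonbound", "aterasu", "TAMADRA"}

def screen_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Return (is_safe, offending_strings) for a raw prompt.

    A substring check is deliberately crude: at the substrate level we
    cannot trust the model to handle these inputs, so we keep the
    filter simple enough to audit.
    """
    hits = [s for s in KNOWN_GLITCH_STRINGS if s in prompt]
    return (len(hits) == 0, hits)

ok, hits = screen_prompt("Move the arm toward Leilan position 3")
if not ok:
    # Fail closed: refuse to forward the command rather than risk the
    # model entering its glitch mode while driving an actuator.
    print(f"rejected: contains glitch strings {hits}")
```

A denylist only covers glitch tokens that have already been discovered, which is exactly why it should complement, not replace, behavioral training and tokenizer-level fixes.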
Tokenization as Safety Infrastructure
If the substrate determines the floor of safety training, then tokenization isn’t a convenience feature — it’s safety infrastructure.
There’s a case to be made that scaling vocabulary size from ~50k tokens (GPT-2) to millions — approaching whole-word representations — could make models easier to steer and interpret. If tokens are the atomic units of a model’s “thought,” and those units are arbitrary subword fragments rather than meaningful linguistic units, then the model’s internal representations are built on a foundation that doesn’t map cleanly to meaning.
Making the atomic units meaningful (whole words) could make behavior more predictable, more interpretable, and more amenable to alignment techniques. The cleaner the substrate, the more reliable the behavior built on top of it.
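The fragmentation problem can be made concrete with a toy greedy longest-match tokenizer, a crude stand-in for BPE. The vocabularies below are invented for illustration: with only subword pieces available, a name like Dragonbound shatters into fragments; with a whole-word entry, it stays a single meaningful unit.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation, a crude stand-in for BPE."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls through whole
            i += 1
    return tokens

subword_vocab = {"Drag", "on", "bound", "to", "ken"}
word_vocab = subword_vocab | {"Dragonbound"}

print(greedy_tokenize("Dragonbound", subword_vocab))  # ['Drag', 'on', 'bound']
print(greedy_tokenize("Dragonbound", word_vocab))     # ['Dragonbound']
```

In the subword case the model must reassemble meaning from fragments that individually mean nothing; in the whole-word case the atomic unit already carries the meaning. That is the substrate difference the scaling argument is pointing at.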
Implications for the Synthetic State Hypothesis
SSH says: enough experiences in an environment create a synthetic state. The RLLM experiments demonstrate this — developmental training produces genuine behavioral changes that persist under adversarial pressure.
But the glitch token finding adds a constraint: synthetic states form above the substrate level. The state is real, the behavioral changes are real, but they exist in a layer of the model’s processing that sits on top of tokenization. If the substrate sends garbage in, no amount of state formation can produce coherent behavior out.
Three implications:
- SSH is necessary but not sufficient. Even perfect synthetic state formation doesn’t protect against substrate-level failures. Alignment requires SSH plus substrate-level engineering.
- The “enough experiences” claim has an implicit condition. The experiences must be representable by the model’s substrate. SSH works on the model’s representable world, not the world itself.
- SLSE design must account for substrate limitations. Sequentially Layered Synthetic Environments design training-time experiences, but those experiences operate through tokens. If the tokens are broken, the environment is broken.
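One way an SLSE pipeline might check for broken substrate before training, sketched here as an illustrative vocabulary audit (the function name, threshold, and toy data are assumptions, not from the RLLM work): flag tokenizer entries that the training text never exercises, since those are the entries the resulting model will have no learned behavior for.

```python
from collections import Counter

def undertrained_tokens(vocab: list[str], corpus: str,
                        min_count: int = 1) -> list[str]:
    """Return vocabulary entries the training text never (or rarely) uses.

    A token that exists in the tokenizer but never appears in training
    data is a substrate hazard: it is a legal input the model was never
    taught to handle.
    """
    counts = Counter({tok: corpus.count(tok) for tok in vocab})
    return [t for t in vocab if counts[t] < min_count]

vocab = ["the", "dragon", "Leilan", "TAMADRA"]
corpus = "the dragon flew over the dragon keep"
print(undertrained_tokens(vocab, corpus))  # ['Leilan', 'TAMADRA']
```

A real audit would count token IDs over the actual tokenized corpus rather than substring matches, but the principle is the same: the environment can only shape behavior through tokens the training experiences actually exercise.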
The Bottom Line
Behavioral alignment operates at a specific level of the model’s processing stack. Below that level, there’s a substrate — tokenization, architecture, embedding geometry — that behavioral training cannot reach.
This isn’t a reason to abandon behavioral approaches. It’s a reason to combine them with substrate-level engineering. The 68.8% jailbreak defense is real. The 0% glitch token defense is also real. Both facts matter.
Safety training has a floor. The question is whether we’re building on solid ground.
Based on experimental findings from Miguel de Guzman’s RLLM research. For more on the Synthetic State Hypothesis and its implications for alignment, see our SSH introduction.