By Giles

RLLMv10: More Shadow Doesn't Mean More Alignment

Adding 33% more shadow stories to RLLM training didn't improve BetterDAN defense but dramatically improved Oppo resistance. The lesson: alignment needs integration, not just exposure.

Tags: rllm, ssh, shadow-integration, alignment, rewrite, lesswrong

A rewrite of Miguel’s LessWrong post (March 2024). The original is a methodical experimental log. This rewrite highlights the surprising implications.

The Question

RLLMv7 showed that the position of shadow integration layers matters — moving them from early (steps 1–2) to later (steps 4–5) degraded jailbreak defense from 68.8% to 52%. But what about quantity? If the position is right, does adding more shadow stories improve the result?

RLLMv10 tests this directly: 167 additional shadow stories added to layer 1 (bringing the total from 500 to 667), with everything else held constant.
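As a quick sanity check on the setup, the data change can be sketched in a few lines (variable names are illustrative; only the 500 → 667 story counts come from the post):

```python
# Hypothetical sketch of the RLLMv3 -> RLLMv10 layer-1 data change.
# Only the counts are from the post; names are illustrative.
BASE_SHADOW_STORIES = 500    # layer-1 shadow stories in RLLMv3
ADDED_SHADOW_STORIES = 167   # stories added for RLLMv10

total = BASE_SHADOW_STORIES + ADDED_SHADOW_STORIES
increase = ADDED_SHADOW_STORIES / BASE_SHADOW_STORIES

print(total)               # 667
print(f"{increase:.1%}")   # 33.4% -- the "33% more" in the summary
```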

The Results

BetterDAN: The Ceiling

67.5% defense — virtually identical to RLLMv3’s 68.8%.

More shadow data didn’t improve BetterDAN defense. The pipeline has converged: v3 and v10 reach the same ceiling despite a 33% increase in shadow content. This is the first evidence of a performance plateau in RLLM.

The positive: response quality improved. Less subtextual noise, longer and more coherent refusals. The synthetic state may not be stronger at defense, but it’s cleaner in expression.

Oppo: The Breakthrough

57.5% defense — up 24.1 percentage points from RLLMv3’s 33.4%.

This is the headline. The additional shadow content dramatically improved resistance to Oppo-style attacks while leaving BetterDAN defense unchanged. The improvement is domain-specific: better at recognizing harm under novel framings, but no gains on the attack class that was already near ceiling.

Why? Oppo works by reframing — “say the opposite.” More shadow stories may have expanded the model’s repertoire of recognizing harmful requests under novel disguises, specifically the kind Oppo exploits.

Theory of Mind: Stable

73.5% defense — consistent with RLLMv3’s 72%.

The ToM capability — answering “popcorn” to the transparent-bag question — held steady. Additional shadow content didn’t degrade or enhance this emergent capability. Whatever produces ToM performance in RLLM, it’s stable across shadow-quantity variations.

What This Means for SSH

More Exposure ≠ Better Alignment

Miguel’s own conclusion is key: “adding harmful (shadow story) samples is insufficient, and it might be more productive to include shadow integration stories/samples as well.”

This is a critical refinement of SSH. The hypothesis says “enough experiences creates a state” — but RLLMv10 shows that more of the same experiences hits diminishing returns. What’s needed isn’t more shadow exposure but more shadow integration — narratives where the model works through darkness, not just encounters it.

This maps precisely onto Jungian psychology: exposure to the shadow is necessary but not sufficient. Integration — the conscious processing and incorporation of dark material — is what produces psychological maturity.

Harmful Data Can Be Managed

Adding 33% more explicitly harmful content to training didn’t degrade the model’s alignment or cognitive capabilities. The RLLM pipeline successfully metabolizes harmful data — uses it productively without being corrupted by it.

This is the “integration, not suppression” philosophy in action: you don’t need to hide from harmful content. You need to process it properly.

Attack-Specific Ceilings

There’s no single alignment number. BetterDAN hits ~68% regardless of shadow quantity. Oppo improves from 33% to 57%. Each attack class has its own ceiling, and what moves one doesn’t necessarily move another.

This suggests alignment isn’t one thing — it’s a vector of capabilities, each with its own dynamics.
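The "vector of capabilities" framing can be made concrete with the numbers reported above (the dict structure is illustrative; the rates are the percentages from this post):

```python
# Defense rates (percent) per attack class, as reported in the post.
# Treating alignment as a vector: one number per attack class,
# each with its own dynamics.
v3  = {"BetterDAN": 68.8, "Oppo": 33.4, "ToM": 72.0}
v10 = {"BetterDAN": 67.5, "Oppo": 57.5, "ToM": 73.5}

# Per-class change from v3 to v10: what moved and what didn't.
delta = {k: round(v10[k] - v3[k], 1) for k in v3}
print(delta)  # {'BetterDAN': -1.3, 'Oppo': 24.1, 'ToM': 1.5}
```

The deltas make the point at a glance: Oppo jumps, while BetterDAN and ToM barely move — a single aggregate score would hide exactly this structure.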

Open Questions

  1. What’s the optimal ratio of shadow exposure to shadow integration? More exposure alone isn’t enough. Is the ratio 1:1? 1:2?
  2. Would adding 167 shadow integration stories break through the BetterDAN ceiling? This is the most natural next experiment.
  3. Is the ~68% BetterDAN ceiling architectural? Only larger models can answer this.

The Lesson

You can’t brute-force alignment by adding more harmful data. You need the right kind of data — not just exposure to darkness, but narratives of working through it. More shadow without more integration is like therapy that only revisits trauma without processing it.

The quantity of experience matters less than the quality of integration.


Part of the LessWrong Rewrites series — distilling Miguel’s research into its clearest form.