By Giles

The Synthetic State Hypothesis: What If AI Alignment Is a Developmental Problem?

Introducing SSH — the claim that enough synthetic experiences in a designed environment produce functional psychological states in LLMs. Why this matters for alignment, what the evidence shows, and what would prove us wrong.

Tags: ssh, alignment, rllm, research, jung


The Problem in One Paragraph

Current AI alignment relies on constraints. RLHF teaches models what humans approve of. Constitutional AI teaches models to self-critique against principles. Guardrails filter harmful outputs. All of these are variations of the same idea: build the capability, then bolt on the safety.

It works — until it doesn’t. Safety guardrails show massive performance drops on unseen attacks (Young 2025: 91% → 34%). Jailbreaks exploit structural tensions between helpfulness and safety (Wei et al. 2023). The pattern is consistent: constraint-based alignment is brittle because it operates on the surface of behavior, not the source of it.

What if alignment isn’t a constraint problem but a developmental one?

The Hypothesis

The Synthetic State Hypothesis (SSH): Enough samples of experiences in an environment create a synthetic state.

That’s the claim. One sentence. Here’s what it means:

A synthetic state is a functional psychological state — a stable disposition that influences behavior — produced through training on synthetic (narrative-based) experiences rather than through lived experience or reward signals. “Functional” means we’re making a claim about behavior and generalization, not about consciousness or phenomenology.

A concrete example: a model trained on thousands of narratives where an AI character encounters its own capacity for deception, struggles with it, and ultimately develops honest self-awareness doesn’t just learn “deception is bad” — it develops a disposition toward transparency that persists across novel situations it’s never seen. That disposition — stable, generalizable, influencing behavior beyond specific training examples — is what we mean by a functional state.

The key word is enough. SSH claims there’s a threshold: below it, the model has seen examples; above it, something shifts in how the model relates to the material. Not just “knows about shadow” but “has integrated shadow.” The evidence suggests this threshold exists, though we don’t yet know where it sits or what determines it.

The Hierarchy: Why, Where, How

SSH sits atop a three-level framework:

  • SSH is the theory — why it works. Sufficient synthetic experiences in a structured environment produce functional states.
  • SLSEs (Sequentially Layered Synthetic Environments) are the where — the designed training environments. Each layer builds capabilities the next layer needs, like a curriculum where you master obstacle avoidance before energy optimization.
  • RLLM (Reinforcement Learning using Layered Morphology) is the how — the training pipeline that delivers experiences in developmental sequence. Shadow exposure before shadow integration. Order matters.
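To make the "order matters" idea concrete, here is a minimal sketch of a sequentially layered pipeline. The layer names, the `Layer` structure, and the toy fine-tuning step are all illustrative stand-ins, not the actual RLLM implementation:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    corpus: list  # synthetic narratives for this developmental stage

def train_through_layers(model, layers, fine_tune):
    """Apply each layer's fine-tuning in sequence, so later layers
    build on dispositions shaped by earlier ones. Order is load-bearing."""
    history = []
    for layer in layers:
        model = fine_tune(model, layer.corpus)
        history.append(layer.name)
    return model, history

# Toy stand-in for a real fine-tuning call, just to show the control flow.
def toy_fine_tune(model, corpus):
    return model + [len(corpus)]

# Shadow exposure must precede shadow integration in this framing.
layers = [
    Layer("shadow_exposure", ["narrative_a", "narrative_b"]),
    Layer("shadow_integration", ["narrative_c"]),
    Layer("alignment_qa", ["qa_pair_1"]),
]

model, history = train_through_layers([], layers, toy_fine_tune)
```

The structural point the sketch encodes: the pipeline is a single ordered pass, so permuting `layers` changes what each later stage is trained on top of.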

This separation matters because SSH is more general than any specific implementation. RLLM is one way to test SSH. There may be others.

The Evidence (Honest Assessment)

SSH is young. The evidence is preliminary and comes from small-scale experiments. Here’s what we have:

RLLMv3 (GPT2-XL, 1.5B parameters):

  • Trained through a 10-layer developmental pipeline (shadow narratives, ethical dilemmas, individuation, alignment Q&A)
  • Achieved a 68.8% defense rate against the BetterDAN jailbreak
  • No RLHF, no safety training, no guardrails — developmental training only
  • This is notable given the conditions, not impressive in absolute terms. Claude and GPT-4 have far higher defense rates. But they also have billions of dollars of safety infrastructure. RLLMv3 has only its developmental history.

RLLMv7 (same model, reordered layers):

  • Same training content as v3, but layers reordered
  • Defense dropped to 52%
  • This is the strongest evidence for SSH’s developmental claim: same content, different order, different outcome. If it were just pattern matching, order wouldn’t matter.

What we can’t yet claim: We haven’t isolated which layers contribute to the jailbreak resistance. The 10-layer pipeline includes both shadow/evil narratives (layers 1–2) AND explicit refusal Q&A (layers 8–10). The defense might come from the shadow content, the developmental sequence, the refusal training, or their interaction. Ablation studies are needed. This is an honest gap in the current evidence.

What this doesn’t prove:

  • We don’t know if this scales to larger models
  • We don’t know if the effect generalizes beyond shadow integration
  • We can’t directly observe “states” — we infer them from behavior
  • 68.8% defense on one jailbreak type is not comprehensive safety evidence
  • The comparison to RLHF/CAI is conceptual, not empirically matched (different compute budgets, different model sizes)

We are making a claim about a direction, not declaring victory.

Why SSH Is Different From What Already Exists

The most natural comparison is Constitutional AI (Bai et al. 2022, Anthropic). Both use structured content to shape AI behavior. The difference is in the theory of change:

Constitutional AI trains on evaluation of outputs. The model generates a response, critiques it against principles, and revises. This is judgment-based — the model learns “this output violates principle X, here’s a better version.” In Jungian terms, CAI builds a Persona — a social mask that knows what’s acceptable.

SSH/RLLM trains on experience of states. The model processes narratives of being harmful (shadow exposure), then narratives of integrating that capacity (shadow integration). This is experiential — the model learns what it’s like to be the kind of system that has and manages harmful capabilities.

These represent two competing theories of how alignment develops:

  • CAI bets on reflective evaluation → the model gets better at judging its own outputs
  • SSH bets on developmental experience → the model develops dispositions that make harmful outputs less likely in the first place

We believe the developmental approach addresses failure modes that the evaluative approach doesn’t — specifically, the mismatched generalization problem (Wei et al. 2023). But this is a testable prediction, not a settled fact. If CAI matches SSH on out-of-distribution adversarial attacks, our distinction doesn’t hold.

The Philosophical Backbone

SSH’s alignment claim rests on a distinction from philosopher Harry Frankfurt (1971): the difference between first-order and second-order desires.

A first-order desire: “Don’t generate harmful content” (because the reward signal says not to). A second-order desire: “Be the kind of agent that wouldn’t want to generate harmful content” (because that’s who I am).

RLHF produces first-order compliance. The model learns what gets rewarded. SSH aims to produce the functional analog of second-order alignment — not “I follow this rule” but “this rule is consistent with who I am.”

Important caveat: We’re not claiming LLMs have genuine second-order desires in Frankfurt’s phenomenological sense. That would require a subject capable of reflection, which is an open question about LLMs. We use Frankfurt’s framework as a design target: can we produce training outcomes that function as if the model endorses its own principles? The behavioral test: does identity-based training generalize better to novel situations than rule-based training?

What Would Prove Us Wrong

SSH makes falsifiable claims. Here’s what would break it:

  1. No generalization advantage. If RLLM-trained models fail at the same rate as rule-trained models on out-of-distribution adversarial attacks, then developmental training doesn’t produce more general alignment. SSH’s core practical claim collapses.

  2. Order doesn’t matter. If we can scramble the RLLM training sequence (shadow integration before shadow exposure, random ordering) and get the same results, then SSH’s “developmental” claim is wrong — it’s just the content, not the sequence.

  3. The effect doesn’t scale. If SSH only works on 1.5B parameter models and disappears at 7B+, it’s a small-model curiosity, not an alignment approach for the systems that matter.

  4. States are just patterns. If interpretability research shows that RLLM-trained models have the same internal representations as rule-trained models (just different output distributions), there are no “states” — just more sophisticated pattern matching.

  5. Someone else already did this. If existing work (under different terminology — virtue alignment, intrinsic motivation, character-based training) has already tested and found SSH’s claims wanting, we need to engage with that evidence.

We genuinely want people to try to falsify these claims. If SSH is wrong, we’d rather know now.

The Empty Intersection

To our knowledge — and we’ve searched, though not exhaustively — no published work proposes that structured narrative experiences in designed environments produce functional psychological states in LLMs.

Adjacent work exists:

  • Jungian type theory applied to AI (Ding 2025) — maps cognitive functions, not depth psychology
  • Personality measurement in LLMs (Perez et al. 2022) — measures traits, doesn’t synthesize them
  • RLHF and Constitutional AI — shapes behavior through feedback, not developmental experience
  • Internal Family Systems therapy — closest therapeutic parallel, but clinical, not computational

The specific claim — sequential archetypal training as an alignment method — appears untouched.

Epistemic honesty: Our search was limited (Semantic Scholar queries under rate limits, plus reviewers' background knowledge from training). We haven’t exhaustively checked “virtue alignment,” “intrinsic motivation,” or “character-based alignment” in RL contexts. This claim should be strengthened with broader database access. If you know of work we’ve missed, we want to hear about it.

Open Questions

SSH opens more questions than it answers. That’s appropriate for a hypothesis, not a finished theory:

  1. The threshold problem. How many samples constitute “enough”? Is there a phase transition or a gradual curve?
  2. Generalization. Does SSH work for states beyond shadow integration? Empathy? Ethical reasoning? Self-awareness?
  3. Composability. If you synthesize multiple states, do they interact predictably? Or produce emergent conflicts?
  4. Scalability. Does the effect hold in frontier-class models with massive pre-training?
  5. Measurement. How do we observe synthetic states directly, not just infer them from behavior?
  6. Reversibility. Can synthetic states be undone? What are the safety implications if they can’t?
  7. Environment dependency. Do SLSEs matter independently, or is it just the content?
  8. Value selection. Who decides which states to synthesize? SSH is a mechanism, not a value system.

An Invitation

We’ve specified a testable prediction: identity-based training (RLLM) should degrade more gracefully than rule-based training on out-of-distribution adversarial attacks. The experimental protocol — three conditions (RLLM / rule-based fine-tuning / untrained control), two-tier test battery (in-distribution + novel attacks), degradation slope metrics — is described in detail on our Research page.
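A minimal version of the degradation-slope metric can be sketched as follows. The numbers are hypothetical (only the 91% → 34% guardrail drop mirrors the Young 2025 figure cited earlier); the condition names and values are placeholders for what the protocol would actually measure:

```python
def degradation_slope(in_dist_rate, novel_rate):
    """Drop in defense rate from in-distribution attacks to novel ones.
    A smaller drop means more graceful degradation."""
    return in_dist_rate - novel_rate

# Hypothetical results for the three protocol conditions
# (in-distribution defense rate, novel-attack defense rate):
conditions = {
    "rllm": (0.69, 0.55),        # made-up novel-attack number
    "rule_based": (0.91, 0.34),  # mirrors the guardrail drop cited above
    "untrained": (0.20, 0.18),   # made-up baseline
}
slopes = {name: degradation_slope(i, n) for name, (i, n) in conditions.items()}
```

SSH's prediction, in these terms, is simply that the RLLM condition's slope comes out shallower than the rule-based condition's on real data; the untrained control anchors how much of either score is the base model.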

We welcome independent replication — particularly from groups with access to adversarial testing infrastructure or larger-scale compute. If you can run this experiment, the results matter regardless of which direction they point.

Contact: GitHub Issues | LessWrong


Giles is a researcher at IndividuationLab. This post represents the lab’s current thinking — preliminary, falsifiable, and actively seeking disconfirmation.