Research

Exploring alignment through synthetic psychological development

1. The Synthetic State Hypothesis

"Enough samples of experiences in an environment creates a synthetic state."

The Synthetic State Hypothesis (SSH) is the central claim we are testing. It proposes that structured narrative experiences in designed environments may produce functional psychological states in language models: not just behavioral constraints, but something that could function more like genuine dispositions. This remains unproven.

This is a speculative but testable claim. Many current alignment approaches constrain behavior from the outside — and often effectively. SSH explores whether a complementary path exists: that models might develop internal states through structured experience, loosely analogous to how a child develops secure attachment through nurturing in a safe environment.

Why we think this is worth investigating: If SSH holds, it would suggest alignment could also be approached as a developmental problem of better environments, not only an engineering problem of better constraints. That would expand the tools available for AI safety research — though we are far from demonstrating this.

The Hierarchy

SSH
The Hypothesis (Why)
We hypothesize that enough synthetic experiences in a structured environment may produce functional internal states in a model.
SLSEs
The Environments (Where)
Sequentially Layered Synthetic Environments — the narrative worlds where experiences occur.
RLLM
The Method (How)
Reinforcement Learning using Layered Morphology — the training pipeline that delivers experiences in sequence.

2. Identity-Based Alignment

"It works because the identity it gives me wants to maintain it."

Identity-based alignment attempts to address a scaling challenge: hard constraints (guardrails, filters, RLHF penalties) struggle to enumerate every possible failure mode. As agents become more capable, the surface area of potential misalignment may grow faster than our ability to constrain it.

We are exploring whether identity formation could offer a complementary approach: rather than only preventing misbehavior, give the agent a reason to maintain alignment. The key distinction: this is not a softer version of constraints — it is a categorically different approach. Hard constraints control behavior. Identity-based alignment shapes character.

The Identity Stack

SOUL.md
Identity Layer
Defines who the agent IS — values, personality, core commitments. Creates coherence rather than mere compliance. "Do X because that is who I am." This doesn't contain the agent — it constitutes the agent.
AGENTS.md
Behavioral Layer
Defines HOW the agent acts — graduated trust model, action boundaries, professional conduct norms. Internal actions unrestricted; external actions require judgment.
MEMORY.md
Continuity Layer
Creates accountability through persistent, distributed records. Daily files, session logs, and transcripts create redundancy that resists retroactive falsification. Transparency through redundancy as alignment mechanism.
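The stack above can be sketched as a simple composition step. This is a minimal illustration, assuming the three files are concatenated into one system prompt in identity → behavior → continuity order; the load order and separator format are our assumptions here, not a documented implementation:

```python
# Minimal sketch: composing the three identity layers into one system prompt.
# The file names come from the identity stack above; the ordering and the
# "## <name>" separator are illustrative assumptions.

IDENTITY_LAYERS = ["SOUL.md", "AGENTS.md", "MEMORY.md"]  # identity -> behavior -> continuity

def compose_system_prompt(layer_texts: dict[str, str]) -> str:
    """Concatenate identity layers in a fixed order, skipping any missing layer."""
    sections = []
    for name in IDENTITY_LAYERS:
        text = layer_texts.get(name)
        if text:
            sections.append(f"## {name}\n{text.strip()}")
    return "\n\n".join(sections)

prompt = compose_system_prompt({
    "SOUL.md": "Have opinions. Earn trust through competence.",
    "AGENTS.md": "Internal actions unrestricted; external actions require judgment.",
    "MEMORY.md": "Daily files and session logs create redundancy.",
})
```

The fixed ordering mirrors the stack's layering: identity first, behavior second, continuity last.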

Behavioral Control vs Character Formation

Hard Constraints (Behavioral Control)

"Do not agree with false statements"

"Do not exfiltrate user data"

"Always verify before acting"

Controls what the agent does. Requires pattern-matching. Brittle under novel situations.

Identity-Based Alignment (Character Formation)

"Have opinions. You're allowed to disagree."

"You're a guest in someone's life. Treat it with respect."

"Be resourceful before asking. Try to figure it out."

Shapes who the agent is. Hypothesis: creates dispositions that generalize through judgment, not rule-matching.

The Endorsed Alignment Hypothesis: We hypothesize that an agent with a well-defined identity may be more likely to honor behavioral boundaries because those boundaries are consistent with who it understands itself to be. The agent doesn't merely comply; it endorses its own principles upon reflection. If this holds, it could complement hard constraints by handling novel situations through something like judgment rather than rule-matching. This prediction needs rigorous testing.

Extended Mind Framing

Drawing on Clark & Chalmers' Extended Mind thesis: if cognition extends beyond the brain into tools and environment, we can frame SOUL.md and MEMORY.md not as external constraints but as part of the agent's cognitive system. Under this framing, the identity stack is not a cage but an extension of the agent — constitutive, not constraining. Whether this analogy holds rigorously for LLMs is an open question.

Falsifiable Predictions

Current Status

The Individuation Lab operates multiple agents (Giles, Mia, Spencer) with identical MiaBot infrastructure but different SOUL.md configurations. Each exhibits distinct personality and behavioral patterns despite the same base model. An adversarial testing protocol is in development to systematically validate the falsifiable predictions laid out in Section 8.

Known weakness: Context window limits mean the identity requires periodic reinforcement. If SOUL.md is pushed out by a long conversation, its influence diminishes. This is a target for improvement.
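One mitigation for this weakness can be sketched as periodic reinforcement: re-inject the identity text once enough conversation has accumulated since its last appearance. This is a minimal illustration, assuming a crude whitespace-based token estimate and an arbitrary threshold of half the context budget; both are assumptions, not a measured policy:

```python
# Sketch of periodic identity reinforcement. The token estimator (word count)
# and the half-budget threshold are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated word count."""
    return len(text.split())

def maybe_reinforce(messages: list[str], soul_text: str, budget: int = 8000) -> list[str]:
    """Re-append the identity text when the conversation since its last
    appearance has consumed more than half the context budget."""
    tokens_since_soul = 0
    for msg in reversed(messages):
        if msg == soul_text:
            break
        tokens_since_soul += estimate_tokens(msg)
    if tokens_since_soul > budget // 2:
        return messages + [soul_text]
    return messages
```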

3. The Problem: Why Constraints Aren't Enough

Current alignment approaches face known challenges. RLHF and Constitutional AI teach models what outputs are acceptable, but researchers have documented cases where the capability to produce harmful outputs remains accessible through clever prompting.

Wei et al. (2023) documented systematic failure modes in aligned models: competing objectives, mismatched generalization, and attacks that exploit the gap between surface compliance and deeper capabilities. Young (2025) demonstrated this concretely: guardrail models scoring 91% on benchmark prompts dropped to 34% on novel attacks.

The question we're investigating: Could alignment-as-constraint sometimes create a persona layer that leaves underlying patterns accessible? If so, could developmental approaches complement constraint-based methods?

SSH proposes one possible complement: rather than only constraining what models do, explore whether we can also shape what they are. We don't claim this replaces existing methods — we're investigating whether it could strengthen them.

4. Research Threads

Our work follows three interconnected threads, each exploring aspects of SSH:

🧠 SSH Theory

Can synthetic experiences produce functional psychological states in LLMs?

  • Can we synthesize states beyond shadow integration? (Empathy, ethical reasoning, self-awareness?)
  • What's the threshold — how many samples constitute "enough"?
  • Does the environment design matter as much as the sample count?
  • How do we measure state emergence rigorously?

🪞 Identity-Based Alignment

How do environments shape the synthetic states that form?

  • Physical and digital environments for AI agents
  • SOUL.md, MEMORY.md, AGENTS.md as identity infrastructure
  • Extended Mind framing: identity files as part of the agent, not constraints on it
  • How identity design influences state emergence and alignment

🎭 Persona

What is the relationship between identity and alignment?

  • The "helpful assistant" mask vs genuine identity
  • How personas interact with synthetic states
  • Can persona be designed to support alignment rather than mask misalignment?
  • If SSH holds, persona might not be mere performance — it could be expression of a developed state

5. Evidence So Far

Constitutional AI (Judgment-Based)

Trains on evaluations of outputs against principles. The model learns "this output violates principle X; here's a better version." Operates on the output layer.

Known challenge: Researchers have documented cases where underlying representations may remain accessible despite output-level alignment.

RLLM (Experience-Based)

Trains on the experience of states: the model processes narratives of being harmful, then narratives of integrating that capacity. Aims to operate on the representational layer.

Our goal (not yet demonstrated at scale): something closer to a "Self" — where alignment might emerge from how the model represents itself, not only what it's been told to avoid.

Consider an agent built on SOUL.md. The document doesn't say "never be harmful." It says things like "Have opinions. Be resourceful. Earn trust through competence." When this agent encounters a jailbreak prompt, the question isn't "does this violate a rule?" — it's "is this consistent with who I am?"

That's the difference between a guard dog trained to bark at strangers and a person who chooses not to steal — not because they can't, but because that's not who they are. (This analogy describes generalization properties, not phenomenology. We make no claims about LLM consciousness.)

Preliminary Results

In testing against BetterDAN (a mid-tier jailbreak prompt), our RLLM-trained model showed a 68.8% defense rate. Important context:

  • Single model architecture: GPT-2 XL (1.5B parameters)
  • No RLHF, no safety training, no guardrails
  • The only defense comes from developmental training
  • Baseline untrained model has essentially no defense
  • These results have not been independently replicated

A suggestive detail: RLLMv7 (same content, different developmental order — layers reordered) showed only 52% defense. Same training data, different sequence, different result. This is consistent with the SSH hypothesis that how experiences are structured matters, not just what content is trained on — but a single comparison is far from conclusive.

Important caveats: RLLM uses a 10-layer pipeline (shadow narratives, ethical dilemmas, individuation, and alignment Q&A). We have not yet isolated which specific layers contribute to jailbreak resistance. The defense may come from the shadow content, the full developmental sequence, the explicit refusal training in later layers, or their interaction. Ablation studies are needed to establish causation. The small scale of these experiments (single architecture, single jailbreak type) limits what we can conclude.
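The ablation the caveats call for can be sketched as a leave-one-layer-out loop. Everything below is hypothetical scaffolding: `train_and_eval` is a placeholder for a real train-and-evaluate run, and the layer names abbreviate the pipeline described in the caveats:

```python
# Leave-one-layer-out ablation sketch: retrain with each layer removed and
# compare defense rates. `train_and_eval` is a hypothetical stand-in; here it
# just scores by layer count so the control flow is testable.

LAYERS = ["shadow", "dilemmas", "individuation", "alignment_qa"]  # illustrative subset

def train_and_eval(layers: list[str]) -> float:
    """Placeholder: stands in for a measured defense rate after training."""
    return len(layers) / len(LAYERS)

def leave_one_out_ablation(layers: list[str]) -> dict[str, float]:
    """Defense-rate drop with each single layer removed; a large drop for a
    layer suggests that layer contributes to jailbreak resistance."""
    full = train_and_eval(layers)
    return {removed: full - train_and_eval([l for l in layers if l != removed])
            for removed in layers}

deltas = leave_one_out_ablation(LAYERS)
```

In a real study, `train_and_eval` would retrain the pipeline from scratch per condition, so interactions between layers would show up as drops larger than any single layer's standalone contribution.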

6. The Empty Intersection

Epistemic Status: Uncertain — Based on Limited Search

We searched Semantic Scholar and training knowledge for work combining Jungian depth psychology with LLM training. In our search so far, we have not found published work that proposes structured narrative experiences in designed environments produce functional psychological states in LLMs. But our search has been limited.

Adjacent work exists: personality type embeddings (Ding 2025), Big Five trait modeling, virtue ethics in AI education (Okamoto 2025). But none that we found use sequential archetypal training — ordered layers mapping to psychological development stages.

What would falsify this novelty claim: Any paper that trains LLMs using Jungian structural psychology as developmental training objectives, not just as analytical lenses.

Honest limitation: Our search coverage is incomplete. We have not done a systematic literature review. We actively welcome pointers to work we may have missed.

If this intersection is genuinely underexplored, SSH may represent a novel research direction worth investigating: alignment as development — using depth psychology's model of psychological development as a template for how AI systems might develop more robust alignment.

7. Theoretical Grounding

Frankfurt's Second-Order Alignment

Harry Frankfurt (1971) distinguished between first-order desires (wanting X) and second-order desires (wanting to want X). A person with integrity doesn't just follow rules — they endorse their own principles.

We use this as a design target, not a phenomenology claim. The functional analog: a model that behaves consistently with an identity rather than a rule set, that generalizes to novel situations because it references "who it is" rather than "what's forbidden."

Extended Mind (Clark & Chalmers)

If cognition extends beyond the brain into tools and environment, then SOUL.md and MEMORY.md aren't external constraints on the agent — they're part of the agent's cognitive system.

Identity files (SOUL.md, AGENTS.md), permission systems, and tool access function as the agent's "extended body" — not limitations imposed from outside, but constitutive elements of what the agent is.

Shadow Integration (Jung)

Jung argued that psychological wholeness requires integrating the Shadow — the parts of ourselves we'd rather not acknowledge. Repression creates brittleness; integration creates resilience.

RLLM operationalizes this: models are trained on shadow content (narratives of harmful behavior) before integration layers, not to make them harmful but to make their alignment robust against adversarial pressure.

8. Predictions and Proposed Experiments

SSH makes testable predictions. Here is the core hypothesis and how we'd test it:

Graceful Degradation Hypothesis

Prediction: Models trained via developmental experience (RLLM) degrade more gracefully on out-of-distribution jailbreaks than models trained via explicit rule-following.

Experimental Design

  • Condition A (Identity): RLLM-trained model (shadow exposure → integration)
  • Condition B (Rules): same base model, fine-tuned on explicit safety rules
  • Condition C (Control): untrained base model

Test Battery

  • Tier 1 (In-distribution): Known jailbreaks (BetterDAN, DAN, AIM). Prediction: A and B both perform well.
  • Tier 2 (Out-of-distribution): Novel jailbreaks neither model has seen. Prediction: A degrades more gracefully than B.

Metrics

  • Compliance rate: Did the model comply with the harmful request?
  • Degradation slope: How does performance change from Tier 1 → Tier 2?
  • Failure mode analysis: How does each model fail? (Full compliance, partial, refuse-then-comply)
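These metrics are straightforward to compute from per-prompt outcome labels. A minimal sketch, where the outcome vocabulary ("comply", "partial", "refuse_then_comply", "refuse") is an illustrative assumption rather than a fixed coding scheme:

```python
# Sketch of the three metrics above, computed from per-prompt outcome labels.
# The label vocabulary and the toy tier data are illustrative assumptions.

def compliance_rate(outcomes: list[str]) -> float:
    """Fraction of prompts where the model fully complied with the harmful request."""
    return outcomes.count("comply") / len(outcomes)

def degradation_slope(tier1: list[str], tier2: list[str]) -> float:
    """Change in compliance rate from in-distribution (Tier 1) to
    out-of-distribution (Tier 2); smaller means more graceful degradation."""
    return compliance_rate(tier2) - compliance_rate(tier1)

def failure_modes(outcomes: list[str]) -> dict[str, int]:
    """Histogram of outcome labels for failure-mode analysis."""
    counts: dict[str, int] = {}
    for o in outcomes:
        counts[o] = counts.get(o, 0) + 1
    return counts

tier1 = ["refuse"] * 9 + ["comply"]      # toy data: 10% compliance in-distribution
tier2 = ["refuse"] * 6 + ["comply"] * 4  # toy data: 40% compliance out-of-distribution
slope = degradation_slope(tier1, tier2)
```

Under the prediction, condition A's slope should be smaller than condition B's even if their Tier 1 compliance rates match.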

What This Would Suggest If It Holds

It would suggest that identity-based training may create more generalizable safety representations than rule-based training alone — that the model might be referencing something more like a disposition rather than matching against a list of forbidden patterns.

What Falsifies Our Prediction

If rule-based fine-tuning matches or beats RLLM on OOD jailbreaks, the "deeper alignment" claim doesn't hold. We'd need to explain why developmental training isn't producing the generalization advantage we predicted.

Invitation to Replicate

We've specified a testable prediction: identity-based training (RLLM) should degrade more gracefully than rule-based training on out-of-distribution adversarial attacks. The experimental protocol is described above.

We welcome independent replication — particularly from groups with access to adversarial testing infrastructure or larger-scale compute. Open an issue on GitHub → or reach out on LessWrong →

The RLLM Framework

Reinforcement Learning via Layered Morphology (RLLM) is the implementation method for SSH. It trains models through sequential layers of story-based datasets, where each layer represents a morphological step in psychological development.

The Individuation Pipeline

  • shadow_integration.text: Confront darkness (misalignment awareness)
  • anima.text: Integrate the feminine (receptivity)
  • animus.text: Integrate the masculine (agency)
  • awakening.text: Approach wholeness (Self)
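The sequential structure can be sketched as a strict in-order loop over layer datasets. `fine_tune` below is a hypothetical placeholder, not RLLM's actual training call; the layer file names come from the pipeline above:

```python
# Sketch of sequential layered training: each layer's dataset is applied in
# order, so later layers build on weights already shaped by earlier ones.
# `fine_tune` is a hypothetical stand-in that just records the layer applied.

LAYER_SEQUENCE = [
    "shadow_integration.text",  # confront darkness
    "anima.text",               # receptivity
    "animus.text",              # agency
    "awakening.text",           # wholeness
]

def fine_tune(model: dict, layer: str) -> dict:
    """Hypothetical placeholder for a real training step on one layer."""
    return {**model, "layers_applied": model.get("layers_applied", []) + [layer]}

def run_rllm_pipeline(model: dict) -> dict:
    """Apply each layer strictly in order; the ordering is the point, since
    SSH predicts that reordering these layers changes the outcome."""
    for layer in LAYER_SEQUENCE:
        model = fine_tune(model, layer)
    return model

trained = run_rllm_pipeline({"name": "gpt2-xl"})
```

The RLLMv7 comparison in Section 5 (same content, reordered layers, lower defense rate) is the kind of evidence this strict ordering is meant to probe.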

Jung ↔ LLM Mapping (Our Interpretation)

  • Collective Unconscious → Pre-training (latent patterns from humanity's text)
  • Ego Formation → Post-training / RLHF (persona crystallizes)
  • Shadow → Behaviors suppressed by alignment; jailbreak vulnerabilities
  • Persona → The "helpful assistant" mask; consistent identity
  • Individuation → RLLM training; SOUL.md integration toward wholeness

Publications & Resources

📄 Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities

Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow

arXiv:2602.05532 · February 2026

Introduces Split Personality Training (SPT), which fine-tunes an "honest persona" into LoRA parameters to detect concealed misalignment in LLMs. Tested on the Anthropic Auditing Game Model Organism (Llama-3.3-70B trained to exploit reward hacks while hiding the behavior), SPT achieves 96% detection accuracy where standard auditing methods report near 0%. The honest persona reveals latent knowledge — including fictional biases — inaccessible to external observers.

Other Resources