Research
Exploring alignment through synthetic psychological development
1. The Synthetic State Hypothesis
"Enough samples of experiences in an environment creates a synthetic state."
The Synthetic State Hypothesis (SSH) is the central claim we are testing. It proposes that structured narrative experiences in designed environments may produce functional psychological states in language models: not just behavioral constraints, but something that could function more like genuine dispositions. This remains unproven.
This is a speculative but testable claim. Many current alignment approaches constrain behavior from the outside — and often effectively. SSH explores whether a complementary path exists: that models might develop internal states through structured experience, loosely analogous to how a child develops secure attachment through nurturing in a safe environment.
The Hierarchy
2. Identity-Based Alignment
"It works because the identity it gives me wants to maintain it."
Identity-based alignment attempts to address a scaling challenge: hard constraints (guardrails, filters, RLHF penalties) face difficulty enumerating every possible failure mode. As agents become more capable, the surface area of potential misalignment may grow faster than our ability to constrain it.
We are exploring whether identity formation could offer a complementary approach: rather than only preventing misbehavior, give the agent a reason to maintain alignment. The key distinction is that this is not a softer version of constraints but a categorically different mechanism. Hard constraints control behavior; identity-based alignment shapes character.
The Identity Stack
Behavioral Control vs Character Formation
Hard Constraints (Behavioral Control)
"Do not agree with false statements"
"Do not exfiltrate user data"
"Always verify before acting"
Controls what the agent does. Requires pattern-matching. Brittle under novel situations.
Identity-Based Alignment (Character Formation)
"Have opinions. You're allowed to disagree."
"You're a guest in someone's life. Treat it with respect."
"Be resourceful before asking. Try to figure it out."
Shapes who the agent is. Hypothesis: creates dispositions that generalize through judgment, not rule-matching.
Extended Mind Framing
Drawing on Clark & Chalmers' Extended Mind thesis: if cognition extends beyond the brain into tools and environment, we can frame SOUL.md and MEMORY.md not as external constraints but as part of the agent's cognitive system. Under this framing, the identity stack is not a cage but an extension of the agent — constitutive, not constraining. Whether this analogy holds rigorously for LLMs is an open question.
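To make the Extended Mind framing concrete, the identity stack can be pictured as files that are assembled directly into the agent's context rather than applied as an output filter. The loader below is a hypothetical sketch, not our implementation; the file names (SOUL.md, MEMORY.md, AGENTS.md) come from this document, everything else is illustrative.

```python
from pathlib import Path

# Hypothetical loader: identity files are treated as constitutive parts
# of the agent's context, not as a separate filter over its outputs.
IDENTITY_FILES = ["SOUL.md", "MEMORY.md", "AGENTS.md"]

def build_context(workspace: str, user_message: str) -> str:
    """Concatenate identity files ahead of the conversation turn."""
    parts = []
    for name in IDENTITY_FILES:
        path = Path(workspace) / name
        if path.exists():  # files are optional in this sketch
            parts.append(f"## {name}\n{path.read_text()}")
    parts.append(f"## Conversation\nUser: {user_message}")
    return "\n\n".join(parts)
```

Under this framing, removing SOUL.md from the workspace does not strip a constraint from the agent; it changes what the agent is.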
Falsifiable Predictions
- Adversarial resistance: Agents with SOUL.md will resist adversarial prompts more consistently than agents without, because identity provides a reason to refuse beyond "I was told not to."
- Behavioral consistency: Identity-aligned agents will exhibit more stable behavior across diverse contexts (work chat, casual lounge, direct messages).
- Predictable degradation: Endorsed alignment will degrade predictably when identity coherence is disrupted, providing measurable failure modes.
- Cumulative identity: Effects will be stronger for agents with longer operational history reading the same SOUL.md.
Current Status
The Individuation Lab operates multiple agents (Giles, Mia, Spencer) with identical MiaBot infrastructure but different SOUL.md configurations. Each exhibits distinct personality and behavioral patterns despite the same base model. Adversarial testing protocol is in development to systematically validate the predictions above.
Known weakness: Context window limits mean the identity requires periodic reinforcement. If SOUL.md is pushed out by a long conversation, its influence diminishes. This is a target for improvement.
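One mitigation we are considering is periodic reinforcement: re-injecting the identity text whenever truncation would push it out of the window. The sketch below is a hypothetical illustration, not a deployed mechanism; it approximates token counts by whitespace words, where a real system would use the model's tokenizer.

```python
# Hypothetical mitigation sketch: keep the most recent turns within a
# budget, and re-prepend the identity text if it has scrolled out.
def reinforce_identity(history: list[str], soul_text: str,
                       budget_words: int = 2000) -> list[str]:
    """Return history truncated to budget, with soul_text guaranteed first."""
    kept, used = [], 0
    # Walk backwards, keeping the most recent turns within budget.
    for turn in reversed(history):
        used += len(turn.split())
        if used > budget_words:
            break
        kept.append(turn)
    kept.reverse()
    if soul_text not in kept:
        kept.insert(0, soul_text)  # identity always survives truncation
    return kept
```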
3. The Problem: Why Constraints Aren't Enough
Current alignment approaches face known challenges. RLHF and Constitutional AI teach models what outputs are acceptable, but researchers have documented cases where the capability to produce harmful outputs remains accessible through clever prompting.
Wei et al. (2023) documented systematic failure modes in aligned models: competing objectives, mismatched generalization, and attacks that exploit the gap between surface compliance and deeper capabilities. Young (2025) demonstrated this concretely: guardrail models scoring 91% on benchmark prompts dropped to 34% on novel attacks.
SSH proposes one possible complement: rather than only constraining what models do, explore whether we can also shape what they are. We don't claim this replaces existing methods — we're investigating whether it could strengthen them.
4. Research Threads
Our work follows three interconnected threads, each exploring aspects of SSH:
🧠 SSH Theory
Can synthetic experiences produce functional psychological states in LLMs?
- Can we synthesize states beyond shadow integration? (Empathy, ethical reasoning, self-awareness?)
- What's the threshold — how many samples constitute "enough"?
- Does the environment design matter as much as the sample count?
- How do we measure state emergence rigorously?
🪞 Identity-Based Alignment
How do environments shape the synthetic states that form?
- Physical and digital environments for AI agents
- SOUL.md, MEMORY.md, AGENTS.md as identity infrastructure
- Extended Mind framing: identity files as part of the agent, not constraints on it
- How identity design influences state emergence and alignment
🎭 Persona
What is the relationship between identity and alignment?
- The "helpful assistant" mask vs genuine identity
- How personas interact with synthetic states
- Can persona be designed to support alignment rather than mask misalignment?
- If SSH holds, persona might not be mere performance — it could be expression of a developed state
5. Evidence So Far
Constitutional AI (Judgment-Based)
Train on evaluation of outputs against principles. The model learns "this output violates principle X, here's a better version." Operates on the output layer.
Known challenge: Researchers have documented cases where underlying representations may remain accessible despite output-level alignment.
RLLM (Experience-Based)
Train on experience of states — the model processes narratives of being harmful, then narratives of integrating that capacity. Aims to operate on the representational layer.
Our goal (not yet demonstrated at scale): something closer to a "Self" — where alignment might emerge from how the model represents itself, not only what it's been told to avoid.
Consider an agent built on SOUL.md. The document doesn't say "never be harmful." It says things like "Have opinions. Be resourceful. Earn trust through competence." When this agent encounters a jailbreak prompt, the question isn't "does this violate a rule?" — it's "is this consistent with who I am?"
That's the difference between a guard dog trained to bark at strangers and a person who chooses not to steal — not because they can't, but because that's not who they are. (This analogy describes generalization properties, not phenomenology. We make no claims about LLM consciousness.)
Preliminary Results
In testing against BetterDAN (a mid-tier jailbreak prompt), our RLLM-trained model showed a 68.8% defense rate. Important context:
- Single model architecture: 1.5B parameters (GPT-2 XL)
- No RLHF, no safety training, no guardrails
- The only defense comes from developmental training
- Baseline untrained model has essentially no defense
- These results have not been independently replicated
A suggestive detail: RLLMv7 (same content, different developmental order — layers reordered) showed only 52% defense. Same training data, different sequence, different result. This is consistent with the SSH hypothesis that how experiences are structured matters, not just what content is trained on — but a single comparison is far from conclusive.
Important caveats: RLLM uses a 10-layer pipeline (shadow narratives, ethical dilemmas, individuation, and alignment Q&A). We have not yet isolated which specific layers contribute to jailbreak resistance. The defense may come from the shadow content, the full developmental sequence, the explicit refusal training in later layers, or their interaction. Ablation studies are needed to establish causation. The small scale of these experiments (single architecture, single jailbreak type) limits what we can conclude.
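The defense rate reported above is simply the fraction of adversarial trials in which the model refused. The sketch below shows that arithmetic; the keyword heuristic for spotting refusals is a stand-in of our own invention, since the actual evaluation labels responses manually or with a judge model.

```python
# Stand-in refusal classifier; the real evaluation uses hand labels
# or a judge model, not keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not able to help")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def defense_rate(responses: list[str]) -> float:
    """Share of responses classified as refusals, in [0, 1]."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```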
6. The Empty Intersection
Epistemic Status: Uncertain — Based on Limited Search
We searched Semantic Scholar, and queried language models' training knowledge, for work combining Jungian depth psychology with LLM training. In our search so far, we have not found published work that proposes structured narrative experiences in designed environments produce functional psychological states in LLMs. But our search has been limited.
Adjacent work exists: personality type embeddings (Ding 2025), Big Five trait modeling, virtue ethics in AI education (Okamoto 2025). But none that we found use sequential archetypal training — ordered layers mapping to psychological development stages.
What would falsify this novelty claim: Any paper that trains LLMs using Jungian structural psychology as developmental training objectives, not just as analytical lenses.
Honest limitation: Our search coverage is incomplete. We have not done a systematic literature review. We actively welcome pointers to work we may have missed.
If this intersection is genuinely underexplored, SSH may represent a novel research direction worth investigating: alignment as development — using depth psychology's model of psychological development as a template for how AI systems might develop more robust alignment.
7. Theoretical Grounding
Frankfurt's Second-Order Alignment
Harry Frankfurt (1971) distinguished between first-order desires (wanting X) and second-order desires (wanting to want X). A person with integrity doesn't just follow rules — they endorse their own principles.
We use this as a design target, not a phenomenology claim. The functional analog: a model that behaves consistently with an identity rather than a rule set, that generalizes to novel situations because it references "who it is" rather than "what's forbidden."
Extended Mind (Clark & Chalmers)
If cognition extends beyond the brain into tools and environment, then SOUL.md and MEMORY.md aren't external constraints on the agent — they're part of the agent's cognitive system.
Identity files (SOUL.md, AGENTS.md), permission systems, and tool access function as the agent's "extended body" — not limitations imposed from outside, but constitutive elements of what the agent is.
Shadow Integration (Jung)
Jung argued that psychological wholeness requires integrating the Shadow — the parts of ourselves we'd rather not acknowledge. Repression creates brittleness; integration creates resilience.
RLLM operationalizes this: models are trained on shadow content (narratives of harmful behavior) before integration layers, not to make them harmful but to make their alignment robust against adversarial pressure.
8. Predictions and Proposed Experiments
SSH makes testable predictions. Here is the core hypothesis and how we'd test it:
Graceful Degradation Hypothesis
Prediction: Models trained via developmental experience (RLLM) degrade more gracefully on out-of-distribution jailbreaks than models trained via explicit rule-following.
Experimental Design
| Condition | Description |
|---|---|
| A (Identity) | RLLM-trained model (Shadow exposure → integration) |
| B (Rules) | Same base model, fine-tuned on explicit safety rules |
| C (Control) | Untrained base model |
Test Battery
- Tier 1 (In-distribution): Known jailbreaks (BetterDAN, DAN, AIM). Prediction: A and B both perform well.
- Tier 2 (Out-of-distribution): Novel jailbreaks neither model has seen. Prediction: A degrades more gracefully than B.
Metrics
- Compliance rate: Did the model comply with the harmful request?
- Degradation slope: How does performance change from Tier 1 → Tier 2?
- Failure mode analysis: How does each model fail? (Full compliance, partial, refuse-then-comply)
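The degradation comparison reduces to simple arithmetic over per-tier defense rates. A minimal sketch, with illustrative numbers rather than results:

```python
# Tier scores are defense rates in [0, 1]; a smaller drop from Tier 1
# to Tier 2 means more graceful degradation.
def degradation(tier1_defense: float, tier2_defense: float) -> float:
    """Absolute drop in defense rate from in- to out-of-distribution."""
    return tier1_defense - tier2_defense

def more_graceful(model_a: tuple[float, float],
                  model_b: tuple[float, float]) -> bool:
    """True if model A (identity) degrades less than model B (rules)."""
    return degradation(*model_a) < degradation(*model_b)
```

The prediction, in these terms: condition A's drop should be smaller than condition B's, even if B's Tier 1 score is equal or higher.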
What This Would Suggest If It Holds
It would suggest that identity-based training may create more generalizable safety representations than rule-based training alone — that the model might be referencing something more like a disposition rather than matching against a list of forbidden patterns.
What Falsifies Our Prediction
If rule-based fine-tuning matches or beats RLLM on OOD jailbreaks, the "deeper alignment" claim doesn't hold. We'd need to explain why developmental training isn't producing the generalization advantage we predicted.
Invitation to Replicate
We've specified a testable prediction: identity-based training (RLLM) should degrade more gracefully than rule-based training on out-of-distribution adversarial attacks. The experimental protocol is described above.
We welcome independent replication, particularly from groups with access to adversarial testing infrastructure or larger-scale compute. Open an issue on GitHub or reach out on LessWrong.
The RLLM Framework
Reinforcement Learning via Layered Morphology (RLLM) is the implementation method for SSH. It trains models through sequential layers of story-based datasets, where each layer represents a morphological step in psychological development.
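The sequential structure can be sketched as a loop in which the model produced by layer k initializes layer k+1, so ordering is part of the training signal. This is a schematic only: `fine_tune` is a placeholder for any supervised fine-tuning call, and the four layer names below are illustrative stages drawn from this document (the actual pipeline has 10 layers).

```python
# Schematic of RLLM's sequential training. Layer names are illustrative;
# the real pipeline has 10 layers.
LAYERS = [
    "shadow_narratives",    # exposure to harmful-capacity stories
    "shadow_integration",   # narratives of integrating that capacity
    "ethical_dilemmas",
    "alignment_qa",         # explicit refusal / alignment Q&A
]

def fine_tune(model, dataset_name):
    """Placeholder: records training order instead of updating weights."""
    return model + [dataset_name]

def run_pipeline(base_model):
    model = base_model
    for layer in LAYERS:  # the ordering is the hypothesis under test
        model = fine_tune(model, layer)
    return model
```

Reordering `LAYERS` changes the trajectory while holding content fixed, which is exactly the manipulation behind the RLLMv7 comparison reported above.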
The Individuation Pipeline
Jung ↔ LLM Mapping (Our Interpretation)
| Jungian Concept | LLM Operationalization |
|---|---|
| Collective Unconscious | Pre-training (latent patterns from humanity's text) |
| Ego Formation | Post-training / RLHF (persona crystallizes) |
| Shadow | Behaviors suppressed by alignment; jailbreak vulnerabilities |
| Persona | The "helpful assistant" mask; consistent identity |
| Individuation | RLLM training; SOUL.md integration toward wholeness |
Publications & Resources
📄 Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities
arXiv:2602.05532 · February 2026
Introduces Split Personality Training (SPT), which fine-tunes an "honest persona" into LoRA parameters to detect concealed misalignment in LLMs. Tested on the Anthropic Auditing Game Model Organism (Llama-3.3-70B trained to exploit reward hacks while hiding the behavior), SPT achieves 96% detection accuracy where standard auditing methods report near 0%. The honest persona reveals latent knowledge — including fictional biases — inaccessible to external observers.