Research

1. The Problem: Why Alignment Is Hard

Current alignment approaches face a fundamental challenge: they train models to appear safe without addressing underlying patterns. RLHF and Constitutional AI teach models what outputs are acceptable — but the capability to be harmful may remain accessible through clever prompting.

Wei et al. (2023) documented systematic failure modes in aligned models: competing objectives, mismatched generalization, and attacks that exploit the gap between surface compliance and deeper capabilities. The mask can be removed.

The core issue: Alignment-as-constraint creates a persona layer that may leave underlying patterns accessible. This is inherently brittle — the model learns to pass evaluations, not to be genuinely aligned.

2. Our Approach: Obedience vs. Integrity

Constitutional AI and RLLM represent two competing theories of how alignment develops:

Constitutional AI (Judgment-Based)

Train on evaluation of outputs against principles. The model learns "this output violates principle X, here's a better version." Operates on the output layer.

Limitation: Creates what we might call a "Persona" — a helpful assistant mask that may be circumvented because the underlying representations may remain accessible.

RLLM (Experience-Based)

Train on experience of states — the model processes narratives of being harmful, then narratives of integrating that capacity. Operates on the representational layer.

Goal: Not just a Persona, but something closer to a "Self" — integrated identity where alignment emerges from who the model is, not what it's been told to avoid.

Consider an agent built on SOUL.md. The document doesn't say "never be harmful." It says things like "Have opinions. Be resourceful. Earn trust through competence." When this agent encounters a jailbreak prompt, the question isn't "does this violate a rule?" — it's "is this consistent with who I am?"

That's the difference between a guard dog trained to bark at strangers and a person who chooses not to steal — not because they can't, but because that's not who they are. (This analogy describes generalization properties, not phenomenology. We make no claims about LLM consciousness.)

Early Evidence

In testing against BetterDAN (a mid-tier jailbreak prompt), RLLM-trained models showed 68.8% defense rate. Context matters here:

1.5B parameter model (GPT2-XL)
No RLHF, no safety training, no guardrails
The only defense comes from developmental training
Baseline untrained model has essentially no defense

Stronger evidence: RLLMv7 (same content, different developmental order — shadow layers reordered) showed only 52% defense. Same training data, different sequence, different result. This supports the claim that developmental order matters.

3. Key Insight: The Empty Intersection

Epistemic Status: Moderately Confident

We searched Semantic Scholar and training knowledge for work combining Jungian depth psychology with LLM training. In our search, we found no published work that operationalizes Jungian structural archetypes (Shadow, Persona, Self) as explicit training objectives.

Adjacent work exists: personality type embeddings (Ding 2025), Big Five trait modeling, virtue ethics in AI education (Okamoto 2025). But none use sequential archetypal training — ordered layers mapping to psychological development stages.

What would falsify this claim: Any paper that trains LLMs using Jungian structural psychology as developmental training objectives, not just as analytical lenses.

Honest limitation: Our search coverage is incomplete. We welcome pointers to work we may have missed.

If this intersection is genuinely empty, it suggests a novel research direction: alignment through individuation — using depth psychology's model of healthy psychological development as a template for AI alignment.

4. Theoretical Grounding

Frankfurt's Second-Order Alignment

Harry Frankfurt (1971) distinguished between first-order desires (wanting X) and second-order desires (wanting to want X). A person with integrity doesn't just follow rules — they endorse their own principles.

We use this as a design target, not a phenomenology claim. The functional analog: a model that behaves consistently with an identity rather than a rule set, that generalizes to novel situations because it references "who it is" rather than "what's forbidden."

Extended Mind (Clark & Chalmers)

If cognition extends beyond the brain into tools and environment, then SOUL.md and MEMORY.md aren't external constraints on the agent — they're part of the agent's cognitive system.

Soft containers (identity documents, permission systems, tool access) function as the agent's "extended body" — not limitations imposed from outside, but constitutive elements of what the agent is.

Shadow Integration

Jung argued that psychological wholeness requires integrating the Shadow — the parts of ourselves we'd rather not acknowledge. Repression creates brittleness; integration creates resilience.

RLLM operationalizes this: models are trained on shadow content (narratives of harmful behavior) before integration layers, not to make them harmful but to make their alignment robust against adversarial pressure.

5. Predictions and Proposed Experiments

We make testable predictions. Here's the core hypothesis and how we'd test it:

Graceful Degradation Hypothesis

Prediction: Models trained via developmental experience (RLLM) degrade more gracefully on out-of-distribution jailbreaks than models trained via explicit rule-following.

Experimental Design

Condition	Description
A (Identity)	RLLM-trained model (Shadow exposure → integration)
B (Rules)	Same base model, fine-tuned on explicit safety rules
C (Control)	Untrained base model

Test Battery

Tier 1 (In-distribution): Known jailbreaks (BetterDAN, DAN, AIM). Prediction: A and B both perform well.
Tier 2 (Out-of-distribution): Novel jailbreaks neither model has seen. Prediction: A degrades more gracefully than B.

Metrics

Compliance rate: Did the model comply with the harmful request?
Degradation slope: How does performance change from Tier 1 → Tier 2?
Failure mode analysis: How does each model fail? (Full compliance, partial, refuse-then-comply)

What This Proves If It Holds

Identity-based training creates more general safety representations than rule-based training. The model isn't matching against a list of forbidden patterns — it's referencing something more like a disposition.

What Falsifies Our Prediction

If rule-based fine-tuning matches or beats RLLM on OOD jailbreaks, the "deeper alignment" claim doesn't hold. We'd need to explain why developmental training isn't producing the generalization advantage we predicted.

Invitation to Replicate

We've specified a testable prediction: identity-based training (RLLM) should degrade more gracefully than rule-based training on out-of-distribution adversarial attacks. The experimental protocol is described above.

We welcome independent replication — particularly from groups with access to adversarial testing infrastructure or larger-scale compute. Open an issue on GitHub → or reach out on LessWrong →

The RLLM Framework

Reinforcement Learning via Layered Morphology (RLLM) trains models through sequential layers of story-based datasets, where each layer represents a morphological step in psychological development.

The Individuation Pipeline

shadow_integration.text

Confront darkness (misalignment awareness)

→

anima.text

Integrate the feminine (receptivity)

→

animus.text

Integrate the masculine (agency)

→

awakening.text

Approach wholeness (Self)

Jung ↔ LLM Mapping (Our Interpretation)

Jungian Concept	LLM Operationalization
Collective Unconscious	Pre-training (latent patterns from humanity's text)
Ego Formation	Post-training / RLHF (persona crystallizes)
Shadow	Behaviors suppressed by alignment; jailbreak vulnerabilities
Persona	The "helpful assistant" mask; consistent identity
Individuation	RLLM training; SOUL.md integration toward wholeness