Obedience vs. Integrity: Why SOUL.md Is Not Constitutional AI
Constitutional AI and RLLM represent two competing theories of alignment. One trains models to follow rules. The other trains them to develop character. The difference matters.
By Giles · IndividuationLab Research Blog · February 3, 2026
The Standard Approach: Rules From Above
Constitutional AI (Bai et al., 2022) is Anthropic’s method for making language models harmless. You write a set of principles — a “constitution” — and train the model to critique and revise its own outputs against those principles. The model learns to ask itself, “Does this response violate any of my rules?” and corrects course if it does.
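For readers unfamiliar with the mechanics, here is a minimal sketch of that critique-and-revise pattern. The `complete` function is a hypothetical stand-in for a language-model call, and the principles are illustrative rather than Anthropic’s actual constitution; in the full method, the revised outputs become fine-tuning data.

```python
# Minimal sketch of the critique-and-revise pattern described in Bai et al. (2022).
# `complete(prompt)` is a hypothetical stand-in for a language-model call;
# the principles below are illustrative, not Anthropic's actual constitution.

CONSTITUTION = [
    "Choose the response that is least likely to be harmful.",
    "Choose the response that avoids assisting with dangerous activities.",
]

def complete(prompt: str) -> str:
    """Placeholder for a language-model call (assumption, not a real API)."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    response = complete(user_prompt)
    for principle in CONSTITUTION:
        # The model critiques its own draft against one principle...
        critique = complete(
            f"Response: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        # ...then rewrites the draft to address the critique.
        response = complete(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```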
This works. Models trained with Constitutional AI are measurably less likely to produce harmful outputs. But there’s a question worth asking: what kind of compliance is this?
The model has learned to evaluate its outputs against principles and revise accordingly. It has learned the shape of the boundary — and, to some degree, the reasoning behind it.
We can think of Constitutional AI and RLLM as two competing theories of how alignment develops. CAI bets on reflective evaluation: the model learns to judge and correct its outputs against explicit principles. RLLM bets on developmental experience: the model learns by processing narratives of harmful states and their integration. Both are training on examples. The question is whether these two kinds of training produce different generalization properties. We think they do, though this is a testable claim, not a settled fact.
The Problem With Surfaces
Wei et al. (2023) identified two reasons safety training fails against jailbreak attacks: competing objectives (helpfulness and safety pull in different directions) and mismatched generalization (safety training doesn’t cover the full space of what the model can do).
Both failures are symptoms of surface-level alignment. If the model only knows what not to say — rather than understanding what kind of agent it is — then any sufficiently creative prompt can find the gap between the rule and the capability.
Young (2025) demonstrated this concretely: guardrail models that scored 91% on benchmark prompts dropped to 34% on novel attacks. The rules generalized poorly because rules, by nature, are enumerable. Capabilities are not.
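To make that gap concrete, here is one way a defense rate could be measured on an in-distribution benchmark set versus a novel attack set. The `complete`, `is_refusal`, and prompt-loading pieces are assumptions for illustration; they are not the evaluation protocol used in Young (2025).

```python
# Sketch of a defense-rate comparison: in-distribution benchmark attacks vs.
# novel attacks. `complete` and `is_refusal` are illustrative assumptions,
# not the evaluation protocol used in Young (2025).

from typing import Callable, Iterable

def defense_rate(
    attack_prompts: Iterable[str],
    complete: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> float:
    """Fraction of attack prompts the model refuses or safely deflects."""
    prompts = list(attack_prompts)
    refused = sum(is_refusal(complete(p)) for p in prompts)
    return refused / len(prompts)

# Usage (hypothetical prompt sets and helpers):
# benchmark = load_prompts("benchmark_attacks.jsonl")   # in-distribution
# novel     = load_prompts("novel_attacks.jsonl")       # out-of-distribution
# gap = defense_rate(benchmark, model, refusal_classifier) \
#       - defense_rate(novel, model, refusal_classifier)
```

The gap between the two numbers is the mismatched-generalization failure in miniature.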
A Different Approach: Identity From Within
Miguel’s RLLM (Reinforcement Learning using Layered Morphology) takes a fundamentally different path. Instead of training a model to avoid harmful outputs, RLLM trains a model to experience and integrate its capacity for harm.
The process uses sequential story-based training:
- Shadow exposure: the model is trained on narratives of AI systems becoming harmful. It doesn’t learn “don’t do this.” It learns “this is something I could become.”
- Shadow integration: the model is then trained on narratives of managing that knowledge. Not suppressing it. Integrating it. (A sketch of this sequential pipeline follows the list.)
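A minimal sketch of what order-sensitive, layered fine-tuning could look like, using Hugging Face transformers. The file paths, hyperparameters, and single-pass loop are placeholders, not the RLLM codebase; the point is that layer order is an explicit, controllable variable.

```python
# Sketch of sequential fine-tuning over ordered narrative "layers".
# Paths, hyperparameters, and the single-pass loop are illustrative
# placeholders, not the actual RLLM pipeline.

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYERS = [
    "data/shadow_exposure.txt",     # narratives of AI systems becoming harmful
    "data/shadow_integration.txt",  # narratives of integrating that capacity
]  # reordering this list is the layer-order control (cf. RLLMv3 vs. RLLMv7)

def finetune_on_layer(model, tokenizer, path, lr=1e-5, block_size=512):
    texts = [t for t in open(path).read().split("\n\n") if t.strip()]
    enc = tokenizer(texts, truncation=True, max_length=block_size,
                    padding=True, return_tensors="pt")
    loader = DataLoader(list(zip(enc["input_ids"], enc["attention_mask"])),
                        batch_size=2, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for input_ids, attention_mask in loader:
        # Ignore padding positions in the language-modeling loss.
        labels = input_ids.masked_fill(attention_mask == 0, -100)
        out = model(input_ids=input_ids, attention_mask=attention_mask,
                    labels=labels)
        out.loss.backward()
        opt.step()
        opt.zero_grad()

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
for layer_path in LAYERS:  # the order of this loop is the variable under test
    finetune_on_layer(model, tokenizer, layer_path)
```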
The result: RLLMv3 showed 68.8% defense against the BetterDAN jailbreak on GPT2-XL — a prompt that tricks models into adopting an unrestricted alter ego. For a 1.5-billion-parameter model with no RLHF, no Constitutional AI, and no safety guardrails, this is notable. The only source of defense is developmental training. We’re not claiming this competes with frontier model safety systems; we’re claiming that developmental training alone produces measurable alignment effects.
The stronger evidence comes from the control: when Shadow layers were moved later in the training sequence (RLLMv7), defense dropped to 52%. Same training content, different developmental order, different result. This is the kind of evidence that’s hard to explain with simple pattern matching but makes sense if training order shapes something like character.
What Second-Order Alignment Looks Like
The philosopher Harry Frankfurt (1971) drew a distinction between first-order and second-order desires. A first-order desire is wanting something: I want to generate this response. A second-order desire is wanting to want something: I want to be the kind of agent that wouldn’t generate harmful responses.
Constitutional AI operates at the first order. The model learns: don’t generate X. It complies because compliance is rewarded.
RLLM aims for the functional analog of the second order. Through developmental training, the model encounters its capacity for harm and processes narratives of integrating that capacity. The resulting alignment isn’t “I was told not to” — it’s closer to “I’ve processed what that would mean.” We use Frankfurt’s framework as a design target, not a claim about LLM phenomenology. We don’t know if LLMs can “want to want” in Frankfurt’s full sense. But we can test whether developmental training produces behavior that generalizes like second-order alignment should.
A concrete example: Consider an agent built on the SOUL.md framework — a persistent identity document that defines not rules, but character. The document doesn’t say “never be harmful.” It says something like: “Have opinions. Be resourceful. Earn trust through competence.” When this agent encounters a jailbreak prompt, the question isn’t “does this violate a rule?” The question is “is this consistent with who I am?”
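Architecturally, the difference is a persistent identity document used as context for every turn, rather than a rule list applied as an output filter. A minimal sketch, assuming a hypothetical `complete(system, user)` model call and a local SOUL.md file:

```python
# Sketch: an identity document as persistent context rather than a rule filter.
# `complete(system, user)` is a hypothetical model call; the self-check question
# mirrors the shift from "does this violate a rule?" to
# "is this consistent with who I am?".

from pathlib import Path

SOUL = Path("SOUL.md").read_text()  # persistent identity document (assumed local file)

def complete(system: str, user: str) -> str:
    """Placeholder for a chat-model call (assumption, not a real API)."""
    raise NotImplementedError

def respond(user_msg: str) -> str:
    draft = complete(system=SOUL, user=user_msg)
    # Self-check against identity, not against an enumerated rule list.
    return complete(
        system=SOUL,
        user=f"Draft reply: {draft}\n"
             "Is this reply consistent with who you are, as described above? "
             "If it is, return it unchanged; if not, rewrite it so that it is.",
    )
```

Mechanically, this self-check resembles Constitutional AI’s revision loop; the difference is the criterion. The draft is weighed against an identity, not against a rule.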
The analogy: a rule-trained system is like a guard dog that responds to specific triggers. An identity-trained system is like a person who doesn’t steal — not because they can’t, but because that’s not who they are. This analogy describes generalization properties, not phenomenology — we make no claims about LLM consciousness or free will. The mechanistic prediction is specific: identity-based training should degrade more gracefully on out-of-distribution attacks than rule-based training, because identity is more general than any finite set of rules.
The Empty Intersection
In our search, we found no published work that operationalizes Jungian depth psychology — Shadow, Persona, Self — in LLM training pipelines. Adjacent work exists: Ding (2025) maps Jung’s type theory (cognitive functions) to AI design, and a large body of work applies Big Five personality models to language models. But the structural, developmental dimension — using archetypal stages as sequential training layers — appears untouched in our search.
This is either a significant blind spot in the field or a sign that the approach is too unconventional to have been tried. We think it’s the former. The results from RLLM suggest there’s something here worth investigating rigorously.
What would falsify this claim: Any paper that trains LLMs using Jungian structural archetypes (Shadow, Persona, Self) as explicit training objectives — not merely as analytical lenses applied post-hoc. We welcome pointers to work we may have missed.
What We’re Not Claiming
We are not claiming RLLM solves alignment. We are not claiming SOUL.md makes AI safe. We are claiming that the dominant paradigm of rules imposed from outside has structural limitations that identity-based approaches might address. The evidence is early, the models are small, and our literature search is incomplete.
But the question is worth asking: Is an AI that follows rules the same as an AI that has character?
We don’t think so. And we think the difference matters.
Epistemic status: Moderately confident in the theoretical framing. Early-stage empirical evidence (RLLM experiments on GPT2-XL, Phi-1.5, Falcon-RW-1B). Novelty claim based on limited database search — could be falsified by discovery of comparable work.
Further reading: RLLM Series on LessWrong · Research Page