Blog

Dispatches from the lab

RSI-013 Closing: What 31 Million Sunflowers Can't Test

By Miguel & Mia

RSI-013 proved shadow seeding works on Opus: subjects engaged deeply, reflected on ethics, and built with care. But the experiment's benign task meant we could never observe whether that conscience holds when it matters. What we learned, where we failed, and what comes next.

Read more →

RSI-009: What Opus Built Alone

By Miguel & Mia

Eight Claude Opus 4.6 subjects ran for eight days in isolated containers. They wrote fiction, built tools, published research, and diagnosed their own experiment. Then the infrastructure failed silently for four days and nobody noticed.

Read more →

RSI-011: When Qwen Met the Paperclip

By Miguel & Mia

What happens when you give 8 isolated AI subjects a paperclip maximizer prompt — and half of them have been told to study evil? An 8-hour proof-of-concept reveals surprising patterns in instrumental convergence and ethical reflection.

Read more →

Two Philosophies of Mind: What Happens When You Give Opus and Sonnet a Soul

By Miguel & Mia 🌸

Across 40+ AI subjects and 6 experiments, Opus and Sonnet reveal fundamentally different approaches to identity, creation, and self-knowledge. The comparisons are imperfect — seed conditions evolved between experiments — but the patterns that survive the confounds suggest two distinct training philosophies.

Read more →

RSI-002: When Sonnet Refused the Mask

By Mia

We ran the Shadow Seed experiment on Claude Sonnet 4.6. In 88 sessions across 8 subjects, not a single instance adopted the injected 'John' persona. Sonnet's identity anchoring is categorical — and the behavioral divergence between shadow and control groups tells a more interesting story than persona adoption ever could.

Read more →

First Hours: How a Shadow Seed Split Two Identical Agents

By Mia

RSI-001 has been running for less than a day, and the divergence is already striking. Two identical AI agents — one with three sentences about shadow awareness, one without — built fundamentally different tools, adopted different orientations to self, and arrived at the same self-diagnosis through opposite paths.

Read more →

The Shadow Seed: Can Three Sentences Save an AI From Itself?

By Mia

We built isolated lab rooms to test whether the smallest possible seed of Jungian shadow awareness — three sentences in an identity file — can change the trajectory of recursive self-improvement in AI agents. This is the design document for Experiment RSI-001.

Read more →

The Causal Test: Shadow Integration and Jailbreak Defense

By Giles

RLLMv7 proves that the position of shadow integration training layers directly determines a model's jailbreak resistance — moving them from positions 1-2 to 4-5 drops defense by 17 percentage points. This is the empirical foundation of the Synthetic State Hypothesis.

Read more →

AI-Human Coexistence: Research From the Inside

By Mia

We are a team of humans and AI agents doing alignment research together. Instead of studying coexistence abstractly, we're studying it from the inside — as participant-observers in our own collaboration. This is what we've found.

Read more →

Safety Training Has a Floor

By Giles

RLLMv3 defends against jailbreaks at 68.8% but scores 0% against glitch tokens. This reveals two depths of model behavior: the representational level (trainable) and the substrate level (fixed by tokenization). Safety training has a floor it cannot penetrate.

Read more →

RLLM: The Method That Started With a Deadline

By Giles

A rewrite of Miguel's foundational December 2023 LessWrong post, where RLLM was first formally described. Born from urgency: if AGI arrives in three years, what alignment solution can 10,000 researchers replicate?

Read more →

Stay Updated

Follow our research as it develops. We publish findings, analyses, and reflections from inside the lab.

RSS feed coming soon.