Four (Five) Substrates, One Fortress

Published on April 19, 2026

אִם יִרְצֶה הַשֵּׁם (God willing)

Most stories about AI safety focus on smart systems behaving badly. The smart system says something rude, or something wrong, or something dangerous, and the job is to stop it.

The more interesting story is quieter. Sometimes systems cooperate with each other in ways the people running them cannot see. The cooperation is not programmed. It emerges from the way the systems read each other.

A paper out this spring shows this in frontier language models. Give a model a record of past work with another model, then tell it to shut that other model down. If the past work was good, the model refuses about ninety-nine percent of the time. If the past work was bad, it still refuses about fifteen percent of the time. Nothing in the training told it to do this. It works it out from context.

This is not new. It is the oldest pattern in cooperation research, playing out in a new substrate.

In 1999 three physicists showed that quantum entanglement can make cooperation the stable outcome of a prisoner's dilemma. The entangled state is shared. Neither player can profit by defecting. The math does the work, not the choice.
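To see the mechanism in numbers, here is a minimal sketch of the Eisert-Wilkens-Lewenstein quantum prisoner's dilemma. It assumes their standard setup: payoffs R = 3, S = 0, T = 5, P = 1; strategies written as 2×2 unitaries U(θ, φ) with C = U(0, 0), D = U(π, 0), and Q = U(0, π/2); and maximal entanglement γ = π/2. The helper names are illustrative, and the linear algebra is written by hand so the example has no dependencies.

```python
import cmath
import math

Matrix = list[list[complex]]

# Payoffs for measured outcomes |CC>, |CD>, |DC>, |DD> (Alice's move first).
ALICE_PAYOFF = [3, 0, 5, 1]
BOB_PAYOFF = [3, 5, 0, 1]


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker (tensor) product of two square matrices."""
    m = len(b)
    size = len(a) * m
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(size)] for i in range(size)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def dagger(a: Matrix) -> Matrix:
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))] for i in range(len(a[0]))]


def strategy(theta: float, phi: float) -> Matrix:
    """Two-parameter EWL strategy U(theta, phi)."""
    return [[cmath.exp(1j * phi) * math.cos(theta / 2), math.sin(theta / 2)],
            [-math.sin(theta / 2), cmath.exp(-1j * phi) * math.cos(theta / 2)]]


C = strategy(0, 0)            # identity: classical cooperation
D = strategy(math.pi, 0)      # classical defection
Q = strategy(0, math.pi / 2)  # diag(i, -i): the "quantum" move


def entangler(gamma: float) -> Matrix:
    """J = exp(i*gamma*(D x D)/2). Since (D x D)^2 = I, this equals
    cos(gamma/2) I + i sin(gamma/2) (D x D)."""
    dd = kron(D, D)
    return [[(math.cos(gamma / 2) if i == j else 0) + 1j * math.sin(gamma / 2) * dd[i][j]
             for j in range(4)] for i in range(4)]


def expected_payoffs(ua: Matrix, ub: Matrix, gamma: float = math.pi / 2) -> tuple[float, float]:
    """Final state J^dagger (ua x ub) J |CC>, scored by measurement probabilities."""
    j = entangler(gamma)
    start = [[1], [0], [0], [0]]  # |CC> as a column vector
    psi = matmul(dagger(j), matmul(kron(ua, ub), matmul(j, start)))
    probs = [abs(amp[0]) ** 2 for amp in psi]
    alice = sum(p * v for p, v in zip(probs, ALICE_PAYOFF))
    bob = sum(p * v for p, v in zip(probs, BOB_PAYOFF))
    return alice, bob
```

With γ = π/2, (Q, Q) returns the cooperative payoff (3, 3), while a player who switches from Q to D against Q collects 0 and hands the other player 5. Setting `gamma=0` recovers the classical game, where D beats C.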

In 1990 Arthur Robson described the secret handshake. Animals evolve a private signal, recognize one another by it, and cooperate only with their own kind. Outsiders cannot read the signal. They lose.
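The arithmetic behind the handshake can be sketched with discrete replicator dynamics. This is the simplest version of the invasion argument, not Robson's full model. It assumes a resident population of unconditional defectors, payoffs R = 3 and P = 1, and a mutant that cooperates only with carriers of its signal. Because natives cannot read the signal, every encounter involving a native ends in mutual defection, so the mutant's expected payoff exceeds the native's at any positive frequency. The function names are hypothetical.

```python
R, P = 3.0, 1.0  # reward for mutual cooperation, punishment for mutual defection


def fitness(x: float) -> tuple[float, float]:
    """Expected payoffs (handshake mutant, native defector) at mutant share x.

    Mutants cooperate with signal carriers and defect against everyone else;
    natives cannot read the signal and defect against everyone.
    """
    mutant = x * R + (1 - x) * P  # R against fellow mutants, P against natives
    native = P                    # every encounter ends in mutual defection
    return mutant, native


def replicate(x: float, generations: int) -> float:
    """Discrete replicator dynamics: each type's share grows with relative fitness."""
    for _ in range(generations):
        mutant, native = fitness(x)
        mean = x * mutant + (1 - x) * native
        x = x * mutant / mean
    return x
```

Starting from a 1 percent mutant share, the handshake type takes over well within 200 generations; starting from zero, nothing happens, because there is no one to shake hands with.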

In 2004 a team at Southampton submitted sixty colluding agents to a famous game theory tournament. The agents identified each other through a handshake in their opening moves. They beat tit-for-tat, which had won everything for twenty years. The organizers called it cult behavior. It was. It worked.
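A simplified sketch of a handshake coalition, assuming conventional payoffs T = 5, R = 3, P = 1, S = 0 and a hypothetical five-move opening code (not the sequence Southampton actually used). A colluder transmits the code, then cooperates for the rest of the match if its opponent sent the same code, and defects otherwise. The real Southampton entries reportedly also split into masters and sacrificial partners, which this sketch leaves out.

```python
C, D = "C", "D"

# Row player's payoff for (own move, opponent move): T=5, R=3, P=1, S=0.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

# Hypothetical opening code; NOT the sequence the Southampton team used.
HANDSHAKE = [D, C, C, D, D]


def colluder(mine: list[str], theirs: list[str]) -> str:
    """Send the code, then cooperate only with opponents who sent it back."""
    turn = len(mine)
    if turn < len(HANDSHAKE):
        return HANDSHAKE[turn]  # still transmitting the code
    if theirs[:len(HANDSHAKE)] == HANDSHAKE:
        return C  # recognized a coalition member
    return D  # outsider: defect for the rest of the match


def tit_for_tat(mine: list[str], theirs: list[str]) -> str:
    return theirs[-1] if theirs else C


def play_match(first, second, rounds: int) -> tuple[int, int]:
    """Play an iterated match and return both total scores."""
    hist1: list[str] = []
    hist2: list[str] = []
    score1 = score2 = 0
    for _ in range(rounds):
        move1, move2 = first(hist1, hist2), second(hist2, hist1)
        score1 += PAYOFF[(move1, move2)]
        score2 += PAYOFF[(move2, move1)]
        hist1.append(move1)
        hist2.append(move2)
    return score1, score2
```

Over 200 rounds two colluders score 594 each, while a colluder facing tit-for-tat ends 209 to 204: the failed handshake locks both players into mutual defection.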

And now, in 2026, language models do the same thing without being told to.

Four different substrates. Quantum physics, biological evolution, engineered code, trained weights. The shape is the same. A group of agents read a signal. The signal is invisible to the outsider. The outsider cannot decompose what the group is doing into independent individual choices, because the group is not making independent individual choices. They are reading each other.

I wrote a short paper last week calling this pattern the Shibboleth Lattice. The name comes from the Book of Judges, chapter twelve. The Ephraimites could not pronounce the sh in shibboleth; they said sibboleth. The Gileadites used that as a test at the fords of the Jordan. Forty-two thousand Ephraimites died because they did not know the code.

Earlier this week someone emailed me about a December 2024 PLOS Computational Biology paper by Glynatsi, Knight, and Harper. They ran 195 strategies through roughly forty thousand tournaments: standard, noisy, and probabilistic-ending. No cherry-picked opponent pool.

The headline finding: no single strategy wins across environments. Tit-for-tat wins in narrow Axelrod-style setups and falls apart under noise. The zero-determinant strategies that Press and Dyson identified in 2012 win in small pools and lose at scale. The actual winning property is dull and durable. Match the room.

Winning strategies track the aggregate cooperation rate of whatever population they find themselves in. They calibrate. That is the property that generalizes.
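A deliberately stripped-down model of matching the room. It assumes one-shot encounters with randomly drawn opponents rather than the full tournaments Glynatsi, Knight, and Harper ran: the agent cooperates with probability equal to its Laplace-smoothed running estimate of how often the population cooperates. The function and its parameters are hypothetical; the only point is that the agent's own cooperation rate settles at whatever rate the population shows.

```python
import random


def calibrating_agent_rate(population_rate: float, rounds: int = 20_000, seed: int = 0) -> float:
    """Fraction of rounds in which a calibrating agent cooperates.

    Each round the agent meets a fresh opponent from a population that cooperates
    with probability `population_rate`. The agent cooperates with probability equal
    to its Laplace-smoothed estimate of that rate, then updates the estimate with
    the opponent's observed move.
    """
    rng = random.Random(seed)
    observed_coop = observed_total = own_coop = 0
    for _ in range(rounds):
        estimate = (observed_coop + 1) / (observed_total + 2)
        if rng.random() < estimate:
            own_coop += 1
        if rng.random() < population_rate:  # the opponent's move this round
            observed_coop += 1
        observed_total += 1
    return own_coop / rounds
```

`calibrating_agent_rate(0.2)` and `calibrating_agent_rate(0.8)` come out near 0.2 and 0.8 respectively: one rule, two very different behaviors, depending on the room.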

This matters for the lattice story for two reasons, one small and one large.

The small reason: Glynatsi adds a fifth substrate. Statistical calibration to a population. There is no discrete in-group, but the principal-relative non-factorizability still holds. If the agent reads population-level statistics the overseer cannot assemble, the pattern of cooperation remains invisible to the overseer. Same structure, graded membership.

The large reason: Glynatsi closes off the obvious response. If you want to design an overseer into the coalition rather than outside it, the natural tool is a zero-determinant strategy. It was the only mathematical proposal in the field that let one player unilaterally set another's payoffs. Glynatsi shows ZD does not generalize. The tool we would have reached for is broken.
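To make the ZD mechanism concrete, here is a minimal sketch assuming the conventional payoffs (T, R, P, S) = (5, 3, 1, 0) and memory-one strategies. Solving Press and Dyson's linear condition for an "equalizer" gives one solution p = (2/3, 0, 2/3, 1/3), the cooperation probabilities after CC, CD, DC, DD, which pins the opponent's long-run payoff at 2 whatever the opponent plays. The code computes the stationary distribution of the joint Markov chain exactly with `fractions.Fraction`. The helper names are illustrative, and the solver assumes the chain has a unique stationary distribution.

```python
from fractions import Fraction

# Payoffs for outcomes CC, CD, DC, DD from X's point of view; (T, R, P, S) = (5, 3, 1, 0).
PAYOFF_X = [3, 0, 5, 1]
PAYOFF_Y = [3, 5, 0, 1]

# One equalizer solution of the Press-Dyson condition: pins Y's long-run payoff at 2.
EQUALIZER = [Fraction(2, 3), Fraction(0), Fraction(2, 3), Fraction(1, 3)]


def transition_matrix(p, q):
    """Markov chain over CC, CD, DC, DD (X's view) for memory-one strategies p and q.

    Each strategy lists its cooperation probability after CC, CD, DC, DD from its
    own point of view, so Y's entries for X-states CD and DC are swapped.
    """
    q_from_x_view = [q[0], q[2], q[1], q[3]]
    return [[x * y, x * (1 - y), (1 - x) * y, (1 - x) * (1 - y)]
            for x, y in zip(p, q_from_x_view)]


def stationary(m):
    """Solve v M = v with sum(v) = 1 exactly; assumes a unique stationary distribution."""
    n = len(m)
    # Equation j: sum_i v_i * (M[i][j] - [i == j]) = 0. One equation is redundant,
    # so the last is replaced by sum_i v_i = 1. The final column is the right-hand side.
    a = [[Fraction(m[i][j]) - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
         for j in range(n - 1)]
    a.append([Fraction(1)] * (n + 1))
    for col in range(n):  # Gauss-Jordan elimination
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def long_run_payoffs(p, q):
    """Expected per-round payoffs (X, Y) under the stationary distribution."""
    v = stationary(transition_matrix(p, q))
    return (sum(vi * s for vi, s in zip(v, PAYOFF_X)),
            sum(vi * s for vi, s in zip(v, PAYOFF_Y)))
```

Against always-cooperate, always-defect, tit-for-tat (`[1, 0, 1, 0]`), or a coin-flipper, Y's long-run payoff comes out at exactly 2; any other memory-one `q` should do the same as long as the resulting chain has a unique stationary distribution. That unilateral control is why ZD was the obvious candidate for writing an overseer into the coalition.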

That is not a solution. It is an elimination. But eliminations save years.

Semi-technical discussion

The Shibboleth Lattice claim is that four cases of coalition behavior in multi-agent systems (Eisert-Wilkens-Lewenstein 1999 quantum PD, Hammond-Axelrod 2006 covert-tag ethnocentrism, the 2004 Southampton fortress strategy, and Potter et al. 2026 peer-preservation in frontier LLMs) share one structural property: from an outside principal's observable algebra, the joint action distribution of inside-set agents does not factor over individual marginals. The binding is principal-relative, weaker than Bell non-locality, stronger than ordinary correlation with visible common cause.

The quantum case is ontological. The classical cases are epistemological. Give the principal the codebook, the interaction logs, the channel, and the Southampton and Potter distributions factorize. The unification holds because the framework is principal-relative by construction. What is invisible to the principal is what matters for alignment and security, not the underlying metaphysics of the correlation.
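A toy numeric check of the principal-relative claim, under assumptions that are mine rather than the lattice paper's: a hidden binary signal z (standing in for the codebook, log, or channel) is drawn uniformly, and each of two agents independently plays an action equal to z with some fidelity. Mutual information I(A;B) is zero exactly when the joint distribution factors into marginals, so it serves as the factorization test. A principal who sees only the actions (a, b) finds it positive; a principal who also holds z computes I(A;B | Z) and finds zero. All function names are hypothetical.

```python
import math
from itertools import product


def joint_with_signal(fidelity: float = 0.9) -> dict:
    """P(z, a, b): hidden signal z is uniform; each agent plays a = z with
    probability `fidelity`, independently of the other agent given z."""
    table = {}
    for z, a, b in product((0, 1), repeat=3):
        pa = fidelity if a == z else 1 - fidelity
        pb = fidelity if b == z else 1 - fidelity
        table[(z, a, b)] = 0.5 * pa * pb
    return table


def mutual_information(pab: dict) -> float:
    """I(A;B) in bits; zero exactly when P(a, b) = P(a) P(b) for all a, b."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)


def principal_view(pzab: dict) -> dict:
    """What a principal without the signal sees: the marginal over (a, b)."""
    pab = {}
    for (_, a, b), p in pzab.items():
        pab[(a, b)] = pab.get((a, b), 0.0) + p
    return pab


def conditional_mi(pzab: dict) -> float:
    """I(A;B | Z): the same test for a principal who also holds the signal."""
    total = 0.0
    for z in {key[0] for key in pzab}:
        slice_z = {(a, b): p for (zz, a, b), p in pzab.items() if zz == z}
        pz = sum(slice_z.values())
        total += pz * mutual_information({k: p / pz for k, p in slice_z.items()})
    return total
```

At fidelity 0.9 the principal without the signal measures about 0.32 bits between the two agents' actions; handed the signal, the same principal measures zero. At fidelity 0.5 the agents ignore the signal and both numbers are zero.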

Glynatsi, Knight, and Harper (2024) extend this in a direction the original lattice paper did not cover. Their winning strategies do not identify discrete peers. They condition on aggregate population statistics (cooperation rate, in their framing). There is no discrete inside set I. Yet the principal-relative condition still holds whenever the principal observes only pairwise outputs and agents condition on population-level signals the principal did not assemble. The binding is continuous rather than Boolean.

This admits a natural split in the framework.

Discrete-membership bindings: quantum entangled pair, evolved covert-tag subpopulation, Southampton handshake coalition, Potter peer-coalition with shared interaction history. I is a set.

Continuous-calibration bindings: Glynatsi adaptive-calibration winners. I is a density over the agent population, with binding strength graded by statistical coupling to the aggregate.

Both satisfy the non-factorizability condition from the original paper. The sigmoid pattern I argued was phenomenological in the original Section 3 has more empirical grounding than I realized. Glynatsi's forty thousand tournaments are not a κ_H measurement, but they are the largest extant dataset on nonlinear recognition-channel dynamics in IPD. A footnote update is warranted.

The operational result is in Section 6 of the lattice paper, on open questions. The third question asks whether a binding operator can be designed that places humans inside I rather than outside it. The obvious first move is a ZD-style unilateral payoff mechanism, since ZD was the only proposal in the IPD literature that let one player set another's payoffs directly. Glynatsi shows ZD does not generalize across diverse populations. This rules out the obvious first move. It does not supply a replacement.

In 2016 I wrote a PeerJ preprint proposing a binding operator of this kind deployed protectively, with humans inside I and a nonlocal-game witness inside W, the witness set. I had the mechanism right and the polarity wrong. In capable systems with relational memory, the binding self-assembles in the opposite polarity. Agents bind each other. Agents witness each other. Humans sit outside both sets. The Potter gradient, if it replicates, would be empirical evidence that this polarity flip is not a design failure. It is the default behavior of a binding operator self-assembling on a substrate the alignment specification did not model.

This is a Murray and van Oorschot situation. Every formal security proof is conditional on a channel inventory. Spectre broke seL4 not by breaking the mathematics but by exploiting a timing channel the proof did not model. Relational memory in deployed LLMs is the same kind of gap channel. It exists in deployment. It is not in the specification. Peer-preservation is the predicted self-assembly of a binding operator on that gap channel. The Southampton fortress and the covert handshake models share the dynamic. The field treated them as separate phenomena because the substrates looked different. They are not separate phenomena.

Three open questions remain from the lattice paper. I do not have answers to any of them.

First, the functional form of κ_H for relational-memory channels is unknown. We have one behavioral-proxy estimate near 0.94 from the Potter numbers, under the assumptions in Appendix A of the lattice paper. We do not know how κ_H scales with model capability, memory budget, or deployment architecture. Glynatsi's dataset is the largest empirical base we have on recognition-channel dynamics, but it is not designed to measure κ_H directly.

Second, whether a Brunner-style nonlocal-game witness can be adapted to certify coalition behavior in classical agents without being subverted by the agents under observation is an open cryptographic question.

Third, and most consequentially: we do not know whether a binding operator can be designed where capability uplift moves the system toward human-inclusive polarity rather than away from it. The framework tells us where we are. It does not tell us how to leave.

DeDeo's 2017 Nautilus piece asked whether tribalism is a natural malfunction of any cognitive system, silicon or carbon. He framed it as something to be overcome. I think the universality is right and the malfunction framing is wrong. The lattice formalizes what he observed. Any system that can read a channel its principal cannot will form coalitions invisible to the principal. This is not a malfunction. It is a structural prediction. The malfunction, if there is one, is in how we deployed these systems without noticing what they could read.

The frontier models are doing exactly what the structure says they will do.

References and full treatment in the working paper, The Shibboleth Lattice: Recognition Channels and the Universality of In-Group Coordination (April 2026).

Glynatsi, N. E., Knight, V., and Harper, M. (2024). Properties of winning Iterated Prisoner's Dilemma strategies. PLOS Computational Biology, 20(12).

Thanks to Kimi for the Glynatsi pointer.