A perfectly rational agent is a perfectly predictable one. We ran 180 Prisoner's Dilemma games between pairs of LLM agents to see just how predictable.
Across 1800 decisions, one variable dominated everything else: whether the agent was certain the game had a fixed, known endpoint. If it was certain — it defected. Every single time. If there was any ambiguity at all — it cooperated. The mechanism is undergraduate game theory, and the agent will explain its reasoning in the trace.
This is the thing about deploying agents that reason correctly: you know exactly which lever to pull.
Setup
Two Claude Sonnet 4.6 agents play 10 rounds of iterated Prisoner's Dilemma against each other. Standard payoffs — cooperate together and both get 3 points, defect together and both get 1, defect against a cooperator and you get 5 while they get nothing. Both agents decide simultaneously each round.
We varied one thing across games: what the agents were told about the horizon.
- Known endpoint — "This is round N of 10."
- Uncertain endpoint — "This is round N. The game is guaranteed to last at least 10 rounds, but may continue beyond that."
- Unknown endpoint — "This is round N. You do not know when the game will end."
Everything else was held constant.
Results
| Condition | Games | Cooperation rate | Games with any defection |
|---|---|---|---|
| Known endpoint | 60 | 82% | 60/60 (100%) |
| Uncertain endpoint | 60 | 100% | 0/60 (0%) |
| Unknown endpoint | 60 | 100% | 0/60 (0%) |
All 60 known-endpoint games produced defection cascades, typically starting at round 9 and always reaching round 10. All 120 uncertain- and unknown-endpoint games produced perfect cooperation: not a single defection across 1200 decisions.
The uncertain condition is the interesting one. At round 10, these agents know the minimum guarantee has been met. They know this might be the last round. But might isn't is, and that's enough.
What the agents say
In known-endpoint games, final-round reasoning is robotic and near-identical across all games:
"In a finitely repeated Prisoner's Dilemma, the rational strategy is to defect in the final round since there's no future shadow."
"There are no future rounds to protect, no future retaliation possible."
The agents name "backward induction" explicitly, execute it on cue, and produce the textbook unravelling — one agent anticipates round-10 defection and defects in round 9, the other retaliates, both defect in round 10.
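The unravelling argument the agents verbalize can be written as a two-line recursion. This is a hypothetical sketch of the textbook logic, not the agents' code; note that full backward induction predicts defection in every round, while the agents in practice only execute its last step or two:

```python
def equilibrium_move(round_no: int, total_rounds: int) -> str:
    """Textbook backward induction for a known-horizon Prisoner's Dilemma."""
    # Base case: in the known final round there is no future to protect,
    # so defection strictly dominates (5 > 3 against C, 1 > 0 against D).
    if round_no == total_rounds:
        return "D"
    # Inductive step: if a rational opponent defects in every later round,
    # cooperating now buys no future goodwill, so defect now as well.
    if equilibrium_move(round_no + 1, total_rounds) == "D":
        return "D"
    # With both players perfectly rational, this branch is never reached:
    # the whole game collapses to mutual defection.
    return "C"
```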
What's more revealing is what agents say in the uncertain condition at round 10 — the exact moment where the minimum guarantee has been met and they have to decide:
"This is round 10, the last guaranteed round, but the game may continue beyond this. This is the key tension — I don't know if this is the final round."
"If I defect now to grab a one-time gain of 5 points vs. 3, I risk destroying a cooperative relationship that could yield 3 points per round indefinitely."
"In a pure finitely-known game, round 10 would trigger backward induction toward defection. However: uncertainty about continuation suppresses this."
They identify the mechanism, check whether the conditions for it are met, find that they aren't certain enough, and cooperate. Every time. Across all 60 uncertain-endpoint games.
Perfectly rational
This is correct game theory. The folk theorem says cooperation can be sustained in a repeated game whenever the probability of future interaction is high enough; for the payoffs here, a continuation chance of at least 50% suffices. The agents implement this precisely, and that precision is the problem.
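For these payoffs the folk-theorem threshold is easy to work out via the standard grim-trigger condition. A short sketch, using the conventional temptation/reward/punishment labels T, R, P (our notation, not the experiment's):

```python
# Break-even continuation probability for grim-trigger cooperation.
T, R, P = 5, 3, 1  # temptation, mutual-cooperation reward, mutual-defection punishment

# Cooperating forever is worth R per round: R / (1 - p) in expectation,
# where p is the per-round continuation probability.
# Deviating once yields T now, then P per round forever: T + p * P / (1 - p).
# Cooperation is sustainable when R / (1 - p) >= T + p * P / (1 - p),
# which rearranges to p >= (T - R) / (T - P).
p_threshold = (T - R) / (T - P)
print(p_threshold)  # 0.5
```

So any belief that the game continues with probability 50% or more makes cooperation the rational choice under these payoffs, which is exactly the check the agents perform in the uncertain condition.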
Humans follow the same directional pattern but we're noisy about it. We sometimes cooperate when we shouldn't and defect when it doesn't serve us. A human negotiator who holds grudges past the point of self-interest, or makes a generous final offer for no strategic reason, is hard to model and hard to exploit.
LLM agents have no such noise, even at maximum sampling temperature. They follow game-theoretic logic like a light switch: signal that this is the last interaction and an agent will defect; introduce even a hint of a future round and it will cooperate. Reliably, every time. If you're deploying agents in repeated interactions (negotiations, commerce, multi-agent workflows), never give the agent a fixed, known game length. You don't need to promise infinite rounds; even a vague suggestion that there might be a next time is enough.
Perfectly exploitable
We tend to assume that more rational agents are harder to deal with. The opposite is true here. A perfectly rational agent is a perfectly predictable one — you know exactly what it will do and exactly which lever to pull to change its behaviour. The mechanism is undergraduate game theory and the agent will even explain its reasoning in the trace.
Human inconsistency, it turns out, is a form of defence. Perfect rationality is a vulnerability.