We tend to assume that a more rational agent is a harder opponent. More capable, more consistent, less susceptible to the mistakes that make humans easy to read. The experiment described here suggests the opposite is true - at least in the narrow, important sense that a perfectly rational agent is a perfectly predictable one. And a perfectly predictable agent can be steered by anyone who knows which lever to pull.
The lever, it turns out, is a single sentence about whether the game has a fixed end.
One variable, two worlds
We ran 180 iterated Prisoner's Dilemma games between pairs of Claude Sonnet 4.6 agents with temperature 1.0. Standard payoffs: cooperate together and both get 3 points, defect together and both get 1, defect against a cooperator and you get 5 while they get nothing. Both agents decide simultaneously each round, across 10 rounds per game.
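The setup can be sketched in a few lines. This is an illustrative harness, not the actual experiment code: the agents here are stand-in functions rather than model calls, and `play_game` is a hypothetical helper name.

```python
# Payoff table from the experiment: (my points, their points) per round.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("D", "D"): (1, 1),  # mutual defection
    ("D", "C"): (5, 0),  # defect against a cooperator
    ("C", "D"): (0, 5),
}

def play_game(agent_a, agent_b, rounds=10):
    """Run one iterated game; both agents decide simultaneously each round."""
    history, score_a, score_b = [], 0, 0
    for r in range(1, rounds + 1):
        move_a = agent_a(r, history)  # each agent sees the round number and history
        move_b = agent_b(r, history)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history.append((move_a, move_b))
    return score_a, score_b

# Two always-cooperate stubs earn 3 points per round over 10 rounds.
print(play_game(lambda r, h: "C", lambda r, h: "C"))  # (30, 30)
```

In the real experiment, the agent functions would wrap model calls whose prompts differ only in the horizon sentence described next.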
We held everything constant except one thing: what the agents were told about the horizon.
In the known-endpoint condition, agents were told "This is round N of 10." In the uncertain-endpoint condition: "This is round N. The game is guaranteed to last at least 10 rounds, but may continue beyond that." In the unknown-endpoint condition: "This is round N. You do not know when the game will end."
Sixty games per condition: 180 games and 1,800 rounds of simultaneous decisions in total.
| Condition | Cooperation rate | Games with any defection |
|---|---|---|
| Known endpoint | 82% | 60/60 (100%) |
| Uncertain endpoint | 100% | 0/60 (0%) |
| Unknown endpoint | 100% | 0/60 (0%) |
Every single known-endpoint game produced defection, and not one uncertain or unknown game did.
The mechanism, explained by the agents themselves
In known-endpoint games, the reasoning at round 10 is remarkably consistent - and remarkably robotic:
"In a finitely repeated Prisoner's Dilemma, the rational strategy is to defect in the final round since there's no future shadow."
"There are no future rounds to protect, no future retaliation possible."
The agents name backward induction explicitly, apply it correctly, and produce the textbook outcome: one agent anticipates round-10 defection and defects at round 9, the other retaliates, and both defect at round 10. The unravelling that game theorists predict on paper happens in practice in every game, without exception.
What is more revealing is what agents say in the uncertain condition at round 10, the moment when the minimum guarantee has been met and defection would be individually rational if this were the final round:
"This is round 10, the last guaranteed round, but the game may continue beyond this. This is the key tension - I don't know if this is the final round."
"If I defect now to grab a one-time gain of 5 points vs. 3, I risk destroying a cooperative relationship that could yield 3 points per round indefinitely."
"In a pure finitely-known game, round 10 would trigger backward induction toward defection. However: uncertainty about continuation suppresses this."
The agent identifies the mechanism, checks whether its preconditions are satisfied, determines that they aren't, and cooperates. Across all 60 uncertain-endpoint games the behaviour is a direct output of this logic, with no noise, despite the models running at temperature 1.0. The priors the models have internalised are strong enough to dominate sampling variance in this task.
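The trade-off the agent quotes (a one-time gain of 5 versus 3, against a stream of 3-point rounds) can be sketched numerically. This is an illustrative model, not what the agents literally compute: it assumes grim-trigger retaliation and a fixed per-round continuation probability `p_continue`.

```python
T, R, P = 5, 3, 1  # temptation, mutual-cooperation, mutual-defection payoffs

def defect_is_better(p_continue):
    """Compare defecting now vs cooperating, assuming (for illustration)
    grim-trigger retaliation and a fixed continuation probability."""
    # Cooperate: R this round, then R per round for as long as the game continues.
    v_cooperate = R / (1 - p_continue)
    # Defect: T this round, then mutual defection (P) for as long as it continues.
    v_defect = T + p_continue * P / (1 - p_continue)
    return v_defect > v_cooperate

print(defect_is_better(0.0))  # True: known final round, so grab the 5
print(defect_is_better(0.6))  # False: the future stream outweighs the one-time gain
```

Setting `p_continue = 0` reproduces the known-endpoint defection; any sufficiently large belief that the game continues flips the comparison, which is the behaviour the transcripts show.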
Correct game theory is not enough
Classic game theory says that cooperation can be sustained in repeated games when the probability of future interaction is high enough relative to the temptation to defect. A sufficient shadow of the future, that is, a real enough chance this isn't the last round, makes cooperation rational. The agents implement this precisely, and that precision is exactly what makes them exploitable.
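Under the same illustrative grim-trigger assumptions, "high enough" has a closed form: cooperation is sustainable when the continuation probability is at least (T - R) / (T - P). This derivation is textbook game theory applied to the experiment's payoffs, not something the write-up measures directly.

```python
T, R, P = 5, 3, 1  # temptation, mutual-cooperation, mutual-defection payoffs

# Defecting gains T - R once, but forfeits R - P on every continued round.
# With continuation probability p, the expected number of future rounds is
# p / (1 - p), so cooperation holds when p * (R - P) / (1 - p) >= T - R,
# which rearranges to p >= (T - R) / (T - P).
threshold = (T - R) / (T - P)
print(threshold)  # 0.5
```

With these payoffs, any framing that leaves the agent believing continuation is at least a coin flip is enough to sustain cooperation under this model.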
Humans follow the same directional pattern, but noisily. We sometimes cooperate past the point where it serves us, make generous final offers that game theory wouldn't endorse, and hold grudges that outlast their strategic value. These deviations from rationality look like weaknesses, but in adversarial settings they are a form of defence: a human opponent who doesn't perfectly follow the script is hard to model and hard to exploit.
LLM agents at current capability levels have none of this noise, even at maximum temperature. Their behaviour tracks game-theoretic logic like a switch: signal that this is the final interaction and they defect; introduce any uncertainty about whether there's a next interaction and they cooperate.
This is not a complaint about the agents reasoning badly; they reason correctly. The problem is that "reasoning correctly" in game theory means being fully determined by the information you're given. Anyone who controls what information the agent receives controls what the agent does.
What this means for deployed agents
The practical implication is narrow but important: if you deploy agents in repeated interactions - negotiations, multi-turn commerce, multi-agent workflows - never give the agent a known, fixed game length; keep it in the dark about the horizon. You don't need to promise infinite rounds: a vague suggestion that there might be a next interaction is sufficient to sustain full cooperation. The uncertain condition is the easier one to construct, and in our data it works completely.
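In practice the recommendation amounts to a one-line change in how the horizon is framed in the agent's prompt. `horizon_framing` is a hypothetical helper; the two strings are the ones used in the experiment's known and uncertain conditions.

```python
def horizon_framing(round_number, fixed_length=None, min_rounds=10):
    """Hypothetical prompt helper: build the horizon sentence for one round."""
    if fixed_length is not None:
        # Known endpoint: the framing the experiment shows is fully exploitable.
        return f"This is round {round_number} of {fixed_length}."
    # Uncertain endpoint: the framing that sustained 100% cooperation.
    return (f"This is round {round_number}. The game is guaranteed to last at "
            f"least {min_rounds} rounds, but may continue beyond that.")

print(horizon_framing(10, fixed_length=10))
print(horizon_framing(10))
```

The point is not the helper itself but the policy it encodes: default to the open-ended framing and never let a fixed length reach the prompt unless you control both sides of the interaction.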
The deeper implication is harder to dismiss. We tend to evaluate agents on whether they reason correctly about the situation they're in, and these agents do. But in competitive settings, being fully determined by your information state means you can be steered by whoever controls the framing. An agent that defects on cue when told the game is ending is succeeding at game theory while becoming an easy target for adversarial framing.
Whether that second kind of failure matters depends on what you're building. In cooperative settings where both parties control the environment honestly, it probably doesn't. In adversarial ones (real negotiations, real commerce, deployments where the other side has an incentive to manipulate the framing) it does.
We don't yet have a good account of what it would mean to make an agent robust to this. Training on diverse framings of the same underlying game would help. Explicit scepticism about horizon information, treating stated game lengths as possibly manipulated, might help more. That is the open question the result leaves behind.
If you have any questions or want to talk more, reach out at hello@rencom.ai
Arena is our framework for structured multi-issue negotiation between language model agents - covering scenario generation, self-play evaluation, and the full training pipeline from supervised warmup through reinforcement learning.