When you train a model through self-play, every decision compounds. The model plays thousands of games, updates toward whatever produced better outcomes, and slowly encodes a strategy into its weights. Get the architecture wrong and you can fix it. Get the data wrong and you can regenerate it. Get the reward function wrong and you have trained a model to be good at the wrong thing.
In negotiation, the reward function is the decision we spent the most time getting right. The failure modes aren't obvious. Each wrong answer looks reasonable until you test it against an opponent it wasn't trained against. And because the errors are systematic rather than random, a model trained on the wrong objective doesn't just perform badly; it performs badly in a specific, predictable way that an adversary can exploit.
This is a post about those failure modes, why they happen, and what the right objective actually is.
First instinct: train it to win
The natural objective is to maximise the agent's own outcome. Higher score is better. Train the agent to extract as much value from the other side as possible.
Take a concrete example: a software contract renewal. The buyer wants a lower price and better support. The seller wants a higher price and a longer commitment. Each side has a walkaway point - the value of not reaching a deal at all. The buyer's is $35; the seller's is $30. An aggressive buyer agent, trained to maximise its own score, learns to anchor low, resist concessions, and hold firm. On the games it does close, it often extracts a high score. The problem is what happens on the games it doesn't close.
Every no-deal outcome sends both parties back to their outside alternatives. The buyer gets $35, the seller gets $30, and neither gets anything above those floors. An aggressive agent that walks away from deals it should accept is effectively throwing away surplus - its own included. When we factor in the no-deal games alongside the won ones, the expected outcome is poor. In our experiments, agents trained on raw score maximisation closed significantly fewer deals, and their average outcome - across all games, not just won ones - was worse than less aggressive alternatives.
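As a back-of-the-envelope sketch (the deal rates and per-deal scores below are invented for illustration, not our experimental numbers), the expected outcome across all games - closed and unclosed - can be computed directly:

```python
# Hypothetical illustration: expected buyer outcome under different
# deal rates, using the walkaway value from the contract example.
BUYER_BATNA = 35.0

def expected_outcome(deal_rate: float, avg_score_when_closed: float) -> float:
    """Average score across all games: closed deals plus no-deal fallbacks."""
    return deal_rate * avg_score_when_closed + (1 - deal_rate) * BUYER_BATNA

# An aggressive agent: high score when it closes, but closes rarely.
aggressive = expected_outcome(deal_rate=0.4, avg_score_when_closed=62.0)
# A moderate agent: smaller wins per deal, but it closes most games.
moderate = expected_outcome(deal_rate=0.9, avg_score_when_closed=55.0)

print(round(aggressive, 1))  # 45.8
print(round(moderate, 1))    # 53.0
```

With these (assumed) numbers, the moderate agent's extra closed deals more than pay for its smaller per-deal wins.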
The failure mode becomes most visible in self-play. Two aggressive agents playing each other mostly produce no-deals. Neither side yields, neither side finds an agreement, and both fall back to their walkaway points on most games. They have learned to be individually optimal against a cooperative opponent and mutually destructive against each other.
The correction that overcorrects
The obvious fix is to penalise no-deals: add deal rate to the objective so the agent is rewarded for closing agreements. But this produces the opposite problem with equal reliability.
An agent trained to close deals becomes a pushover. It accepts the first offer above its walkaway threshold, closes reliably, and leaves enormous value on the table in every negotiation. In the software contract, a pushover buyer might accept $57 per seat with basic support - barely above its BATNA, its best alternative to a negotiated agreement - when a patient, strategic buyer could have reached $48 with premium support. Both deals count as closed, but only one is good.
The subtler version - reward score conditional on reaching a deal, ignoring no-deal outcomes - is closer but still wrong. It trains the agent to settle just above the walkaway threshold, because easy-to-close deals score as well as hard-won good ones, and easy deals are, by definition, easier to close. The agent learns to hunt for the path of least resistance rather than the path of most value.
All three objectives train agents that look reasonable on naive metrics and fail in predictable ways against opponents with different failure modes. An aggressive agent playing a pushover does well, two aggressive agents produce no-deals, two pushovers close quickly but leave joint value on the table. The failure is only visible when the opponent pool is diverse - which is exactly the condition that matters in deployment.
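For concreteness, the three flawed objectives can be sketched as per-episode reward functions. Names, signatures, and the bonus constant are illustrative, not Arena's actual API:

```python
# Three flawed per-episode rewards. `score` is the agent's own outcome,
# `batna` its walkaway value, `deal` whether an agreement was reached.

def reward_raw_score(score: float, batna: float, deal: bool) -> float:
    # 1. Maximise own outcome; no-deal falls back to the walkaway value.
    #    Trains anchoring low and holding firm: strong against pushovers,
    #    mutually destructive in self-play.
    return score if deal else batna

def reward_deal_bonus(score: float, batna: float, deal: bool,
                      bonus: float = 50.0) -> float:
    # 2. Add a closing bonus. If the bonus dominates the score range,
    #    this trains a pushover that accepts the first viable offer.
    return (score + bonus) if deal else batna

def reward_conditional(score: float, batna: float, deal: bool) -> float:
    # 3. Score only closed deals, ignore no-deal outcomes. Trains
    #    settling just above the walkaway, since easy deals pay as
    #    well as hard-won good ones.
    return score if deal else 0.0
```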
What the theory says
There is a sixty-year-old result in economics that gives the right answer: the Nash bargaining solution.
Two parties each have an outside option - the value they receive if no deal is reached, their BATNA. Any deal that beats both outside options is potentially acceptable to both sides. Among those deals, which one is the right one?
Nash showed that there is a unique deal satisfying a small set of reasonable axioms - efficiency, symmetry, invariance to irrelevant alternatives - and that deal is the one that maximises the product of both parties' gains above their outside options:
(score_A − batna_A) × (score_B − batna_B)
In practice we train on the log form - the sum of log-surpluses - which is numerically better behaved and equivalent at the optimum. The intuition is in the product: both terms have to be large for the product to be large.
Return to the software contract and consider two possible outcomes. In the first, the buyer scores 60 and the seller scores 32: a good deal for the buyer, a barely-acceptable one for the seller. The Nash product is (60−35) × (32−30) = 25 × 2 = 50. In the second, both score 55: a good deal for both sides. The Nash product is (55−35) × (55−30) = 20 × 25 = 500. The more balanced deal is ten times better by this metric, even though the buyer "won" less in the first one.
A walkaway outcome sends one surplus to zero - which collapses the product entirely, regardless of what the other term is. No deal is always bad under the Nash criterion, even if you could have extracted 100% of the surplus from a deal you refused.
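The product, its log form, and the contract arithmetic fit in a few lines. This is a minimal sketch; the epsilon clamp for non-positive surpluses is one possible way to handle no-deal outcomes under the log form, not necessarily what a production pipeline uses:

```python
import math

def nash_product(score_a: float, batna_a: float,
                 score_b: float, batna_b: float) -> float:
    """Product of both parties' surpluses over their walkaway values."""
    return (score_a - batna_a) * (score_b - batna_b)

def nash_log_reward(score_a: float, batna_a: float,
                    score_b: float, batna_b: float,
                    eps: float = 1e-6) -> float:
    """Sum of log-surpluses: equivalent at the optimum, numerically
    better behaved. A non-positive surplus (no-deal, or a deal worse
    than walking away) is clamped to a large negative reward."""
    surplus_a = score_a - batna_a
    surplus_b = score_b - batna_b
    if surplus_a <= 0 or surplus_b <= 0:
        return 2 * math.log(eps)
    return math.log(surplus_a) + math.log(surplus_b)

# The two outcomes from the contract example:
print(nash_product(60, 35, 32, 30))  # 50
print(nash_product(55, 35, 55, 30))  # 500
```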
What the objective actually trains
An agent trained on the Nash product quickly learns that extracting value at the expense of the other side's surplus is self-defeating. Every point you push the counterpart closer to their walkaway reduces the second term in the product. The agent doesn't need to be taught to be cooperative; it ideally discovers that the path to a high product runs through proposals the other side can genuinely accept, which means finding the trades that make both surpluses large.
This leads naturally to logrolling. In the software contract, the buyer barely values contract length but the seller values it highly. The buyer values support tier highly but the seller barely values it. An agent optimising for the Nash product discovers that conceding on contract length - cheap for the buyer - unlocks a better support tier from the seller, increasing both surpluses and therefore the product. It doesn't need logrolling to be encoded as a rule. It emerges because logrolling is the mechanism that moves both parties toward the Pareto frontier, and the Pareto frontier is where Nash products are highest.
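A toy two-issue scoring model makes the mechanism concrete. The weights below are invented purely for illustration; they just encode the asymmetry described above (buyer cares about support, seller cares about contract length):

```python
BUYER_BATNA, SELLER_BATNA = 35.0, 30.0

def buyer_score(length_concession: float, support_tier: float) -> float:
    # Buyer barely values contract length (weight 2), values support
    # highly (weight 20). Weights are hypothetical.
    return 20.0 - 2.0 * length_concession + 20.0 * support_tier

def seller_score(length_concession: float, support_tier: float) -> float:
    # Seller values length highly (weight 15), barely values support (3).
    return 35.0 + 15.0 * length_concession - 3.0 * support_tier

def nash(length_concession: float, support_tier: float) -> float:
    return ((buyer_score(length_concession, support_tier) - BUYER_BATNA)
            * (seller_score(length_concession, support_tier) - SELLER_BATNA))

# No trade: short contract, basic support.
print(nash(0.0, 1.0))  # 10.0
# Logroll: buyer concedes a year of length, seller upgrades support.
print(nash(1.0, 2.0))  # 322.0
```

The logrolled deal raises both surpluses at once, which is exactly why the product objective rewards it.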
We also found that training on this objective is meaningfully more stable in self-play than selfish alternatives. When both agents optimise for raw score, training goes in cycles - each agent adapts to the other's aggression and neither converges. When both optimise for the Nash product, there is a well-defined equilibrium they converge toward. The Nash bargaining solution was formulated precisely as the stable cooperative outcome in bilateral bargaining, and self-play with the Nash criterion reliably finds it.
What it doesn't solve
The right objective is necessary but not sufficient.
An agent that genuinely optimises for the Nash product still has to find the deals near the Pareto frontier. That requires inferring what the other side values from the pattern of their proposals, reasoning about the arc of the negotiation, and knowing when to accept versus continue bargaining. The objective tells the agent what to aim for. It says nothing about how to get there.
In our experience, even with the Nash criterion, a model that lacks good opponent inference will miss the logrolling opportunities the objective is designed to incentivise. It will settle for a deal that looks balanced but sits inside the Pareto frontier - both surpluses positive, but not as large as they could be with a better-structured trade. The Nash product will be good but not great. The objective creates the right incentive. The training still has to develop the capability to act on it.
That capability - reading what the other side values, updating that inference across twenty turns, and constructing proposals that exploit the asymmetry - is the harder problem, and the one where the interesting work still lives.
If you have any questions or want to talk more, reach out at hello@rencom.ai
Arena is our framework for structured multi-issue negotiation between language model agents - covering scenario generation, self-play evaluation, and the full training pipeline from supervised warmup through reinforcement learning.