Negotiation is one of the most economically consequential things people do. Salaries, contracts, partnerships, acquisitions: the outcomes of these conversations compound over years. It's also one of the last domains where the obvious question, "can a model do this better than a person?", doesn't yet have a clear answer.
On the surface, language models seem well suited to it: they can produce negotiation-flavoured text that sounds entirely credible. They know what an opening offer sounds like, what a concession sounds like, what a firm refusal sounds like. Ask one to roleplay a procurement negotiation and it will pass a surface-level inspection.
The problem is that when you actually measure outcomes (how much value each side captured, whether the deal reached was the best one available given both parties' interests), the results are often significantly worse than what a competent human negotiator achieves.
This post is about why that gap exists, and why closing it is harder than it first appears. We'll follow a single example throughout: two agents negotiating a software contract renewal. The buyer wants a lower price and better support. The seller wants a higher price and a longer commitment. Neither knows what the other is actually willing to accept.
Fluency is not strategy
There is a difference between producing the right words and making the right decisions.
A human negotiator who says "I can move a little on price if you can meet me on support" has already done significant work before speaking. They've decided that price is more flexible for them than support, inferred that the counterpart appears to value support less, and calculated that a trade on these terms moves both parties toward a better outcome. The sentence is the output of a decision, not the decision itself.
Language models don't reliably make that calculation. They produce the sentence, and the words arrive in the right order for roughly the right reason, but the underlying arithmetic - is this trade actually worth making? - isn't being done. When you score the outcomes, this shows up clearly: deals that leave significant joint value on the table, concessions made in the wrong order, and mutually beneficial trades that were never identified.
Poker is the starkest illustration. Ask a language model to discuss poker strategy and it sounds authoritative. Actually sit it down at a table (as many have tried) and it plays poorly - miscalculating pot odds, sizing bets incorrectly, failing to track what an opponent's betting pattern implies about their hand as the rounds progress. The model has absorbed everything ever written about poker; none of it translates into actually playing well. Dedicated systems beat the world's best players years ago, but they did it through probability and inference, not language. Language models have not caught up, because the strategic core of poker isn't a language problem.
Negotiation has exactly the same structure. Closing the gap requires training, and training on negotiation turns out to involve a sequence of compounding difficulties.
The opponent who won't stay still
Models have already reached superhuman performance in Chess and Go using a technique called self-play: two versions of the model play against each other, outcomes are scored, and the model is updated toward better moves. Repeat this millions of times and the model eventually surpasses every human who has ever played. AlphaGo Zero went from random play to beating every human and every previous version of AlphaGo, without ever seeing a single human game.
It works in Chess and Go because the environment is fixed. The board doesn't change its strategy when you improve, a win is unambiguous, and both players see the same board.
Negotiation breaks all three of those properties. The most consequential is the first: the other side adapts. The moment you train an agent through self-play, both agents update simultaneously. Agent A learns to be slightly more aggressive; Agent B adapts to that; Agent A adapts back. Rather than converging on better strategy, training can cycle indefinitely - each version learning to exploit a weakness in the other that disappears as soon as it's found. We observed this directly: early self-play runs produced agents that were locally optimal against their training partner and brittle against anything else.
The solution is to train against a diverse population of past checkpoints, not just the current opponent. But this changes what you're optimising for in a subtle way: the model is no longer learning to beat one opponent - it's learning to be robust across many. Progress appears more slowly, the training signal is noisier, and the infrastructure required is considerably more complex than a simple two-agent loop.
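The population idea can be sketched in a few lines. This is a minimal illustration, not a real training API - `CheckpointPool` and the policy objects are hypothetical stand-ins for whatever snapshot format the training loop uses:

```python
import random

class CheckpointPool:
    """Pool of frozen past policies to sample opponents from.

    Minimal sketch: policy objects here are opaque stand-ins.
    """

    def __init__(self, max_size=20):
        self.checkpoints = []
        self.max_size = max_size

    def add(self, policy_snapshot):
        self.checkpoints.append(policy_snapshot)
        if len(self.checkpoints) > self.max_size:
            self.checkpoints.pop(0)  # retire the oldest snapshot

    def sample_opponent(self, current_policy, p_current=0.5):
        # Mix the live policy with frozen past versions so training
        # pressure comes from many strategies at once, not from a
        # single co-adapting opponent.
        if not self.checkpoints or random.random() < p_current:
            return current_policy
        return random.choice(self.checkpoints)
```

Each training episode draws its opponent from the pool, so an exploit that only works against the newest version stops paying off.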
Reading between the offers
Even with a stable training environment, there is a deeper problem that doesn't have an equivalent in Chess or Go: you cannot see the other side's utility function.
Back to our software contract: the buyer cares most about support quality and price; the seller cares most about contract length and getting paid quickly. Neither knows this about the other. If both agents simply trade concession for concession on every issue - give a little on price, give a little on support, give a little on contract length - they end up at a mediocre outcome. A middling deal on every dimension is fine, but not good.
The actually good deal looks different. The buyer gives the seller a longer contract and faster payment - things the buyer barely cares about. The seller gives the buyer better support and a lower price - things the seller can afford because their margins are healthy. Both parties end up better off than the split-down-the-middle outcome would have given them. This is called logrolling: trading concessions across issues where the two sides' priorities diverge.
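The arithmetic behind logrolling is easy to make concrete. A small sketch with made-up issue weights (all numbers are illustrative, not from any real scenario): each issue is scored from 0 to 1 from the buyer's perspective, and the seller receives the complement.

```python
# Illustrative weights: each side weights the four issues differently.
BUYER_WEIGHTS  = {"price": 0.4, "support": 0.4, "length": 0.1, "payment": 0.1}
SELLER_WEIGHTS = {"price": 0.2, "support": 0.1, "length": 0.35, "payment": 0.35}

def utilities(deal):
    """Each issue scored 0..1 for the buyer; the seller gets 1 - x."""
    buyer = sum(BUYER_WEIGHTS[i] * x for i, x in deal.items())
    seller = sum(SELLER_WEIGHTS[i] * (1 - x) for i, x in deal.items())
    return round(buyer, 2), round(seller, 2)

# Split-down-the-middle: meet halfway on every issue.
compromise = {i: 0.5 for i in BUYER_WEIGHTS}
# Logrolled: each side wins the issues it weights heavily.
logrolled = {"price": 0.8, "support": 0.8, "length": 0.2, "payment": 0.2}

print(utilities(compromise))  # (0.5, 0.5)
print(utilities(logrolled))   # (0.68, 0.62) - both sides gain
```

The logrolled deal is better for both parties than the compromise, even though every individual issue is further from the midpoint.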
You can only logroll if you know what the other side values. They won't tell you directly and you have to infer it from the pattern of their proposals: what they protect even when conceding elsewhere, what they move on quickly, what they return to again and again. That inference has to be maintained and updated across the entire conversation - turn by turn - and then acted on in subsequent proposals, just like in poker.
An agent that can't do this closes deals, but the wrong ones. It misses the trades that would have moved both parties closer to the best possible joint outcome and leaves money on the table - or loses it.
Defining what winning means
Suppose the first two problems are solved. There's a third, and it's easy to get wrong in ways that only become visible much later: you have to define what you're actually training the model to maximise.
Almost every obvious choice is wrong.
Train it to maximise its own outcome and it becomes aggressive - it anchors at extreme positions, pushes hard, and walks away from deals it should accept. Its average score on the deals it closes looks good, but it closes far fewer deals, and the ones that collapse into no-agreement are bad for everyone.
Train it to always close a deal and it becomes a pushover - accepts the first reasonable offer, closes reliably, leaves enormous value on the table every time.
The right objective is more subtle: maximise outcomes where both parties end up meaningfully better off than they would have been with no deal at all. Deals where both sides gain substantially above their outside options are the deals that actually get done, and get done near the best possible joint outcome. A model trained on this objective learns something real: that the path to a strong outcome for itself runs through proposals the other side can genuinely agree to.
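One plausible way to encode that objective is to reward only deals where both sides clear their outside options, weighting the model's own surplus most heavily. This is a sketch of the idea, not an exact reward function - the 0.5 weight and the inputs are illustrative assumptions:

```python
def reward(own_value, other_value, own_batna, other_batna):
    """Score a finished negotiation for one agent.

    Sketch only: `batna` is each side's no-deal outside option, and
    the 0.5 weight on the counterpart's surplus is an illustrative
    choice, not a derived constant.
    """
    if own_value is None:  # the negotiation ended with no deal
        return 0.0
    own_surplus = own_value - own_batna
    other_surplus = other_value - other_batna
    if own_surplus <= 0 or other_surplus <= 0:
        return 0.0  # a deal worse than walking away earns nothing
    # Reward own gain, but only when the counterpart also gained -
    # pushing the model toward proposals the other side can accept.
    return own_surplus + 0.5 * other_surplus
```

Under this shaping, the aggressive agent loses its no-agreement collapses and the pushover loses its below-BATNA acceptances; both failure modes score zero.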
Getting this wrong doesn't just produce a mediocre negotiator; it produces a systematically broken one whose failure modes are invisible until it plays against opponents with different failure modes.
Twenty turns, one signal
Even with the right objective, the model faces a fundamental learning problem: a negotiation is fifteen to twenty turns long, but the outcome (good deal, bad deal, no deal) arrives as a single signal at the very end.
The model has to infer, from that one signal, which of its decisions across all those turns actually caused the outcome. This turns out to be extremely difficult.
Consider the opening anchor - the first complete proposal either side puts on the table. Decades of negotiation research show that whoever sets the first number shapes the reference point for the entire conversation that follows. The anchor at turn one likely has more effect on the final outcome than any other single decision. But in a twenty-turn game, that decision is twenty steps away from the score. Without careful design, the model receives the same training signal on that critical first proposal as on every subsequent turn - and learns correspondingly slowly.
The same problem applies to concession timing. Knowing when to hold firm, when to give ground, and when to accept requires reasoning about the trajectory of the whole negotiation, not just the current turn. A concession that's right at turn twelve can be a mistake at turn five. The terminal signal doesn't say where the turning point was.
This is the classic credit assignment problem. It's the primary reason that simply running large numbers of negotiations and updating on the outcomes is less efficient than it looks. The model generates a lot of experience and learns from it slowly, because the signal it receives is too coarse to pinpoint which decisions mattered.
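The dilution is easy to see in a sketch. With a terminal-only reward and no discounting, every turn receives exactly the same return, so the update cannot tell the decisive anchor apart from a routine mid-game concession; discounting makes the early signal weaker, not sharper:

```python
def returns_from_terminal(reward, num_turns, gamma=1.0):
    """Per-turn return when the only reward arrives at the end.

    Illustrative only: standard discounted-return arithmetic with
    no intermediate rewards anywhere in the episode.
    """
    return [reward * gamma ** (num_turns - 1 - t) for t in range(num_turns)]

print(returns_from_terminal(1.0, 5))             # [1.0, 1.0, 1.0, 1.0, 1.0]
print(returns_from_terminal(1.0, 5, gamma=0.9))  # turn one gets the weakest signal
```

Either way, nothing in the per-turn numbers reflects which decision actually caused the outcome - that information has to come from somewhere else in the algorithm.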
Knowing when it's actually working
Every problem above eventually surfaces as the same question: how do you know if your model is getting better?
This is harder than it sounds. Negotiation outcomes are inherently noisy - run the exact same scenario twice with the same two agents and you get different results, because both agents have randomness in how they respond. A model that is genuinely ten percent better will not reliably look ten percent better on any individual game, or even on ten games. You need enough repetitions, across enough different scenarios, with enough statistical rigour to separate real improvement from variance.
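A back-of-the-envelope sketch shows the scale involved. Assuming a per-game score noise of 0.30 standard deviations and a true edge of ten percent (both made-up illustrative figures), the standard two-sample bound gives the number of games each condition needs before the edge is resolvable at the 95% level:

```python
def games_needed(true_edge=0.10, noise_sd=0.30, z=1.96):
    """Smallest n with z * sqrt(2 * sd^2 / n) < true_edge.

    Back-of-the-envelope two-sample sizing; every number here is an
    assumption for illustration, not a measured property of any model.
    """
    n = 1
    while z * (2 * noise_sd ** 2 / n) ** 0.5 >= true_edge:
        n += 1
    return n

print(games_needed())  # 70 games per condition under these assumptions
```

Even under these generous assumptions, a ten percent improvement takes dozens of games per scenario to detect; multiply that across a varied scenario set and the totals grow quickly.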
There's also a more fundamental question: better at what? A model that extracts more value for itself might be doing so at the expense of deal quality or deal rate. A model with a higher deal rate might be accepting worse terms. The right metric has to capture the quality of the outcome as a whole - not just one dimension of it. Choosing the wrong metric means optimising for the appearance of progress.
We've found that evaluation infrastructure is the thing most easily deferred and most expensive to defer. Without a rigorous, continuous measurement system built before training begins, you run the training loop without knowing whether anything is working. You can't diagnose when progress stalls, can't tell which changes helped, and can't know when you have a model worth deploying.
Rencom
None of these problems are insurmountable in isolation. The self-play instability problem has known engineering solutions. Opponent inference is tractable with the right scaffolding. The reward function has a theoretically grounded answer. Credit assignment can be improved with better training algorithms. Evaluation is at its core a statistics and infrastructure problem.
What makes agentic negotiation genuinely hard is that all five are coupled. Solving one without the others doesn't produce a stronger negotiator - it produces an agent that is rigorous in one dimension and broken in the rest. And unlike Chess or Go, there is no established playbook for assembling them correctly. Which is why we are excited about tackling it.
If you have any questions or want to talk more, reach out at hello@rencom.ai
Arena is our framework for structured multi-issue negotiation between language model agents - covering scenario generation, self-play evaluation, and the full training pipeline from supervised warmup through reinforcement learning.