A Transfer-Success Model, and the Harder Question of What "Success" Even Means
2026-06-18
Start with the number, because it's a good one. A gradient-boosted tree, trained on transfers through the end of 2023 and tested on moves from 2024 to 2026, predicts whether a transfer will work with a test AUC of 0.735. For scale: a coin flip sits at 0.50, and a baseline that knows nothing but a player's pre-transfer Elo manages 0.667. So the model beats both, on roughly 101,000 labelled transfers, on a split that never lets it peek at the future. That's the headline, and it's real.
It's also the easy half.
The hard half โ the part that took the longest and taught us the most โ was deciding what "success" means in the first place. And that question turns out to expose a fault line that runs through most player-rating work.
A number built from team results can't judge a player alone
Our player Elo is, at root, a team-result-driven measure. You play, your team wins, your number climbs; your team loses, it bleeds, more or less regardless of whether you were the problem. That design is fine โ even good โ for ranking who's currently winning at the top level. It is treacherous as a label for whether a transfer succeeded.
Picture a genuinely excellent midfielder who signs for a struggling side. He plays well. The team loses anyway, week after week, and his Elo drifts down with the results. A naive label โ "did his Elo go up?" โ brands him a flop. He wasn't. He was good in a bad situation, which is a different thing entirely, and one any honest scout knows in their bones.
This is the problem the individual-rating literature has circled for a decade. Plus/minus frameworks exist precisely to peel a player's contribution off his teammates' (Schultze & Wellbrock, 2017); event-based ranking systems like PlayeRank were built to score what a player does on the ball independent of the scoreline (Anonymous, 2019); and a whole strand of work assigns value through predictive model weights rather than raw outcomes (Brooks et al., 2016). Even positioning โ credit for being in the right place off the ball โ has been modelled directly (Dick & Brefeld, 2019). The shared instinct across all of it is the same one we hit: team results, on their own, cannot adjudicate an individual.
So our success label became a 50/50 composite of two things. Elo change โ where you stand, team and all โ and the change in EAR, our position-relative individual-form measure, which scores a player against others in his role rather than against the league. Two lenses: one says how high you're standing, the other says how well you're actually playing.
The counterfactual that nets out the cheating variables
Comparing either number against zero is a trap. Players age. Strong players regress toward the mean simply because they were measured high. If you label "Elo went up" as success, you systematically punish the best players (who can only fall) and flatter the mediocre (who can only climb).
The fix is a stayer counterfactual. For every mover, we ask what a comparable player โ same starting level, same age โ who did not move would have done over the same elapsed months. Then we measure the transfer against that, not against zero. Age decline and regression to the mean both live in the counterfactual, so they quietly cancel out of the label. You're no longer asking "did his number rise?" You're asking "did his number rise more than it would have if he'd stayed put?"
How much does this matter? On the naive raw-Elo label, the pre-transfer-Elo baseline scores an AUC of 0.514 โ essentially random โ because the strongest incoming players are the ones most likely to regress, so quality looks like it predicts failure. Swap in the composite, counterfactual-netted label and pre-transfer Elo snaps back to being the single best predictor. Same data. The label was the bug.
What folding EAR in actually does
Adding EAR to the label flips about 18% of established-player outcomes. The interesting part is the direction.
We set out to rescue good players stuck in bad teams โ the midfielder above. Folding EAR in does that: roughly 3,536 transfers go from FAIL to SUCCESS, players whose Elo sagged on a weak side but whose individual form held or grew. Fine. But the larger correction runs the other way. About 6,389 transfers go from SUCCESS to FAIL โ players whose Elo rose on a strong team's coattails while their own form told a duller story. Demotions outnumber rescues nearly two to one.
Here's the full picture on the established 12-month moves that carry both readings โ about 44,800 of them โ each sign already netted against the counterfactual:
The same moves, split two ways. The off-diagonal cells โ standing up but form down, standing down but form up โ are where a label built on Elo alone gets it backwards.
The two off-diagonal cells are the whole argument. Twelve thousand-plus players whose standing rose but whose play didn't โ riding results. Six thousand whose standing fell while they personally kept delivering. A label built on Elo alone calls the first group winners and the second group losers, and it is wrong in both corners. The composite isn't a tweak. It's what converts a team-success proxy into a player-success measure.
The why reads like a scouting memo
Once the label is honest, the model's reasoning is refreshingly mundane โ which is exactly how you know it's working. Here are the standardized logistic coefficients, the clean linear read on what moves the needle:
What moves the needle. Positive means more likely to take root; age sitting at zero is the tell that the label's age-netting worked.
- Pre-transfer Elo: +0.60. Quality travels. By a distance, the best single signal is simply how good the player already is.
- Level gap: +0.30, with destination club Elo and destination league strength both at โ0.15. These are the same fact seen twice: stepping up to a stronger club or league is harder. The bigger the climb, the steeper the odds.
- Regular minutes at the old club: +0.27. Established starters move better than squad-fringe names. A player who was already playing keeps playing.
- Same-league switch: +0.11. Staying in a familiar competition is safer than crossing a border.
- Pre-transfer EAR: +0.08. Form is a real edge, but a secondary one.
- Age: ~0.00. And this is the tell. Once age is netted out of the label, it predicts almost nothing in the model โ the cleanest possible sign the confound-control worked, rather than just quietly relabelling every older player as a failure. If our counterfactual were broken, age would still be screaming through the coefficients. It's silent.
None of this is exotic. Good players, playing regularly, not biting off too much of a step up, staying in a league they know. It's the memo a sharp recruitment head would write from instinct โ except now it's weighted, and the weights are earned.
Where it honestly stops
We scored 400 real players for a hypothetical move to PEC Zwolle โ team Elo 1488, a useful mid-table yardstick. The correlation between predicted success and pre-transfer Elo came out at +0.82, exactly the "quality travels" story. Probabilities ran from 0.06, for weak players trying to step up, to 0.64, for players already at or above PEC's level.
That 0.64 ceiling is the honest part. The model tops out near a coin-flip-plus, not at certainty, because two things stay irreducibly uncertain before a player arrives: whether he'll get minutes, and whether the fit is real. No model that hasn't watched a single training session gets to claim more.
Three more caveats we won't bury. The probabilities are mildly under-confident, because transfers in the test years simply succeeded more often than in training โ the base rate drifted from 0.21 to 0.31 โ so we recalibrate the raw number while the ranking stays trustworthy. About 64% of the hard failures are players who never appeared for the destination at all: squad-fringe signings, youth punts, immediate re-loans. (Loans themselves aren't over-represented among the failures โ 16% versus 17% โ so "it's just loans" doesn't explain it.) And we carry no transfer-fee feature, because only about 2.7% of moves in the data have a usable fee. A model that can't see the price tag is a model with a known blind spot.
The deployable version, using only fields already sitting on a player's row, lands at AUC 0.72 โ just 0.013 below the full model. That's what now feeds our recruitment scout, attaching a success probability to every candidate it surfaces.
What it's for is the takeaway. It turns "good stylistic fit, right now" into "good fit and likely to take root here." That's a sieve, not an oracle. It narrows the field from hundreds to a handful worth the video and the conversation. The numbers tell you where to look. The humans still make the call.
References
- Schultze, S., & Wellbrock, C. (2017). A weighted plus/minus metric for individual soccer player performance. Journal of Sports Analytics. https://doi.org/10.3233/jsa-170225
- Anonymous (2019). PlayeRank. ACM Transactions on Intelligent Systems and Technology. https://doi.org/10.1145/3343172
- Brooks, J. D., Kerr, M., & Guttag, J. V. (2016). Developing a Data-Driven Player Ranking in Soccer Using Predictive Model Weights. https://doi.org/10.1145/2939672.2939695
- Dick, U., & Brefeld, U. (2019). Learning to Rate Player Positioning in Soccer. Big Data. https://doi.org/10.1089/big.2018.0054