Methodology

Elo Ratings for Horse Racing

How RaceMetrics applies the Elo rating system — originally developed by Arpad Elo for chess — to UK and Irish horse racing across seven connection types, with empirical validation across 25+ years of data.

Simon Walton
Founder, Proform Racing & RaceMetrics
Published 27 May 2026
Abstract

RaceMetrics applies the Elo rating system to UK and Irish horse racing. Each of seven connection types (horse, trainer, jockey, owner, sire, dam, damsire) carries an independent Elo rating that updates after every race based on opposition quality, finishing position relative to expectations, and field size. Empirically, the horse with the highest weighted Combined Score wins 22.97% of races, 2.15× the random baseline, across 25+ years of Proform Racing data. This paper documents the methodology, the multi-runner adaptation from textbook Elo, and the empirical strike-rate calibration of the rating scale.

1. The Origins of Elo

By 1960, Arpad Elo, a physics professor at Marquette University and a Master-level chess player, had devised a statistical system to rate chess players. His method replaced the inconsistent class-and-norm systems that preceded it with a single numerical scale built on three ideas: every player has a numerical rating; the expected probability of one player beating another is a function of their rating difference; and after each game, ratings update by an amount proportional to how surprising the outcome was given expectations.

The system was adopted by the United States Chess Federation (USCF) in 1960 and by the World Chess Federation (FIDE) in 1970. Elo published the definitive treatment in 1978, The Rating of Chessplayers, Past and Present, and Elo-style ratings have since spread far beyond chess to many competitive ranking domains: Go, Scrabble, video games (Microsoft's TrueSkill is a Bayesian rating system for multiplayer games in the same tradition), American football (FiveThirtyEight, ESPN ratings), tennis (the Universal Tennis Rating), and now horse racing.

The Elo formula's appeal is its simplicity and its theoretical grounding. Each rating is, in effect, a point estimate of the player's true strength given their results to date; each update revises that estimate, in a loosely Bayesian sense, after one more observation. Two players with equal ratings are expected to share results 50/50; a 200-point gap implies an expected score of roughly 76% for the higher-rated player (wins plus half of any draws).
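The two figures quoted above follow directly from the standard Elo expected-score formula. The sketch below shows that textbook formula and the basic update rule; the K-factor of 20 is an illustrative assumption, not a value used by RaceMetrics.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score for A against B (base-10 logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating: float, expected: float, actual: float, k: float = 20.0) -> float:
    """Move the rating by K times the surprise (actual minus expected).

    k=20.0 is an assumed, illustrative K-factor.
    """
    return rating + k * (actual - expected)


print(round(expected_score(1500, 1500), 2))  # 0.5
print(round(expected_score(1700, 1500), 2))  # 0.76

# An evenly matched player who wins (actual score 1.0) gains half of K.
print(updated_rating(1500, expected_score(1500, 1500), 1.0))  # 1510.0
```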

2. Why Elo Works for Horse Racing

The same logical structure that fits chess fits racing — with one critical adaptation. A race is not a one-on-one game. A 12-runner handicap at Cheltenham is, in Elo terms, 66 simultaneous pairwise contests (each runner against each other runner). Naïve, textbook Elo can't handle this directly: it expects one winner, one loser, two ratings to update.

RaceMetrics' adaptation, in plain English: for every horse in every race, we compute the Elo expectation against the entire field — what percentage of rivals would this horse be expected to beat, given its rating versus the field's ratings? — then update each horse's rating once based on how its actual finishing position compared to that expectation.

A horse rated 1620 in a field averaging 1500 is expected to finish near the top; finishing 8th surprises the model and pulls its rating downward. A horse rated 1480 in the same field finishing 2nd surprises the model in the other direction and pulls its rating up.
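One way to make "expectation against the entire field" concrete is to average the textbook pairwise Elo expectation over every rival. The sketch below does that for the two horses in the example; the averaging rule and the simplification that all eleven rivals are rated exactly 1500 are assumptions made for illustration, not a description of RaceMetrics' production formula.

```python
from statistics import mean


def pairwise_expectation(rating: float, rival: float) -> float:
    """Textbook Elo expectation of beating a single rival."""
    return 1.0 / (1.0 + 10.0 ** ((rival - rating) / 400.0))


def expected_share_of_rivals_beaten(rating: float, rival_ratings: list[float]) -> float:
    """Assumed aggregation: the mean pairwise expectation across the field."""
    return mean(pairwise_expectation(rating, r) for r in rival_ratings)


rivals = [1500.0] * 11  # simplification: eleven rivals, all rated 1500

print(round(expected_share_of_rivals_beaten(1620.0, rivals), 2))  # 0.67
print(round(expected_share_of_rivals_beaten(1480.0, rivals), 2))  # 0.47
```

Under these assumptions the 1620-rated horse is expected to beat about two thirds of its rivals, so finishing 8th of 12 falls well short of expectation, while the 1480-rated horse is expected to beat slightly fewer than half, so finishing 2nd comfortably exceeds it.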

The same machinery runs in parallel for the connection types — trainer, jockey, owner, sire, dam, damsire. Each is an Elo pool of its own, populated by every race the connection has been involved in, going back through 25+ years of data.

3. The RaceMetrics Adaptation

Three departures from textbook Elo make racing-Elo work:

3.1 Multi-runner expectation

Instead of one Elo expectation per pair, we compute each horse's expected Percentage of Rivals Beaten (PRB) — a function of its rating versus the average of the field's ratings, weighted by field size. The actual PRB after the race (finishing position relative to field size) is compared to the expectation, and the rating updates proportionally. The K-factor — the magnitude of the update per unit of surprise — is calibrated against the historical race population, large enough to track real form change, small enough to filter noise.
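A minimal sketch of this update step follows. It assumes a logistic expectation against the field's average rating, an actual PRB of (field size minus position) divided by (field size minus one), and a K-factor of 24. All three are illustrative choices; the text does not specify the exact expectation function, how field size is weighted, or the calibrated K.

```python
def actual_prb(position: int, field_size: int) -> float:
    """Share of rivals beaten: 1.0 for the winner, 0.0 for the last finisher."""
    if field_size < 2 or not 1 <= position <= field_size:
        raise ValueError("need at least two runners and a valid finishing position")
    return (field_size - position) / (field_size - 1)


def expected_prb(rating: float, field_avg: float) -> float:
    """Assumed expectation: Elo logistic curve against the field's average rating."""
    return 1.0 / (1.0 + 10.0 ** ((field_avg - rating) / 400.0))


def updated_rating(rating: float, field_avg: float, position: int,
                   field_size: int, k: float = 24.0) -> float:
    """One update per horse per race; k=24.0 is a hypothetical K-factor."""
    surprise = actual_prb(position, field_size) - expected_prb(rating, field_avg)
    return rating + k * surprise


# 12-runner field averaging 1500
print(updated_rating(1620.0, 1500.0, position=8, field_size=12) < 1620.0)  # True
print(updated_rating(1480.0, 1500.0, position=2, field_size=12) > 1480.0)  # True
```

The sketch ignores dead heats and non-finishers, which a real implementation would need to handle.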

3.2 Seven independent Elo pools

Horse, Trainer, Jockey, Owner, Sire, Dam and Damsire each have their own independent rating population. The horse's H rating reflects its own form. The Trainer's T rating reflects the cumulative performance of every runner the trainer has saddled in the dataset. The Sire's S rating reflects the cumulative performance of all of the sire's progeny that have run. This gives a more granular form picture than the single-figure horse-only ratings published by most racing-data providers.

3.3 Combined Score weighting

For race-level prediction, we combine the six non-horse connection ratings into a single Combined Score. The weighting was derived empirically from a held-out test set, choosing weights that maximised the rank-correlation between Combined Score and finishing position:

| Connection | Combined Score weight |
|---|---|
| Owner | 20% |
| Trainer | 20% |
| Jockey | 20% |
| Dam | 18% |
| Sire | 12% |
| Damsire | 10% |

The horse's own H rating is reported separately and is deliberately not folded into the Combined Score. The two figures answer different questions: "How talented is this horse?" (H) versus "How strong is the package around this horse?" (Combined Score). A horse with a modest H rating but elite connections (high T, J, O, S, D, DS) carries a real Combined Score signal that a single horse-rating system would miss entirely.
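The weights above sum to 100%, so the simplest reading is a weighted average of the six connection ratings. The sketch below assumes exactly that; the function name, the dictionary keys, and the example ratings are hypothetical, and the source does not state whether any further normalisation is applied.

```python
from typing import Mapping

# Weights from the Combined Score table; the horse's own H rating is excluded.
WEIGHTS = {
    "owner": 0.20,
    "trainer": 0.20,
    "jockey": 0.20,
    "dam": 0.18,
    "sire": 0.12,
    "damsire": 0.10,
}


def combined_score(ratings: Mapping[str, float]) -> float:
    """Assumed form: weighted average of the six non-horse connection ratings."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise KeyError(f"missing connection ratings: {sorted(missing)}")
    return sum(weight * ratings[name] for name, weight in WEIGHTS.items())


# Hypothetical runner with strong human connections and an average pedigree
example = {"owner": 1560, "trainer": 1580, "jockey": 1540,
           "dam": 1510, "sire": 1530, "damsire": 1490}
print(round(combined_score(example), 1))  # 1540.4
```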

4. Reading the Scale

The RaceMetrics rating scale follows the chess convention of a 1500 anchor — that is, the population mean sits around 1500 by construction, and the standard deviation is calibrated so that bands at 50-point intervals correspond to roughly one quarter of a standard deviation of underlying strength:

| Rating band | Interpretation |
|---|---|
| 1600+ | Elite: top-tier performance |
| 1550 – 1599 | Strong: consistently above average |
| 1500 – 1549 | Average |
| 1450 – 1499 | Below average |
| Under 1450 | Struggling form |

The cut-offs aren't arbitrary. They were chosen to line up with observed strike-rate brackets in historical data (see Section 5). Top-rated runners scoring 1600 or more win 43.75% of the time, closer to 1-in-2 than any other tier, so that band is called Elite; those scoring between 1550 and 1599 win roughly a quarter to a third of the time, so that band is called Strong; and so on.

5. Empirical Validation

Across 25+ years of UK and Irish racing in the Proform Racing dataset, the horse with the highest Combined Score in each race wins 22.97% of the time. Given the dataset's average field size, random selection would produce a win rate of around 10 to 11%, so the top Combined Score outperforms random selection by 2.15×.

Breaking that down by the Combined Score tier of the top-rated horse in each race:

| Combined Score tier | Score range | Strike rate | vs random |
|---|---|---|---|
| Elite | 1600+ | 43.75% | 3.54× |
| Very High | 1575 – 1599 | 33.41% | 3.18× |
| High | 1550 – 1574 | 24.06% | 2.49× |
| Above Avg | 1525 – 1549 | 17.78% | 1.86× |
| Average | 1500 – 1524 | 12.38% | 1.32× |
| Below Avg | 1475 – 1499 | 7.99% | 0.87× |
| Low | 1450 – 1474 | 4.95% | 0.54× |
| Bottom | Under 1450 | 2.81% | 0.30× |

The relationship is monotonic across all eight tiers: each higher band has a higher strike rate than the band below it, with no inversions. That monotonicity is a key diagnostic of a well-behaved rating system. A rating system whose strike-rate tiers cross over is mis-calibrated; one where they stack cleanly shows that the scale orders runners the way an ordinal scale is meant to.
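The monotonicity claim and the headline multiple can be checked with a few lines of arithmetic. The sketch below copies the published tier figures and verifies that both columns fall strictly from tier to tier; the helper name is hypothetical, and the implied random baseline is a derived figure (22.97% divided by 2.15), not one reported separately in the data.

```python
# (tier, strike rate %, multiple of random) copied from the tier table
TIERS = [
    ("Elite", 43.75, 3.54),
    ("Very High", 33.41, 3.18),
    ("High", 24.06, 2.49),
    ("Above Avg", 17.78, 1.86),
    ("Average", 12.38, 1.32),
    ("Below Avg", 7.99, 0.87),
    ("Low", 4.95, 0.54),
    ("Bottom", 2.81, 0.30),
]


def strictly_decreasing(values: list[float]) -> bool:
    """True when every value is lower than the one before it (no inversions)."""
    return all(a > b for a, b in zip(values, values[1:]))


strike_ok = strictly_decreasing([strike for _, strike, _ in TIERS])
multiple_ok = strictly_decreasing([multiple for _, _, multiple in TIERS])

# Random baseline implied by the headline figures
implied_random_pct = 22.97 / 2.15

print(strike_ok, multiple_ok, round(implied_random_pct, 2))  # True True 10.68
```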

6. What Elo Doesn't Tell You

An honest accounting of what a single rating figure cannot capture:

6.1 Going, distance, course, class

A horse's RaceMetrics rating is form-based. It does not condition on the race conditions in front of it. A 1620-rated horse running on a surface it has never coped with may still underperform a 1550-rated horse perfectly suited to today's conditions. RaceMetrics layers separate tools on top of the Elo ratings — Form Expert for historical condition-specific strike rates, Pattern detection for saved profitable angles, Trip Predictor for surface/distance preference, Draw & Pace for course-specific draw biases — to answer the "is this rating likely to translate today?" question.

6.2 Small-sample priors

A brand-new sire's progeny haven't run yet. The S rating starts at the 1500 anchor and converges toward its true value as runners accumulate. For the first dozen or so progeny, the rating is dominated by the prior, not the data. This is intentional — overconfident priors built from pedigree alone would inject bias. The cost is that genuinely elite young sires take a year or two to climb the rankings; the benefit is that we don't penalise horses for being sired by a stallion with an unfortunate Group race or two.
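The slow climb from the 1500 anchor can be illustrated with a noise-free toy model. The sketch assumes a sire whose true level is 1600, fields that always average 1500, results that land exactly on the true expectation, and a K-factor of 16. Every one of these numbers is hypothetical, and each "run" stands in for one progeny result.

```python
def expected_prb(rating: float, field_avg: float = 1500.0) -> float:
    """Assumed expectation: Elo logistic curve against the field average."""
    return 1.0 / (1.0 + 10.0 ** ((field_avg - rating) / 400.0))


def rating_after(n_runs: int, true_rating: float = 1600.0,
                 start: float = 1500.0, k: float = 16.0) -> float:
    """Rating after n noise-free results from a sire whose true level is true_rating."""
    actual = expected_prb(true_rating)  # each result lands exactly on true strength
    rating = start
    for _ in range(n_runs):
        rating += k * (actual - expected_prb(rating))
    return rating


for n in (0, 12, 100, 1000):
    print(n, round(rating_after(n), 1))
```

In this toy model the rating has covered only about a quarter of the gap to its true value after a dozen results, which is the sense in which early ratings are dominated by the prior. Real results are noisy, so actual convergence would be less smooth.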

6.3 Recency and regime change

Textbook Elo applies the same update weight to every race, however old. Yet form genuinely degrades and recovers, and a horse that ran in 2018 is not the same horse running in 2026. RaceMetrics' practical implementation includes recency weighting so that the most recent results dominate while older races still contribute proportionally. This is a compromise between full-history accuracy and current-form responsiveness.
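The source does not say how the recency weighting is computed. One common approach, shown here purely as an assumption, is exponential decay with a half-life: a result's weight halves for every fixed number of days since the race. The function names and the 365-day half-life are hypothetical.

```python
def recency_weight(age_days: float, half_life_days: float = 365.0) -> float:
    """Assumed scheme: the weight halves every half_life_days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return 0.5 ** (age_days / half_life_days)


def weighted_surprise(results: list[tuple[float, float, float]],
                      half_life_days: float = 365.0) -> float:
    """Recency-weighted mean of (actual - expected).

    Each result is a tuple of (age_days, actual_prb, expected_prb).
    """
    if not results:
        raise ValueError("results must not be empty")
    weights = [recency_weight(age, half_life_days) for age, _, _ in results]
    total = sum(weights)
    return sum(w * (actual - expected)
               for w, (_, actual, expected) in zip(weights, results)) / total


# A strong run a month ago outweighs an equally poor run from over five years ago.
history = [(30.0, 0.9, 0.5), (2000.0, 0.1, 0.5)]
print(weighted_surprise(history) > 0)  # True
```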

6.4 Single-figure simplification

Any rating is a point estimate of a probabilistic quantity. A 1620 rating doesn't mean the horse will win — it means the model's central expectation, given everything it knows, places this horse higher than a 1550-rated horse. Variance, draw, going, jockey choice, traffic and pace dynamics all matter on the day, and none are inside the Elo number. They are inside the surrounding tools.

7. Comparison with Other Rating Systems

Each of the major UK racing rating systems answers a slightly different question. They are not directly comparable across scales:

| System | Basis | Updated | Scale |
|---|---|---|---|
| BHA Official Rating (OR) | Handicapper judgement, anchored on weight | Weekly | 0 – 140+ |
| Topspeed (TS) | Time-based: finishing time vs par | Per race | Internal |
| Form-based commercial ratings | Subjective form + context | Per race | Internal |
| RaceMetrics (H, T, J, O, S, D, DS) | Elo-style: form vs opposition quality | Daily | 1500-anchored |

RaceMetrics' structural advantage is breadth and statistical foundation: separate ratings for all seven connection types, updated daily, on a single scale calibrated against 25+ years of out-of-sample race results. The handicapper's OR is an authority figure; the Elo system is a statistical estimator. Different tools for different jobs — and ideally both inputs to a serious form student's judgement.

8. References

  1. Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing, New York.
  2. Glickman, M. E. (1995). A comprehensive guide to chess ratings. American Chess Journal, 3, 59-102.
  3. Glickman, M. E. (2012). Example of the Glicko-2 system. Boston University.
  4. Herbrich, R., Minka, T., & Graepel, T. (2007). TrueSkill: A Bayesian skill rating system. Advances in Neural Information Processing Systems, 19.
  5. Sismanis, Y. (2010). How I won the "Chess Ratings — Elo vs the Rest of the World" competition. arXiv:1012.4571.
  6. World Chess Federation (FIDE). FIDE Rating Regulations. Current edition. FIDE Handbook.
  7. Wikipedia. Elo rating system; Sports rating system.
