CAN ELO RATINGS BE IMPROVED? A CASE STUDY WITH ELITE CHESS PLAYERS

Originally designed to reflect past performance, chess ratings are now widely used as measures of players' strength, with important roles in tournament scheduling, advertising and prize shares. The Elo system has been officially adopted by the World Chess Federation (FIDE). We used Bayesian analysis of actual data from elite chess players to fit parametric statistical models that could support proposals for improving the rating system. Although most of the options considered are not new, being based on well-known preference models, the use of a weighted likelihood function to emulate dynamic rating systems via Bayesian inference is novel. We compared descriptive ability using marginal-likelihood-based information criteria; the Akaike information criterion was used to compare predictions. Many of the considered options improve on Elo ratings, and there is strong evidence that dynamic models accounting for both white advantage and the propensity to draw would result in more accurate systems.


Introduction
Chess is one of the most popular games in the world, played by millions of people (formally or informally). There is a vast literature on chess: thousands of books and magazines, websites and databases of games played from the 15th century onwards, and many sources of information on tournaments and player histories. The early introduction of rating systems to estimate players' relative strengths played an important role in chess popularity. The World Chess Federation (FIDE) and many national federations, like the United States Chess Federation (USCF), keep track of players' ratings, providing very important information not only for fair pairing systems and tournament scheduling but also for marketing and sponsorship purposes.
The system used by FIDE was developed by the Hungarian-American physicist Arpad Elo, and it assumes that players' ratings are random variables and that the expected result is a function of rating differences. Elo developed updating formulae using a Gaussian distribution with a convenient standard deviation (and a logistic approximation). The chess community follows official ratings, and chess professionals use them to describe past performance and to make predictions. From a statistical viewpoint, Elo's system is a particular case of preference models for paired comparisons (BRADLEY and TERRY, 1952). One of the main generalizations of those models contemplates tie probabilities, as in DAVIDSON (1969), with an extra "tie propensity" parameter.
It is straightforward to include in those models a general or player-specific parameter describing the "home advantage" of the player conducting the white pieces. It is also possible to tune rating variances or account for player activity by using alternative formulae for the expected result. However, most of the comparisons on real data sets using modifications of the ELO system are driven by practical considerations or computational convenience (GLICKMAN and JONES, 1999; SISMANIS, 2010; KAGGLE, 2020).
Criticism of FIDE ratings is based on well-documented effects, like the larger tendency to draw among elite chess players, the advantage of playing white, and the variability in players' activity. This has been addressed by many authors, and some very effective alternative systems have been worked out. The "Glicko" system deals with heterogeneous variances of the rating parameters in a Bradley-Terry type of model (GLICKMAN, 1999). A fully Bayesian approach using concepts from the Glicko system is implemented in the "TrueSkill" system by Microsoft Research (HERBRICH et al., 2007). More frequent evaluation periods and tuning factors for beginners and experts have also been adopted and have in fact changed FIDE rating calculations (FIDE, 2020a; USCF, 2020).
In what follows, we analysed actual data from elite chess players. Our objective is to find parametric statistical models that could support proposals for changes in the FIDE system. The options considered are based on fitting modifications of the Bradley-Terry (1952) and Davidson (1969) preference models to this unbalanced source of data. Comparison with "true" dynamic models in the literature was not considered in this context, since it would be impractical: they depend on a wider choice of prior distributions and tuning parameters, such as the frequency of updates and shrinkage factors. Our choice instead was to emulate dynamic rating systems so as to make a fair comparison of the different proposals. We carried out Bayesian inference using weighted likelihood functions and proper prior distributions that were elicited and then made vague by arbitrarily increasing their variances.

Data
From an initial set of 28,042 official games in which both players had FIDE rating > 2,500 (Grandmaster level), played from January 2010 to November 2012, we selected those involving players who would play the 2013 Grand Prix series (46 players). This resulted in a Training set with 6,807 games that was used to fit all considered models.
Testing sets used data from 2013 (Grand Prix, Candidates Tournament, World Cup and other major tournaments) to compare the predictive abilities of the models. This was done in two different ways: a) using games in which both players had their rating parameters previously estimated in the Training set, resulting in 411 games played by 36 players (Testing set A); b) using games in which at least one of the players had rating parameters estimated in the Training set, resulting in 732 games. For those games, 37 players had rating parameters estimated in the Training set; for the remaining 51 we used their current FIDE ratings (Testing set B).
Observable data included the game date, the identification of the players conducting the white and black pieces (tournament design), and the game result recorded as 0, 0.5 or 1, respectively, for a defeat, draw or win for the white pieces. Game dates were used in time-dependent likelihood functions that emulate rating dynamics. Table 3 in the supplementary material lists names, country of origin, and FIDE ratings (as of December 2012) of all considered elite players. Complete descriptions (full games, biographies, etc.) can be found in (CHESSBASE, 2020; CHESSGAMES, 2020; CHESSRESULTS, 2020; FIDE, 2020b).

Rating models
The ELO system, the proposal from BRADLEY and TERRY (1952) and derived models (hereinafter called BT) are directly related. Let γ_i be the rating parameter in BT and R_i its translation to the ELO system scale. Let y_ij = 1 be the observed result for a win of player i over player j, y_ij = 0.5 for a draw and y_ij = 0 for a loss of player i. ELO (1978) formulated his model for differences of players' ratings, each normally distributed with standard deviation 200. He then approximated the expected score of player i against player j under a Gaussian distribution with mean zero and standard deviation √(2 × 200²) ≈ 282.8 by the inverse logit function with normalizing factor 400, as follows:

E(y_ij) = 1 / (1 + 10^{D_ij / 400}),   (1)

in which D_ij = R_j − R_i. Note that the original model has no draw probability; this result is counted as half a win for each player. Thus, rescaling γ_i = R_i ln(10)/400,

E(y_ij) = e^{γ_i} / (e^{γ_i} + e^{γ_j}),   (2)

which is the same as the preference probability for i over j in BT. For ease of reading by chess players and arbiters, we present rating estimates converted to the FIDE-ELO scale, as above.
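As a sanity check on the correspondence between equations (1) and (2), the rescaling can be verified numerically. The sketch below is ours (the paper's implementation is in R); function names are illustrative.

```python
import math

def elo_expected(r_i, r_j):
    """Elo expected score for player i against j, Eq. (1)."""
    d_ij = r_j - r_i
    return 1.0 / (1.0 + 10.0 ** (d_ij / 400.0))

def bt_preference(gamma_i, gamma_j):
    """Bradley-Terry preference probability for i over j, Eq. (2)."""
    return math.exp(gamma_i) / (math.exp(gamma_i) + math.exp(gamma_j))

def elo_to_gamma(r):
    """The rescaling gamma = R * ln(10) / 400 that identifies the two models."""
    return r * math.log(10.0) / 400.0
```

With this rescaling, `elo_expected(r_i, r_j)` and `bt_preference(elo_to_gamma(r_i), elo_to_gamma(r_j))` agree for any pair of ratings, which is the identity claimed in the text.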
Modifying the model to include a single parameter δ, representing a common advantage of playing white for all players, changes the linear predictor to D*_ij = R_j − R_i − δ. Similarly, it is possible to define a new linear predictor D**_ij = R_j + δ_j − R_i − δ_i, allowing a different white advantage for each player.
Under this specification, the likelihood function for BT is written as

L(R, δ | y) = ∏_{i,j,k} π_ijk^{y_ijk} (1 − π_ijk)^{1 − y_ijk},   (3)

where π_ijk is the expected result in favor of the white player and y_ijk is the result of the k-th game between players i (as white) and j (as black). The proposal from DAVIDSON (1969) and derived models are hereinafter called DV. They differ from BT by having three preference classes, directly modeling the probability of a draw. Since it is not straightforward to re-scale the parameters of those models to the FIDE-ELO scale, we kept them in the original scale.
Below we describe the models with a common white advantage parameter δ. The original model has only one parameter λ, related to the drawing propensity, and can be recovered by setting δ = 0. To obtain models with a specific white advantage for each player, we simply replace δ by δ_i and δ_j. The preference probabilities are

π_ij = P(i beats j) = e^{γ_i + δ} / K_ij,
π_ij0 = P(i draws with j) = e^{λ + (γ_i + γ_j + δ)/2} / K_ij,
π_ji = P(i loses to j) = e^{γ_j} / K_ij,   (4)

where K_ij = e^{γ_j} + e^{γ_i + δ} + e^{λ + (γ_i + γ_j + δ)/2}. For DV, the likelihood function can be written as

L(γ, δ, λ | y) = ∏_{i,j,k} π_ijk^{I{y_ijk = 1}} π_ij0k^{I{y_ijk = 0.5}} π_jik^{I{y_ijk = 0}},   (5)

where I{·} is an indicator of the game result. Gaussian prior distributions were used for ratings and white advantages. Their means were elicited, but variances were set high to allow easy learning from the data. For the BT models, the µ hyperparameter was the FIDE-ELO quantile of the 100th best player and σ was twice the value used in the ELO system; this yields a proper prior distribution with little information on ratings. For δ, we used a small but positive mean, as expected from the chess literature, reflecting about a 7% increase in the winning probability for white, with a large standard deviation to allow for negative values; this again gives a proper prior with little information. The specific distributions chosen for BT are: R_i ∼ N(µ = 2705, σ = 400), δ_i ∼ N(µ = 50, σ = 40) and δ ∼ N(µ = 50, σ = 40).
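The three probabilities in equation (4) can be computed directly, which also makes it easy to check that they sum to one. The sketch below is ours and only illustrates the formula; the rounded values γ ≈ 18, γ ≈ 16 and λ ≈ 1.1 used in the usage note come from the examples later in the text.

```python
import math

def davidson_probs(gamma_i, gamma_j, delta=0.0, lam=0.0):
    """Win/draw/loss probabilities for white player i under the Davidson
    model of Eq. (4), with white advantage delta and draw propensity lam."""
    win = math.exp(gamma_i + delta)
    draw = math.exp(lam + (gamma_i + gamma_j + delta) / 2.0)
    loss = math.exp(gamma_j)
    k = win + draw + loss  # the normalizing constant K_ij
    return win / k, draw / k, loss / k
```

For instance, `davidson_probs(18.0, 16.0, delta=0.0, lam=1.106)` gives an expected score (win plus half the draw probability) of roughly 0.69 for the stronger player, with the draw more likely than a loss, consistent with the drawing tendency discussed in the Results.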
The joint posterior distribution of each model is the product of Gaussian priors and Bernoulli (BT) or multinomial (DV) likelihood terms. The full conditional distributions are proportional to the joint and can be written as follows.

BT: the distribution of the rating of the i-th player (R_i) given all other players' ratings (R_{−i}) and the white advantage parameters δ:

p(R_i | R_{−i}, δ, y) ∝ exp{−(R_i − µ)² / (2σ²)} ∏_{k ∈ G(i)} π_k^{y_k} (1 − π_k)^{1 − y_k},   (6)

where G(i) indexes the games played by player i. The distribution of the white advantage parameter of the i-th player (δ_i) given all other players' (δ_{−i}) and all players' ratings (R):

p(δ_i | δ_{−i}, R, y) ∝ exp{−(δ_i − µ_δ)² / (2σ_δ²)} ∏_{k ∈ G(i)} π_k^{y_k} (1 − π_k)^{1 − y_k}.   (7)

In the same fashion, the full conditional distributions for DV are as follows. The distribution of the rating of the i-th player (γ_i) given all other players' ratings (γ_{−i}), the white advantage parameters δ and the drawing propensity parameter λ:

p(γ_i | γ_{−i}, δ, λ, y) ∝ exp{−(γ_i − µ)² / (2σ²)} ∏_{k ∈ G(i)} π_ijk^{I{y_k = 1}} π_ij0k^{I{y_k = 0.5}} π_jik^{I{y_k = 0}};   (8)

the distribution of the white advantage parameter δ_i of the i-th player given all other players' (δ_{−i}), the players' ratings (γ) and λ:

p(δ_i | δ_{−i}, γ, λ, y) ∝ exp{−(δ_i − µ_δ)² / (2σ_δ²)} ∏_{k ∈ G(i)} π_ijk^{I{y_k = 1}} π_ij0k^{I{y_k = 0.5}} π_jik^{I{y_k = 0}};   (9)

and the distribution of the drawing propensity parameter λ given the players' ratings (γ) and the white advantage parameters δ:

p(λ | γ, δ, y) ∝ p(λ) ∏_k π_ijk^{I{y_k = 1}} π_ij0k^{I{y_k = 0.5}} π_jik^{I{y_k = 0}}.   (10)

Sampling from the joint posterior distribution
All inference on the descriptive properties of the models was carried out on samples from the joint posterior distribution of all parameters given the training data. Samples from the predictive distributions of each model were also drawn, which made it possible to evaluate model performance on the Testing sets. Markov chain Monte Carlo methods were used to sample the joint posterior distributions: Metropolis-Hastings steps (HASTINGS, 1970) with adaptive candidate-generating functions embedded within a Gibbs sampling algorithm (GILKS et al., 1995).
Normal distributions were used to generate candidate points, with their hyperparameters updated after a window of 1,000 samples. In what follows, we present marginal summaries from the posterior distributions of each model and its parameters.
For each parameter, 3 parallel chains with 130,000 iterations were drawn. We discarded the first 50,000 samples as burn-in and then applied a thinning interval of 20 iterations, yielding a final sample size of 4,000 (×3).
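The Metropolis-within-Gibbs scheme above can be sketched compactly. The code below is an illustrative Python reduction of the BT case with a common δ (the actual implementation is in R, and uses adaptive proposals rather than the fixed step sizes assumed here); all names are ours.

```python
import math
import random

def bt_prob(r_white, r_black, delta):
    """White win probability under BT on the FIDE-ELO scale; draws count as half wins."""
    return 1.0 / (1.0 + 10.0 ** ((r_black - r_white - delta) / 400.0))

def log_post(ratings, delta, games):
    """Unnormalized log posterior: the Gaussian priors from the text times the BT likelihood."""
    lp = sum(-0.5 * ((r - 2705.0) / 400.0) ** 2 for r in ratings.values())
    lp += -0.5 * ((delta - 50.0) / 40.0) ** 2
    for white, black, y in games:  # y in {0, 0.5, 1} from white's viewpoint
        p = bt_prob(ratings[white], ratings[black], delta)
        lp += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return lp

def sweep(ratings, delta, games, step=30.0, rng=random):
    """One Metropolis-within-Gibbs sweep: random-walk update of each rating, then of delta."""
    for i in list(ratings):
        proposal = dict(ratings)
        proposal[i] = ratings[i] + rng.gauss(0.0, step)
        if math.log(rng.random()) < log_post(proposal, delta, games) - log_post(ratings, delta, games):
            ratings = proposal  # accept
    d_new = delta + rng.gauss(0.0, step / 3.0)
    if math.log(rng.random()) < log_post(ratings, d_new, games) - log_post(ratings, delta, games):
        delta = d_new
    return ratings, delta
```

In practice one would run several long chains, discard a burn-in period and thin the output, exactly as described above; the sketch keeps only the accept/reject mechanics.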
Algorithms were implemented in R (R CORE TEAM, 2020) and sampling diagnostics were carried out using the RAFTERY and LEWIS (1992), GELMAN and RUBIN (1992) and BROOKS and GELMAN (1998) criteria implemented in the coda package (PLUMMER et al., 2006).

Weighted likelihood to emulate dynamic models
A time-related weight ω, a function of t_min and t_max — respectively, the time of the game being analyzed and the time taken as reference for inference in the Training set — was created following SISMANIS (2010) to weight the likelihood function so that old games carry less weight than recent ones.
Using ω in the likelihood function provides a proxy for true dynamic models. Other ways to implement dynamic models would require carrying out the estimation by computing posterior averages (or modes) periodically, or building a fully parameterized model in which each rating has a parameter for each time period. Those models would be too much of a computational burden for the purpose of this paper, whose main objective is to elucidate the effect of such a correction on rating models. We therefore implemented only time weighting, leaving actual dynamic models for future research. However, we will refer to models with ω weighting the likelihood as "dynamic models", in contrast to "static models" with ω = 1. Some results are presented for all 12 models (6 static and 6 dynamic), while others are presented only for the best version of BT or DV.
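The mechanics of the weighting can be illustrated as follows. The specific functional form of ω below is an assumption on our part, in the style of the Elo++ weights of SISMANIS (2010), not the paper's exact formula; `t_start` (the earliest game date) and the quadratic power are likewise illustrative.

```python
import math

def time_weight(t_game, t_start, t_ref, power=2.0):
    """Illustrative Elo++-style recency weight (an assumed form):
    games near the reference date t_ref get weight close to 1,
    the oldest games (near t_start) get the smallest weight."""
    return ((1.0 + t_game - t_start) / (1.0 + t_ref - t_start)) ** power

def weighted_log_lik(games):
    """Each game contributes omega times its log-likelihood term;
    games are (omega, white_win_prob, result) triples. Setting
    omega = 1 for every game recovers the static model."""
    ll = 0.0
    for omega, p, y in games:
        ll += omega * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return ll
```

Whatever the exact form of ω, the key property is the one tested here: the weight is increasing in the game date and equals 1 at the reference date, so recent evidence dominates the fit.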

Decision on best models
The decision on the best models, both for description and for prediction, was based on direct estimates of AIC (AKAIKE, 1974) or its Monte Carlo approximation AICM (RAFTERY and NEWTON, 2007). For prediction in each scenario, we used two complementary ideas on how to handle incomplete data: a) the joint predictive distributions of the possible game results in Testing sets A and B were worked out, yielding probability estimates of p(y* | y, θ), in which y* is the new result actually observed in the testing sets; b) based on p(y* | y, θ), we evaluated information or distance criteria at the realized y* under each model. These results are numerical approximations of predictive AIC, AICM and De Finetti distance measures, in each case.
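The two comparison devices reduce to simple computations once the predictive probabilities are in hand. The sketch below is ours: `aic` is the standard criterion and `predictive_log_score` accumulates the log probability a model assigned to each realized result (the ingredient behind the predictive criteria in (b)).

```python
import math

def aic(log_lik, k):
    """Akaike information criterion, 2k - 2 log L (lower is better)."""
    return 2.0 * k - 2.0 * log_lik

def predictive_log_score(probs):
    """Sum of log predictive probabilities assigned to realized results;
    probs[k] is the model's probability for the outcome actually
    observed in game k of the testing set."""
    return sum(math.log(p) for p in probs)
```

A model beats a naive reference (e.g. the equiprobable 1/3, 1/3, 1/3 rule for DV described below) when its summed log score over the testing games is higher.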
For predictions using Testing set B, for each player who had not been monitored, we used the ELO rating given by FIDE on the game day (rescaled, when needed, to match the DV model specification). For the δ parameter, the following proposals were used: (i) δ_i = 0, representing no advantage for conducting the white pieces; (ii) δ_i = µ_δ, the posterior marginal average of the advantage of playing white found for the model with a single δ; (iii) δ_i = −δ_j, taking the estimate for player j as a reverse effect on his opponent.
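The three imputation proposals amount to a one-line choice per new player. A minimal sketch (names and structure are ours, purely illustrative):

```python
def impute_white_advantage(option, delta_opponent=0.0, mu_delta=0.0):
    """Imputed white advantage for a player absent from the Training set,
    under the three proposals in the text: (i) zero, (ii) the posterior
    mean from the single-delta model, (iii) minus the opponent's estimate."""
    if option == "zero":
        return 0.0
    if option == "average":
        return mu_delta
    if option == "reverse":
        return -delta_opponent
    raise ValueError(f"unknown option: {option!r}")
```

The imputed value then enters the linear predictor exactly as an estimated δ_i would.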
Each result was compared to a corresponding reference model built from educated guesses (naive estimation of probabilities). The assumptions for each of those references are: (I) a win, a draw and a loss are equiprobable, meaning a basic 50%-50% split for BT or 1/3, 1/3 and 1/3 for DV; (II) summaries of the historical proportions of wins, draws and losses from the white player's perspective — this "proportional" model can be used to check DV; (III) the same as the previous case, but counting a draw as half a loss and half a win, to check BT, as it is basically a rescaled ELO.
As a special case to check BT, we used probabilities derived from the FIDE ratings published on the game day. This represents a true dynamic model that is expected to outperform its static counterpart. To keep the predictions fair, however, we used the FIDE ratings from December 2012 for Testing sets A and B.

Results and discussion
Samples from the joint posterior distributions had good statistical properties and could be treated as independent. Examples are depicted in the trace plots of Figures 1 (Michael Adams, BT) and 2 (Shakhriar Mamediarov, DV). This made it simpler to evaluate all post hoc statistics and to plot marginal posterior distributions. Prior information was not relevant to inference, as can be seen in the densities presented in Figures 3 and 4 for the respective players and models.

Posterior analysis
In Table 1 we present estimates of AIC for the different models fitted to the Training set and evaluated in the Testing sets. Such estimates can be used for model comparison by themselves, or compared to the static references and to the official FIDE-ELO ratings used as a dynamic reference.
As we can see, all the proposed models outperform their equivalent reference models. Regarding BT, the best choice for both static and dynamic models was to use a single parameter for the advantage of playing white. The posterior mean of the marginal distribution of this parameter in the static model was δ = 41.7, with a 95% HPD interval of [33.5; 50.6]. In the dynamic model it was estimated as δ = 43.7, with HPD interval [37.1; 50.1]. This roughly means that a player gains a 41.7-point increase in ELO rating when assigned the white pieces (static model), or a 43.7-point increase under the dynamic model.
As an example of the use of the static model, in a game between players A (R_i = 2,686) and B (R_j = 2,715), the first player would have a 6 percentage point increase in expected winning probability by playing white, or equivalently, an effective strength of 2,727.7. The same example in the dynamic model would give a 2,729.7 strength and a 6.3 percentage point increase in the winning chance for A.
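The arithmetic of this example can be checked directly from equation (1) and the static posterior mean δ ≈ 41.7 reported above (the helper function is ours):

```python
def expected_score(r_white, r_black, delta=0.0):
    """Elo/BT expected score for white, Eq. (1), with white advantage delta."""
    return 1.0 / (1.0 + 10.0 ** ((r_black - r_white - delta) / 400.0))

r_a, r_b, delta = 2686.0, 2715.0, 41.7

base = expected_score(r_a, r_b)         # expected score ignoring colors, ~0.458
adv = expected_score(r_a, r_b, delta)   # expected score playing white, ~0.518
gain = adv - base                       # ~0.06, the ~6 point increase in the text
# Equivalently, white plays as if rated r_a + delta = 2,727.7
```

Repeating the computation with the dynamic estimate δ = 43.7 reproduces the 2,729.7 effective strength in the same way.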
Figure 5 shows that ELO ratings are equivalent to BT models and underestimate the probability of a white victory. BT models with a common white advantage parameter showed a better fit to the actual data. ELO ratings especially underestimate the white advantage at large rating differences (∆_R < −200 or ∆_R > 200). All models derived from Davidson's basic version were better than their respective reference models. The best model has player-specific white advantage parameters and a single drawing parameter, estimated as λ = 1.098 with 95% HPD [1.04; 1.15] for the static model and as λ = 1.106 with 95% HPD [1.06; 1.14] for the model with weighted likelihood.
Those estimates indicate a prevalent drawing tendency in games played at this level. Comparing them with the rating estimates in Figure 6, obtained from the best models (static and dynamic versions), gives a good indication of the extent to which those players are prone to draw.
For instance, take the probabilities of the expected outcomes when current world champion Magnus Carlsen (with the highest rating, γ ≈ 18 or ELO ≈ 3120) plays his predecessor Ruslan Ponomariov (with rating estimated as γ ≈ 16 or ELO ≈ 2780). Ignoring the white advantage, the expected result would be approximately 70%, but a win for white would be just as likely as a draw, each with probability around 46.8%.
Consider now two much weaker players with the same rating difference, let's say γ_1 = 12 (ELO ≈ 2084) and γ_2 = 10 (ELO ≈ 1787). Even though the rating difference on the ELO scale is smaller, due to the proportionality factor the winning chances for white are higher: the expected result would be about 78.8%, with a winning probability of 66.5% for the first player. This is in strong agreement with what is observed in practice.
The white advantage parameters estimated by the two best versions of DV are depicted in Figure 7. Here we found considerable disagreement between the two models. For the static model, the estimates seem to imply that a small group of players benefits considerably from playing white; for most of the other players, credibility intervals include δ = 0. It is worth noting that only Ponomariov and Andreikin apparently do not perform better than expected as white. It is possible that their results are biased due to weaker opposition, but the results for Mamediarov, Bacrot, Caruana, Shirov and Kramnik are very consistent with practical observation. In fact, Topalov's estimate having such a large credibility interval (with δ = 0 likely) is slightly unexpected, but may reflect his diminished activity in the period.
The estimates from the dynamic model, on the other hand, show a larger group of players who clearly benefit from playing white, with Ponomariov the lone exception and Topalov the most prone to win as white. Again, based on reports from numerous tournaments and the psychological attributes fellow players assign to ex-world champions, this is a plausible result.
Posterior distributions were far from the prior averages, a clear sign that the likelihood dominated the prior information.

Testing set A
In this section we evaluate predictions for new games involving players whose parameters were estimated in the Training set. In all considered cases, the BT dynamic versions outperform their static counterparts. One BT and two DV models beat their respective references. In both cases, the best model was the dynamic version, with a time-weighted likelihood and a common white advantage parameter. Note that ELO ratings are truly dynamic in their implementation and outperform the other references. This result agrees with recent findings from the Deloitte challenge using dynamic BT models (SISMANIS, 2010), and is a strong hint that true dynamic models (with ratings evaluated continuously) would be better proxies for chess strength than the current system.

Testing set B
In Testing set B there are games in which at least one of the players (the "first player") has parameters estimated in the Training set. To evaluate predictions it was necessary to impute values for "new players" who were not in the Training set. We used the FIDE-ELO ratings declared for the game day and three options for the individual δ_j parameters, namely:
• δ_j = 0: no white advantage for the new player;
• δ_j = −δ_i: minus the advantage estimated for the "first player";
• δ_j = µ_δ: the average of the advantages estimated for all players.
For both BT and DV models, the best predictions come from the versions with a common white advantage parameter. Those are also the only models that outperform all the references on all the criteria. This indicates that if, for any reason, individual δ parameters are badly estimated (or estimates are unavailable), a simpler model with a common parameter δ should be used for better prediction.
The good features of the dynamic ELO strategy can easily be improved with a single parameter for white advantage. This could be evaluated on different bases and for different purposes, for instance, over games played in a given year, tournament or league.
The use of white advantage parameters improves the fit in all versions. For describing past performance with BT we should keep a single white advantage parameter (δ), but with DV, individual parameters for each player (δ_i) are the best choice.
A fully parameterized DV may shed light on an interaction between individual white advantage and the well-known effect of the drawing probability (λ) increasing with the average strength of the players. Those models are harder to fit, but could help to better describe past performance and also support more accurate predictive systems.
We successfully implemented Bayesian inference for all considered models using R (R CORE TEAM, 2020). Efficiency of the algorithms was not a major concern, as we are not advocating the direct use of these models in a new rating system. However, the analysis presented here reinforces the idea that the FIDE-ELO system should be revised: dynamic models with parameters for white advantage and for drawing probabilities would be a good basis for an alternative system.

Conclusion
Comparing past results and predictions, we showed that the use of weighted likelihood functions is helpful for verifying the advantages of dynamic models.
A new system with a single white advantage parameter (δ) could be used to improve ELO. Either BT or DV could serve as the basis for such a system: the former is closer to the current FIDE ratings, but the latter has the best predictive potential.

Acknowledgments
We would like to thank the reviewers and editors for their comments and suggestions. The authors also thank CAPES for the doctoral scholarship granted to the first author, and FAPEMIG for the research aid granted to the second author.