Analyzing chess ratings


Chess ratings

In chess, a player’s rating is a number that estimates that player’s strength relative to other players. The higher a player’s rating relative to an opponent’s, the better that player is expected to perform. For example, a player rated 1500 is expected to be quite a bit better than a player rated 1000.

Accurate ratings are important for determining pairings between players in competitions or in online pool play. A rating by itself has no intrinsic meaning and is arbitrary; however, chess ratings typically fall between 0 and 3000, and the population of players the ratings are calculated from characterizes the actual distribution of ratings, among other factors.

For example, the chess.com blitz (online fast chess) pool regularly sees players with ratings well above 3000, but in the FIDE classical pool (over the board classical chess) there are no ratings above 3000.

When considering players with different ratings, it is not enough to know whether one player is expected to be better or worse than another. It is important to know exactly how much better or worse a given player is relative to another so that ratings can be updated appropriately after a game. Chess rating systems account for this, as well as for the uncertainty inherent in a player’s assigned rating at a given moment. Is a player rated 1810 really that much better than one rated 1800? What about a player rated 1900? For any given game that question is hard to answer; over a large sample of games it becomes easier, as variability is averaged out and player performance converges about a mean.

To investigate the characteristics of chess ratings I downloaded a sample of 5 million games played during October of 2021 from Lichess, a popular chess server on the internet. Lichess uses the Glicko rating system, developed by Mark Glickman. The Glicko rating system has a formula to calculate the expected result of a game between two players with given ratings, shown below in equation (1).

\[\begin{equation} \begin{split} E &= \frac{1}{1+10^{-g\left(\sqrt{\sigma_{w}^{2} + \sigma_{b}^{2}}\right)(r_w - r_b)/400}} \\ g(x) &= \frac{1}{\sqrt{1+\frac{3\left(\frac{\ln 10}{400}\right)^2 x^2}{\pi^2}}} \end{split} \tag{1} \end{equation}\]

Here \(r_b\) and \(r_w\) represent the black and white ratings respectively, and \(\sigma_b\) and \(\sigma_w\) represent the rating deviations, a measure of the uncertainty in the black and white ratings respectively. Lichess bounds the rating deviation between 30 and 500. A rating deviation of 500 represents the greatest possible uncertainty in a rating, and all new players are assigned a rating deviation of 500. The more games played, the lower the rating deviation becomes, down to a minimum of 30. Lichess sets the minimum at 30 to ensure that ratings can still be meaningfully adjusted after wins and losses. How many rating points are gained or lost after a game is a function of the player ratings and the rating deviations.
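As a direct translation of equation (1), here is a minimal Python sketch (the function and variable names are my own, not Lichess code):

```python
import math

Q = math.log(10) / 400  # the Glicko scaling constant, q = ln(10)/400

def g(x):
    # Dampening factor: shrinks the effect of a rating gap as uncertainty grows.
    return 1 / math.sqrt(1 + 3 * (Q * x / math.pi) ** 2)

def expected_score(r_w, r_b, sigma_w, sigma_b):
    # White's expected score per equation (1).
    sigma = math.sqrt(sigma_w ** 2 + sigma_b ** 2)
    return 1 / (1 + 10 ** (-g(sigma) * (r_w - r_b) / 400))
```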

Consider a new Lichess account with a rating of 1500 and a rating deviation of 500. A 95% confidence interval which contains the player’s true strength is \(\{1500 \pm 1.96(500)\} = \{520, 2480\}\). In contrast, consider a seasoned player with a rating of 1900 and rating deviation of 30. Then a 95% confidence interval containing the player’s true strength is \(\{1900 \pm 1.96(30)\} = \{1841.2,1958.8\}\). A new player’s true strength is extremely uncertain whereas the experienced player’s strength is reasonably certain.
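The same intervals can be verified with a couple of lines of Python:

```python
def rating_interval(rating, deviation, z=1.96):
    # Approximate 95% confidence interval for a player's true strength.
    return rating - z * deviation, rating + z * deviation

print(rating_interval(1500, 500))  # ~(520, 2480)
print(rating_interval(1900, 30))   # ~(1841.2, 1958.8)
```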

Figure 1 shows the expected result for a white player rated 1500 with \(\sigma_w = 30\) against all possible black ratings with respect to \(\sigma_b = \{30,500\}\).

Figure 1: Expected results for 1500 rated player against all ratings with respect to most and least certain rating deviation.

The S-curves in the above figure illustrate how the white player’s scoring chances change with respect to their opponent’s rating. When black is rated exactly the same as white, the expected result is an equal score for both players, regardless of the rating deviation. However, the scoring chances vary with respect to \(\sigma_b\). When black’s rating is most certain, white has better scoring chances whenever they outrate black; when black’s rating is least certain, white still has better scoring chances, but to a lesser degree. The curve is mirrored, so the inverse also holds when black outrates white.
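Plugging a few black ratings into the expected_score sketch from above shows the effect numerically (values rounded):

```python
for r_b in (1200, 1500, 1800):
    print(r_b,
          round(expected_score(1500, r_b, 30, 30), 2),   # black's rating most certain
          round(expected_score(1500, r_b, 30, 500), 2))  # black's rating least certain
# 1200: ~0.85 vs ~0.71 -- white's edge shrinks when black's rating is uncertain
# 1500:  0.50 either way
# 1800: ~0.15 vs ~0.29 -- white's deficit shrinks as well
```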

It is clear that introducing uncertainty into the ratings of players has an effect on winning chances in the Glicko model.

Analyzing rating data

Exploratory analysis

Lichess distributes games in PGN format, which is not directly workable for data analysis. Using the python-chess library, I extracted the metadata I wanted (game result, player ratings, and piece colour) from the sample of 5 million Lichess games, then imported the cleaned dataset into R.
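A condensed sketch of that extraction step, assuming a decompressed monthly dump; the file name and CSV layout here are placeholders (python-chess’s `chess.pgn.read_headers` parses only the header block of each game, which is all that is needed):

```python
import csv
import chess.pgn

# Stream games one at a time and keep only the metadata needed for analysis.
with open("lichess_2021_10.pgn") as pgn, open("games.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["white_elo", "black_elo", "result"])
    while (headers := chess.pgn.read_headers(pgn)) is not None:
        writer.writerow([headers.get("WhiteElo"),
                         headers.get("BlackElo"),
                         headers.get("Result")])
```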

It is useful to explore the summary statistics for the white and black rating distributions.

Table 1: Summary statistics for white ratings.

| Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
|---------|--------------|--------|------|--------------|---------|
| 600 | 1415 | 1689 | 1677 | 1943 | 3310 |

Table 2: Summary statistics for black ratings.

| Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
|---------|--------------|--------|------|--------------|---------|
| 600 | 1414 | 1688 | 1676 | 1943 | 3312 |

The summary statistics are practically identical between white and black ratings, which makes sense because these ratings are derived from a single pool of pairings: a player who is white in one game may be black in the next, so the two distributions will be very similar. The two measures of central tendency, mean and median, both give a value in the high 1600s. This rating, plus or minus a few dozen points, is about where we can place an “average” player.

Looking at overall proportions for game results in Figure 2 reveals nothing new.

Figure 2: Results proportions by piece colour.

White is well known to have an advantage in chess due to having the first move, which is confirmed in the above figure. Overall, white has an almost 5% greater chance of winning compared to black.

Investigating how the proportions of game outcomes change with respect to the average strength of both players reveals something interesting.

Figure 3: Result proportions by player strength.

The average strength of both players per game was binned in increments of 100 for ratings between 1000 and 2500, with two wider bins of \((0, 1000]\) and \((2500, \text{Inf})\) to account for the lowest and highest rated pairings (a code sketch of this binning follows Figure 4). There is a clear trend in Figure 3: at the weakest ratings there is a slightly higher proportion of draws, which falls to a minimum around a rating of 1500 and then increases as the players become stronger. This corresponds closely to white’s absolute win rate, which peaks around 1500 and declines as players become stronger. A similar plot is given in Figure 4, showing the relative advantage white has with respect to player strength, obtained by taking the difference between the white and black win rates in Figure 3.

Figure 4: Difference between white and black win rates by player strength.
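A sketch of the binning in pandas, assuming a DataFrame built from the hypothetical extraction columns above (the article’s own analysis was done in R):

```python
import numpy as np
import pandas as pd

games = pd.read_csv("games.csv")  # white_elo, black_elo, result

games["avg_rating"] = (games["white_elo"] + games["black_elo"]) / 2

# Bins of width 100 from 1000 to 2500, plus (0, 1000] and (2500, inf).
edges = [0] + list(range(1000, 2600, 100)) + [np.inf]
games["strength_bin"] = pd.cut(games["avg_rating"], bins=edges)

# Proportion of each result (1-0, 1/2-1/2, 0-1) within each strength bin.
outcome_props = games.groupby("strength_bin")["result"].value_counts(normalize=True)
```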

White has a small advantage among weaker pairings, the smallest advantage around average pairings, and a definite advantage among stronger pairings. An explanation of the results for stronger pairings is that better players know how to maximize their advantages, and thus the white pieces perform better as the players become stronger. However, since players of similar strength are most often paired, the player handling the black pieces is no slouch either. Thus the draw percentage increases with player strength, as black is better equipped to neutralize the white advantage and secure more draws. Draws also increase among better players because they can recognize when playing for a draw is their best option.

In contrast, draw rates are minimized among average players. The average chess player may play certain positions too aggressively and others too meekly. The average chess player’s relative lack of skill in recognizing positional advantages and disadvantages could lead to a greater proportion of decisive results. This lack of skill also corresponds to the average player’s performance with the white pieces in Figure 4: an average player lacks the ability to fully exploit the white pieces compared to a skilled player.

Finally, this leaves the weaker players, where the white pieces perform marginally better than with average players and there are slightly more draws. An explanation could be that weak players do not understand the game to the level of an average player; they do not know what they do not know. Consequently their moves have a more random quality, producing marginally higher white win rates and draw occurrences than among players with a greater understanding of the game.

How many rating points are the white pieces worth?

Given the demonstrated advantage in game outcomes for the white pieces, an interesting question is: how many rating points are the white pieces worth? The difference between the ratings of the black and white players, \(r_b - r_w\), is taken across all games and plotted against the average score among all games played with that rating difference. The average score represents a proportion: a value of 1 implies white won every game, a value of 0 implies black won every game, and a value of 0.5 implies that black and white scored equally. The Glicko equation in (1) implies no advantage for white: at equivalent ratings, white and black are given even (0.5) scoring chances, regardless of the rating deviation. These facts make 0.5 the ideal reference for calculating the equivalent rating advantage of playing the white pieces. The average score data is subset to a small interval about 0.5, along with the corresponding values of \(r_b - r_w\), and plotted against the rating difference in Figure 5; a code sketch of this computation follows the figure.

Figure 5: Difference between black and white ratings plotted against the average score among all games played with that rating difference, average score subset to values close to an even score of 0.5.
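A pandas sketch of this computation, continuing from the hypothetical `games` DataFrame (the 0.05 window about 0.5 is an illustrative choice):

```python
# Score each game from white's perspective: win = 1, draw = 0.5, loss = 0.
games["score"] = games["result"].map({"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0})
games["diff"] = games["black_elo"] - games["white_elo"]  # r_b - r_w

# Average score among all games at each 1-point rating difference.
avg_score = games.groupby("diff")["score"].mean()

# Keep only the rating differences where scoring is close to even.
near_even = avg_score[(avg_score - 0.5).abs() < 0.05]
```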

There is a strong relationship between average score and rating difference, which is unsurprising. What is notable is the rating difference at which scoring is equal: around 20 points. This means that when black outrates white by roughly 20 points the expected result is even. Fitting a simple linear regression \(\hat{s} = \hat{\beta}_0 + \hat{\beta}_1(r_b - r_w)\) to the relationship, setting the fitted score \(\hat{s}\) to 0.5, and solving for the rating difference leads to an exact number.

\[\begin{equation} \begin{split} x &= \frac{-0.02}{-9.11 \times 10^{-4}} \\ x &= 21.95 \end{split} \tag{2} \end{equation}\]
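The same arithmetic as a numpy sketch, reusing the hypothetical `near_even` series (the fitted coefficients will vary slightly with the subsetting choices):

```python
import numpy as np

# Fit average score ~ rating difference, then solve fitted score = 0.5.
slope, intercept = np.polyfit(near_even.index.to_numpy(dtype=float),
                              near_even.to_numpy(), 1)
white_advantage = (0.5 - intercept) / slope
# With an intercept near 0.52 and a slope near -9.11e-4, this gives ~21.95.
```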

If the regression is to be trusted, playing with the white pieces is an advantage equivalent to 21.95 rating points. Conversely, playing with the black pieces is a disadvantage equivalent to 21.95 rating points. This relationship is modelled across all players, but could vary with player strength. Figure 6 plots regression lines with respect to player strength.

Figure 6: Regression fits with respect to player strength predicting expected score given a rating difference.

Unsurprisingly, the rating point advantage the white pieces confer varies with player strength. The bins chosen were ultimately arbitrary, but the overall theme is evident: the white pieces confer a greater rating point advantage for weaker players than for stronger players. This result largely agrees with Figure 3, where the absolute white win rate is highest among weaker players. The relative advantage for the white pieces shown in Figure 4 is highest among stronger players, but the draw proportion also drastically increases; the higher number of draws skews the average score closer to 0.5.

Estimating the rating deviation

The results shown in Figure 5 were not informed by the rating deviation, which is not available in the dataset. It may, however, be estimated from the data. Once again the difference between the ratings of the black and white players, \(r_b - r_w\), is taken across all games and plotted against the average score among all games played with that rating difference. The Glicko equation for expected scoring is shown on the scatterplot for the minimum possible rating deviation (\(\sigma_w = \sigma_b = 30\)) and the maximum (\(\sigma_w = \sigma_b = 500\)).

Figure 7: Difference between black and white ratings plotted against the average score among all games played with that rating difference, and Glicko predictions for expected results with most and least certain rating deviations.

When the rating deviation is lowest the expected scoring curve is clearly S-shaped: a player who outrates another is given much higher scoring chances. In contrast, when the rating deviation is highest the curve is almost linear; higher rated players are still expected to score better, but the lower rated player is given good chances.

What values of \(\sigma_w\) and \(\sigma_b\) would best describe the actual distribution of scoring chances? The scoring chances were calculated by averaging the results at each 1-point change in the difference between player ratings; if the rating deviations were known, they could likewise be averaged to parametrize the Glicko curve. In the absence of the actual rating deviations, their mean can be estimated. It is assumed that the average rating deviation is identical across differences in player ratings; in other words, the mean rating deviation among all games played between players with a 0-point difference in rating is the same as among all games played between players with a 500-point difference.

Given equation (1), \(\sigma_w\) and \(\sigma_b\) can be represented as a single number \(\sigma_\text{pool} = \sqrt{\sigma_w^2 + \sigma_b^2}\), which parametrizes the Glicko curve. \(\sigma_\text{pool}\) is estimated as the value which minimizes the sum of squared errors between the Glicko curve and the data. Doing so results in \(\sigma_\text{pool} = 342.5\); the resulting fit is given in Figure 8, with the minimization sketched in code after the figure.

Figure 8: Difference between black and white ratings plotted against the average score among all games played with that difference, with best possible Glicko fit for the data.
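The minimization can be sketched with scipy, reusing the `g` function from the equation (1) sketch and the hypothetical `avg_score` series:

```python
import numpy as np
from scipy.optimize import minimize_scalar

diffs = avg_score.index.to_numpy(dtype=float)  # r_b - r_w
scores = avg_score.to_numpy()

def glicko_curve(diff, sigma_pool):
    # Equation (1) with sigma_pool standing in for sqrt(sigma_w^2 + sigma_b^2).
    return 1 / (1 + 10 ** (g(sigma_pool) * diff / 400))

def sse(sigma_pool):
    return np.sum((glicko_curve(diffs, sigma_pool) - scores) ** 2)

# Lichess bounds each deviation to [30, 500], so sigma_pool lies in
# [sqrt(2) * 30, sqrt(2) * 500].
fit = minimize_scalar(sse, bounds=(30 * 2 ** 0.5, 500 * 2 ** 0.5), method="bounded")
# fit.x comes out near 342.5 on the article's data.
```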

The fit is reasonable. It is possible that discarding the assumption of constant average rating deviations with respect to rating differences would result in a better fit.

Of course, values in a small neighbourhood about \(\sigma_\text{pool} = 342.5\) can be produced by multiple combinations of \(\sigma_w\) and \(\sigma_b\). A contour plot of the underlying function \(\sqrt{\sigma_w^2 + \sigma_b^2}\) illustrates this in Figure 9.

Figure 9: Contours of \(\sigma_\text{pool}\) about the most likely pooled rating deviation.

The range of values that produce numbers close to \(\sigma_\text{pool} = 342.5\) is highlighted in red in Figure 9. At first it may seem unlikely that the average rating deviation on Lichess lies between 244 and 245, since playing with regular frequency drives a player’s deviation close or equal to the minimum of 30. Lichess also considers all ratings with deviations above 110 provisional. However, the mean is sensitive to outliers, and it is possible a handful of players are skewing the results. Model misspecification could also be a cause; perhaps some important variable is not being considered. Regardless, this range of rating deviations produces the best fit for the available data.

Conclusions

Some interesting features of the Lichess rating data were highlighted. The ways in which the proportions of game outcomes (win, loss, draw) change with respect to player strength were illustrated. The equivalent advantage in rating points conferred by the white pieces was calculated across all player strengths to be about 22 points, and was shown to vary with player strength. Finally, the rating deviation was treated as an unknown parameter in the Glicko model and the mean rating deviation estimated from the data. This mean rating deviation was calculated to be well above the value Lichess uses to determine provisional rating status; outliers or model misspecification could be behind the apparently unlikely result.