Chess Opening Analysis with a Lichess Dataset

Adam Comer

Updated Feb 16, 2021

Introduction

Since the start of the pandemic, I found a new interest in the game Chess and have been an avid player online in my free time. As a student in statistics, I’ve always wanted to apply techniques I’ve learned in class to the game. Luckily, I’ve been assigned a “Professional Development” assignment from my STA303 Professor to build or write something to enhance my image as a statistician.

For this project, I am going to look at the relationship between knowledge in the opening phase of the game and rating of players. Many chess players will refer to this as “Opening Theory” or “Theory.” Before starting this project, my hypothesis was players memorize more Opening Theory to give themselves an advantage as they leave the opening phase of the game. As a player, this felt intuitive to me because I found myself memorizing moves after checking my games with a Chess Engine or searching a database for similar games played by Master level players. This project will test if players online are doing the same thing with real chess game data from Lichess.

Disclaimer: I’m not a strong Chess player. And you can verify this by looking at my Chess.com profile and games. That being said, I’m not going to be analyzing individual games or positions here. Instead, I’m going to base my analysis on public games from Lichess and games played by Masters.

Data

First, I need to get some chess games data. For this type of analysis, I need games from players of all rating ranges. Lichess, a open source Chess server, is nice enough to give their data back to the community. I used the set of games from April 2017 because it was the smallest dataset with timestamps(Although, I never used them).

To analyzing openings, I needed an Opening Book to compare with my games. This was a lot harder than I thought it was going to be. Many books are proprietary or only come in physical form. After searching for a few days, I eventually found some opening books in a nice machine readable format. Unfortunately, the names were all different but I worked around that by matching the moves from the game to the moves in the book.

Initially, I stopped my data collection here. But as did the further analysis, I realized my definition of “Opening Theory” was extremely limited. Most if not all strong players have opening preparations that extend far beyond the 4-6 moves in a standard opening book. In the recent Opera Euro Rapid Finals, Wesley So attributes his loss in round 2 to missing the move 21. Qxe4. I wanted to look deeper into these online games and see how players at every level are copying these Masters. To do this, I used a dataset of Master Level games to check for similarities.

Due to computational limits of my laptop, I used the first 200,000 master games to construct a Trie of all of the moves. Effectivity, I was treating the search of opening moves as substring search. This tree of games and their moves is referred to as the Master Game Index.

To build my nice DataFrame, I extracted the first 20,000 Blitz and Bullet games from the Lichess Dataset and counted the number of moves that matched the Opening Book and the Master Game Index. I took the header information in from the games, fields such as player rating or time format, and made them into columns. A small preview can be seen below and the full dataset link can be found in the Appendix.

Aside: This project involved a lot more Data Engineering then I anticipated. As a student, I seldomly come across projects that force you efficiently index and search these large dataset. If you teach a 300+ level statistics class, try giving an assignment that has Data Engineering sections. Most data that is public or in the workplace was not formatted for your specific use case. And teaching students how to process, index, and integrate multiple data sources with varying formats will pay dividends in future careers.

games = pd.read_csv('data/games.csv', index_col='ID', delimiter=',')
games.head()
                             GameType          White  WhiteElo  WhiteRatingDiff         Black  BlackElo  BlackRatingDiff Result                                            Opening  ECO TimeControl  OpeningMoves  MasterMoves
ID                                                                                                                                                                                                                            
https://lichess.org/tGpzk7yJ    Blitz   calvinmaster      2186                4  dislikechess      1907               -4    1-0          King's Gambit Accepted: Schallopp Defense  C34       180+0             6            8
https://lichess.org/LzvBtZ93   Bullet    Gregster101      1385               10     flavietta      1339               -9    1-0       King's Gambit Accepted: King's Knight Gambit  C34       120+1             5            9
https://lichess.org/TR5upkT0    Blitz          Napen      1905                8       Volcoom      1836               -9    1-0  Ruy Lopez: Schliemann Defense, Schoenemann Attack  C63       180+0             7           18
https://lichess.org/UHBSjQt4   Bullet   bRoCk-LeSnAr      1670              -21        czarny      1812                6    0-1                                  Hungarian Opening  A00       120+1             1            1
https://lichess.org/Qh8ynBVM    Blitz  farhad_karaji       988               -7     micro-dog      1145                7    0-1                               Van't Kruijs Opening  A00       180+0             1            5

[20000 rows x 12 columns]

Analysis

This sections is split into two parts for the two most popular formats of chess online: Blitz and Bullet Chess. Blitz Chess games usually allocate 3 or 5 minutes per player and Bullet Chess games usually allocate 1 or 2 minutes per player. All steps of the analysis are repeated for both game types.

Blitz

First, the games are filtered to only Blitz matches. This leaves 13,150 games in the dataset.

# 13150 Games remaining
blitz_games = games[games['GameType'] == 'Blitz']

Next, the distribution of games needs to be checked. Ideally, the players ratings should be representative of the population, which has a nice bell-shaped curve.

fig, ax = plot.subplots()
plot.hist(blitz_games['WhiteElo'], bins=50, color='#0062ff')

plot.title('Blitz ELO with White')
ax.set_xlabel('Player ELO')
ax.set_ylabel('Games')

plot.show()

Histogram of ELO with the White Pieces for Blitz games

For the player with the white pieces, this sample of games looks representative.

fig, ax = plot.subplots()
plot.hist(blitz_games['BlackElo'], bins=50, color='#0062ff')

plot.title('Blitz ELO with Black')
ax.set_xlabel('Player ELO')
ax.set_ylabel('Games')

plot.show()

Histogram of ELO with the Black Pieces for Blitz games

Again, for the player with the black pieces, this sample of games looks representative.

Next, the games should be balanced. That is, the ratings between the players should be similar. Games where the difference is too large will guarantee one side will win, potentially breaking future analysis.

fig, ax = plot.subplots()
plot.hist(blitz_games['WhiteElo'] - blitz_games['BlackElo'], bins=50, color='#0062ff')

plot.title('Blitz ELO Difference')
ax.set_xlabel('ELO Difference')
ax.set_ylabel('Games')

plot.show()

Histogram of of the ELO Difference between the Players for Blitz games

Most of the games fall in the fair range of ±200 rating points. There are a few outliers bleeding into the ±500 point range.

# 11562 Games remaining
blitz_games = blitz_games[abs(blitz_games['WhiteElo'] - blitz_games['BlackElo']) < 200]

After clearing out the games with a wide rating differential, there are 11,562 games remaining in this dataset.

Now, I will look at the relationship between Opening Knowledge and player rating.

agg = blitz_games.groupby(['OpeningMoves']).agg(['count'])

fig, ax = plot.subplots()
plot.bar(agg.index, agg[('Opening', 'count')], color='#0062ff')

plot.title('Blitz Games # of Book Moves')
ax.set_xlabel('# of Book Moves')
ax.set_ylabel('Games')

plot.show()

Bar graph of the number of games played grouped by the number of Book moves featured for Blitz games

Grouping by the number of Book Moves featured, a nice gradually descending distribution is apparent.

Next, I will do a regression analysis of the rating of the players and the number of Book moves featured in a game.

blitz_games['ELOAverage'] = (blitz_games['WhiteElo'] + blitz_games['BlackElo']) / 2

res = stats.linregress(blitz_games['OpeningMoves'], blitz_games['ELOAverage'])
Regression Results:
  slope: 12.878527255330097
  intercept: 1576.655656676459, 
  rvalue: 0.08137428579792201, 
  pvalue: 1.8927751604955526e-18, 
  stderr: 1.4670922151454777, 
  intercept_stderr: 5.938045602789431
fig, ax = plot.subplots()
plot.title('Blitz Game ELO vs. # of Book Moves')
ax.set_xlabel('# of Book Moves')
ax.set_ylabel('Average ELO of Players')

plot.scatter(blitz_games['OpeningMoves'], blitz_games['ELOAverage'], alpha=0.05, color='#0062ff')
plot.plot(blitz_games['OpeningMoves'], res.intercept + res.slope * blitz_games['OpeningMoves'], color='#fa4d56', 
          label=f'ELO = {round(res.slope, 2)} * BookMoves + {round(res.intercept, 2)}')

plot.show()

Scatter plot of Average Player ELO vs number of Book moves featured for Blitz games

Looking at our regression, there is a slight positive relationship that is significant (p < 0.05) but has a small R^2 of 0.00662. I expected the variance to be high but not this high.

The first version of my analysis stopped here. After sitting on it for a few days, I realized that players don’t look at the Encyclopedia of Chess Openings for ideas, they use a database of games from Master level players. After adding this extra field, I will do a second regression on the player ratings and the number of moves from Master games featured.

fig, ax = plot.subplots()
plot.bar(agg.index, agg[('Opening', 'count')], color='#0062ff')

plot.title('Blitz Games # of Master Moves')
ax.set_xlabel('# of Book Moves')
ax.set_ylabel('Games')

plot.show()

Bar graph of the number of games played grouped by the number of Master moves featured for Blitz games

Comparing this bar chart with the previous, players are using longer lines and more ideas from Master games than just Book moves alone. This gives us good evidence that “Opening Theory” consists of more than a 2-6 move Opening Book.

res = stats.linregress(blitz_games['MasterMoves'], blitz_games['ELOAverage'])
Regression Results:
  slope: 30.674892863592703
  intercept: 1404.7621823835452
  rvalue: 0.37307786524629527
  pvalue: 0.0
  stderr: 0.7095108471201739
  intercept_stderr: 5.61993939067637
fig, ax = plot.subplots()
plot.title('Blitz Game ELO vs. # of Master Moves')
ax.set_xlabel('# of Master Moves')
ax.set_ylabel('Average ELO of Players')

plot.scatter(blitz_games['MasterMoves'], blitz_games['ELOAverage'], alpha=0.05, color='#0062ff')
plot.plot(blitz_games['MasterMoves'], res.intercept + res.slope * blitz_games['MasterMoves'], color='#fa4d56', 
          label=f'ELO = {round(res.slope, 2)} * MasterMoves + {round(res.intercept, 2)}')

plot.show()

Scatter plot of Average Player ELO vs number of Master moves featured for Blitz games

Looking at our second regression, there is a positive relationship that is significant (p < 0.05) but has a modest R^2 of 0.13919. This was the relationship I hypothesized in the introduction.

Bullet

For the Bullet section, I did the same analysis but the graphs and conclusions were essentially identical to the Blitz section. I included the graphs in the Appendix for completeness.

First, the games are filtered to Bullet format and the games with less than a ±200 rating differential.

# 6850 Games remaining
bullet_games = games[games['GameType'] == 'Bullet']
# 5555 Games remaining
bullet_games = bullet_games[abs(bullet_games['WhiteElo'] - bullet_games['BlackElo']) < 200]

Next, I will do the same regressions in the Blitz Section.

bullet_games['ELOAverage'] = (bullet_games['WhiteElo'] + bullet_games['BlackElo']) / 2

res = stats.linregress(bullet_games['OpeningMoves'], bullet_games['ELOAverage'])
Regression Results:
  slope: -18.938642094441818
  intercept: 1746.6965926816904
  rvalue: -0.11197971306496589
  pvalue: 5.736604103545491e-17
  stderr: 2.255306260261277
  intercept_stderr: 7.668411448527872
fig, ax = plot.subplots()
plot.title('Bullet Game ELO vs. # of Book Moves')
ax.set_xlabel('# of Book Moves')
ax.set_ylabel('Average ELO of Players')

plot.scatter(bullet_games['OpeningMoves'], bullet_games['ELOAverage'], alpha=0.05, color='#0062ff')
plot.plot(bullet_games['OpeningMoves'], res.intercept + res.slope * bullet_games['OpeningMoves'], color='#fa4d56', 
          label=f'ELO = {round(res.slope, 2)} * BookMoves + {round(res.intercept, 2)}')

plot.show()

Scatter plot of Average Player ELO vs number of Book moves featured for Bullet games

Initially, this regression and graph caught me off guard. Why would playing fewer book moves correlate with a higher rating? To verify this this relationship, I did a second regression of player rating and Master moves.

res = stats.linregress(bullet_games['MasterMoves'], bullet_games['ELOAverage'])
Regression Results:
  slope: 8.206545855725649
  intercept: 1639.3828293657148
  rvalue: 0.09319956154540714
  pvalue: 3.4044901651001904e-12
  stderr: 1.1764901073637415
  intercept_stderr: 8.150519297257242
fig, ax = plot.subplots()
plot.title('Blitz Game ELO vs. # of Master Moves')
ax.set_xlabel('# of Master Moves')
ax.set_ylabel('Average ELO of Players')

plot.scatter(bullet_games['MasterMoves'], bullet_games['ELOAverage'], alpha=0.05, color='#0062ff')
plot.plot(bullet_games['MasterMoves'], res.intercept + res.slope * bullet_games['MasterMoves'], color='#fa4d56', 
          label=f'ELO = {round(res.slope, 2)} * MasterMoves + {round(res.intercept, 2)}')

plot.show()

Scatter plot of Average Player ELO vs number of Master moves featured for Bullet games

When looking at the number of master moves played and player rating, the relationship is greatly diminished compared to the Blitz section. This confirms the trend in the prior regression.

To a non-chess player this relationship might be very unexpected. If opening moves are typically memorized, why are players have trouble playing Book moves or Master moves? In the Discussion section, I will explore this finding more.

Discussion

In the Analysis, I looked at relationship of moves played from an Opening Book and Master Games and player rating for Blitz and Bullet time formats. I verified that player ratings were representative and normally distributed, matching what I expected. Additionally, I verified that there was a downward trend of Book and Master moves featured past the limited set of first moves.

In the Blitz regression analysis, a weak positive connection between the number of Book moves and rating was present. And a strong positive connection existed between the number of Master moves and rating. There are two possible ways to interpret this result. 1. Strong players can generally find the top moves that Masters play. 2. Strong players are using moves from a database of Master games to improve. There is probably a mix of both going here. Good moves are objectively good in perfect information game like Chess and players are looking at what others are doing to improve their game.

In the Bullet regression analysis, a weak negative negative connection between the number of Book moves and rating was present. And a weak positive connection existed between the number of Master moves and rating. This was an unexpected result. Avid online chess players will tell you that Bullet Chess is a fundamentally different game than Blitz Chess. Many games are won or lost on time, not checkmate. This favors the player who can play faster, not better. To increase the amount of time your opponent consumes per move, players try to complicate the position and throw unusual moves at their opponent.

In the recent IM not a GM tournament, International Master(IM) Levy Rozman recounts his opening preparation for the bullet section consisted of the unusual move 1.b3, featured below.

Chessboard with 1. b3 played

In the Chess.com Move Database, 1.b3 is the 6th most common first move with 12,327 of 2,785,695 games or 0.44% of the games. IM Rozman argues his move was a tactic to confuse his opponent in the Bullet section. And it was probably the reason he won the match.

The lack of time means an opponent doesn’t have time to think about how a strange move might affect the position long term, they need to keep moving just to not lose on time. The data is revealing this common understanding of the time control. Players are playing sub-optimal moves designed to make their opponent think and put them into time trouble.

Conclusion

Overall, I analyzed 20,000 Lichess Bullet and Blitz games to compare opening knowledge and player rating. In the Blitz section, my hypothesized positive relationship of Book and Master moves compared to player rating was found. In the Bullet section, my analysis found the opposite relationship with Book moves and a small positive relationship with Master moves. In the Discussion section, I brought in an analysis from a professional Blitz and Bullet tournament player, IM Levy Rozman, to show how throwing odd moves at your opponent can cause them to burn time on their clock for a winning advantage. For me, this whole project is good motivation to study the Master games from a Chess Book I recently picked up.

Appendix

All resources used to create this dataset and analysis.

Datasets

Master Game Index Builder

import pickle
import chess.pgn


def main(pgn):
    move_idx = {}

    counter = 0
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        counter += 1

        pos = move_idx
        for m in game.mainline_moves():
            m = m.uci()
            if m in pos:
                pos = pos[m]
            else:
                pos[m] = {}
                pos = pos[m]

        if counter == 200000:
            break

    with open('data/master_game_index.pickle', 'wb') as handle:
        pickle.dump(move_idx, handle)


if __name__ == '__main__':
    pgn = open('data/mb-3.45.pgn', encoding='ISO-8859-1')
    main(pgn)

Dataset Builder

import pickle
from typing import Dict

import chess.pgn
import chess.polyglot
from chess.pgn import Game
import pandas as pd


def get_book():
    a_eco = pd.read_csv('data/a.tsv', index_col='moves', delimiter='\t')
    b_eco = pd.read_csv('data/b.tsv', index_col='moves', delimiter='\t')
    c_eco = pd.read_csv('data/c.tsv', index_col='moves', delimiter='\t')
    d_eco = pd.read_csv('data/d.tsv', index_col='moves', delimiter='\t')
    e_eco = pd.read_csv('data/e.tsv', index_col='moves', delimiter='\t')

    return pd.concat([a_eco, b_eco, c_eco, d_eco, e_eco])


def get_master_move_index():
    index_file = open('data/master_game_index.pickle', 'rb')
    idx = pickle.load(index_file)
    index_file.close()

    return idx


def read_games():
    book = get_book()
    master_move_idx = get_master_move_index()
    counter = 0

    pgn = open('data/lichess_db_standard_rated_2017-04.pgn')

    games = []
    while True:
        offset = pgn.tell()
        headers = chess.pgn.read_headers(pgn)
        if headers is None:
            break

        if headers['Event'] != 'Rated Blitz game' and headers['Event'] != 'Rated Bullet game':
            chess.pgn.skip_game(pgn)
            continue
        if 'WhiteElo' not in headers or 'BlackElo' not in headers:
            chess.pgn.skip_game(pgn)
            continue
        if 'WhiteRatingDiff' not in headers or 'BlackRatingDiff' not in headers:
            chess.pgn.skip_game(pgn)
            continue

        pgn.seek(offset)
        game = chess.pgn.read_game(pgn)

        game = process_game(book, master_move_idx, game)
        if game is None:
            continue

        if counter == 20000:
            break

        games.append(game)
        counter += 1

    return games


def process_game(book: pd.DataFrame, master_move_idx: Dict[str, Dict], game: Game):
    opening = None
    moves = []

    moves_key = ''
    for m in game.mainline_moves():
        moves_key += m.uci()

        if moves_key in book.index:
            opening = book.loc[moves_key]['name']
            moves = moves_key
        else:
            break
        moves_key += ' '

    master_moves = 0
    pos = master_move_idx
    for m in game.mainline_moves():
        m = m.uci()
        if m in pos:
            master_moves += 1
            pos = pos[m]
        else:
            break

    if opening is None:
        print(f'Skipped: {game.headers["Opening"]}')
        return None

    return [
        game.headers['Site'],
        'Bullet' if game.headers['Event'] == 'Rated Bullet game' else 'Blitz',
        game.headers['White'],
        game.headers['WhiteElo'],
        game.headers['WhiteRatingDiff'],
        game.headers['Black'],
        game.headers['BlackElo'],
        game.headers['BlackRatingDiff'],
        game.headers['Result'],
        opening,
        game.headers['ECO'],
        game.headers['TimeControl'],
        len(moves.split(' ')),
        master_moves
    ]


if __name__ == '__main__':
    games = read_games()

    games = pd.DataFrame(games, columns=['ID', 'GameType', 'White', 'WhiteElo', 'WhiteRatingDiff', 'Black', 'BlackElo', 'BlackRatingDiff', 'Result', 'Opening', 'ECO', 'TimeControl', 'OpeningMoves', 'MasterMoves'])
    games = games.set_index('ID')
    games.to_csv('data/games.csv')

Graphs

Blitz

Bullet