
I followed Andrej Karpathy's lessons on deep learning to predict Premier League game outcomes
As the title indicates, this article is inspired by the teachings of Andrej. His Make More series is a gold mine, and if you are interested in deep learning at all, I would highly recommend you watch his videos. I am writing this article as part of my learning process; his videos are much more detailed and more clearly explained (Resource 1).
That said, if you would like to read a simple explanation of neural networks and their application to something as simple as the prediction of Premier League games, welcome to my podcast. I will follow Andrej’s path, going from N-Grams to MLPs and finishing with my application of MLPs to Premier League game predictions.
Bigrams and N-Grams
A bigram model (Resource 1, Resource 2) is a language model that relies on the previous word to predict the next. It is an instance of a more general model known as the N-Gram model, which, you guessed it, relies on the previous N - 1 words to predict the next.
One way these models work is by computing probability tables (by counting the relative frequency of words). In the case of the Bigram Model, if we have a vocabulary of, say, 100 words, the model consists of a 100 × 100 table of probabilities, where each entry represents the probability of one word following another, e.g.

$$P(\text{cats} \mid \text{the}) = \frac{C(\text{the}, \text{cats})}{C(\text{the})}$$

where C denotes a count over the training corpus.
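As a toy illustration (my own sketch, reusing the article's running example as a corpus), here is how such a table can be built by counting which word follows which:

```python
from collections import defaultdict

# Tiny toy corpus; in practice this would be a large collection of text.
corpus = [
    "the immigrants are eating the meat",
    "the immigrants are eating the cereal",
    "the dogs are eating the meat",
]

# Count how often each word follows each other word.
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

# Normalize each row of the table into probabilities P(next word | previous word).
probs = {
    prev: {nxt: c / sum(following.values()) for nxt, c in following.items()}
    for prev, following in counts.items()
}

print(probs["eating"])  # {'the': 1.0}
```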
In the case of N-Gram models, where N > 2, we simply take the previous N - 1 words to compute the probability of the next word:

$$P(w_t \mid w_{t-N+1}, \dots, w_{t-1}) = \frac{C(w_{t-N+1}, \dots, w_{t-1}, w_t)}{C(w_{t-N+1}, \dots, w_{t-1})}$$
Although this becomes impractical as N grows, we could build these models exactly like the Bigram Model, by counting frequencies (how many times in the corpus does the word cats follow the phrase the immigrants are eating the?).
In essence, with these models, we are trying to compute the joint probability distribution of a sequence. Using the Bigram model, we can compute such distributions by decomposing them using the chain rule of probability:

$$P(w_1, w_2, \dots, w_T) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_T \mid w_1, \dots, w_{T-1})$$

or, under the bigram assumption that each word depends only on the one before it:

$$P(w_1, w_2, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1})$$
How do we know they are eating the cats?
As we saw above, using N-Gram models we can compute the probability of a sequence. For a given context, this gives us a probability distribution over all the words in the vocabulary for the word that comes next. In table form, we would have something like the following:
| ? = Word in vocabulary | P(the, immigrants, are, eating, the, ?) |
|---|---|
| meat | 0.9 |
| cereal | 0.8 |
| dogs | 0.0001 |
| cats | 0.00001 |
| … | … |
Based on this distribution, we would choose the word meat as the most probable word to complete the sequence. And if we sort in ascending order because we clicked the wrong button, we might end up choosing cats. That is why we need a measure of how good the model is before we release it to the world.
How do we know the model is good?
Let’s start by defining what good means in this context. We want the model to be a good representation of reality. In the case of a bigram model, we would like it to give us higher probabilities for sequences of words that are in the data, and lower probabilities for sequences that are not in the data. As we saw earlier, we can compute the probabilities of entire sequences via the chain rule of probability:

$$P(w_1, w_2, \dots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-1})$$
To evaluate how well the model fits the data, we can consider the joint probability of the entire training dataset. The higher this probability, the better the model. However, multiplying many small probabilities (values between 0 and 1) results in very small numbers that approach zero, which makes them difficult to work with numerically.
To address this, we use log probabilities. The logarithm of probabilities allows us to sum the values instead of multiplying them, which is more computationally stable:
- If the model assigns a high probability to an event, the log of this probability will be close to 0.
- If the model assigns a low probability, the log of this probability will be a large negative number.
To make the measure more intuitive (where lower values indicate better performance), we take the negative of the log probabilities. Additionally, we average this value over the length of the dataset (or sequence) to compute the average negative log-likelihood (NLL), which gives us a more interpretable metric.
The Negative Log Likelihood (NLL) serves as a loss function: the lower the NLL, the better the model’s fit to the data.
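Spelled out for a dataset of N bigrams, the metric is:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})$$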
MLP: Multi-Layer Perceptron
The Big Flaw of Bigram Models
As you might have realized already, the biggest flaw of Bigram models is: how do they predict a sequence that is not in the training data? If the count C for a pair of words is zero,

$$C(w_{t-1}, w_t) = 0,$$

the model assigns that sequence a probability of zero. Or, if the count C for the conditioning word itself is zero,

$$C(w_{t-1}) = 0,$$

the probability is not even defined, since we would be dividing by zero.
There are some methods to overcome this, such as adding 1 to all the counts before computing the probabilities (Laplace smoothing). But that is not enough to solve the real problem. The real issue we are trying to solve is that of generalization: how, having seen only the training data, can the model predict sequences that don’t even appear in the data? Enter Neural Probabilistic Language Models.
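Concretely, add-one smoothing changes the count-based estimate to the following, where |V| is the size of the vocabulary:

$$P(w_t \mid w_{t-1}) = \frac{C(w_{t-1}, w_t) + 1}{C(w_{t-1}) + |V|}$$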
Multi-Layer Neural Networks
For a brief introduction to Neural Networks, visit the following Wikipedia entries:
- Perceptron Algorithm
- Multi-Layer Perceptron

Multi-layer Neural Networks are our solution to the generalization problem of Bigram models. In a nutshell, we try to solve the problem by:
- Associating each word in the vocabulary with a vector (the feature vector).
- Expressing the joint probability function of word sequences in terms of the feature vectors of the words in the sequence.
- Learning simultaneously the word feature vectors and the parameters of that probability function.
The feature vector
The feature vector represents different aspects of a word. For example, if we use a three-dimensional vector to represent the word dog then, depending on the rest of the training data, one dimension might capture the fact that a dog is an animal (because it appears in sequences very similar to those of words such as cats and rats), another might capture the fact that it is a singular word, and so on. This depends heavily on the training data we have. If we knew exactly what we wanted each of these dimensions to represent, we could use that prior knowledge to build the feature vectors by hand. In our example, we will assume we don’t know what each dimension should represent, and so we will learn them from the data.
How do we learn the vectors?
First, let’s give them a name. The collection of all the vectors that represent our vocabulary is called the embedding, and the vector space they live in is known as the embedding space. The first step is to initialize this embedding. It is essentially a matrix C where each row represents a word in our vocabulary and the columns represent the features. The number of features is something we control to tune the model: it is a hyper-parameter, and it determines the number of parameters (entries of the matrix) the embedding has. These parameters are initialized at random and then tuned to maximize the log-likelihood of the training data (more on this later).
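As a minimal sketch of what C looks like in PyTorch (my own illustration; the vocabulary size of 100 and the 3 features are arbitrary choices):

```python
import torch

vocab_size = 100   # number of words in the vocabulary (arbitrary here)
num_features = 3   # dimensionality of the feature vectors: a hyper-parameter

# The embedding matrix C: one randomly initialized row per word in the vocabulary.
C = torch.randn(vocab_size, num_features, requires_grad=True)

# Looking up the feature vectors of a context of word indices is just row indexing.
context = torch.tensor([17, 4, 42])
print(C[context].shape)  # torch.Size([3, 3]) -> 3 context words, 3 features each
```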
The joint probability function
Going back to the same example as in the N-Gram model, what we are trying to compute is a conditional probability:

$$P(\text{cats} \mid \text{the, immigrants, are, eating, the})$$
In the case of the N-Gram model, we did that by just counting the number of times that the phrase ‘the immigrants are eating the’ was followed by the word cats. In the case of the MLP, we will try to compute that conditional probability by learning a function f such that:

$$f(w_{t-n+1}, \dots, w_{t-1}) = P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$$
To be more precise, now that we express our words as feature vectors, we say that we need a function g that maps the sequence of feature vectors to a conditional probability distribution over the words in our vocabulary for the next word in the sequence. The output of this function is a vector that represents the conditional probability distribution, and element i of that vector represents the probability that the next word in the sequence is word i.
The Model (f disclosed)
The output of our model looks something like this:

$$y = U \tanh(Hx + d)$$
- y is a vector where each element represents the unnormalized log-probability of one word in the vocabulary.
- x is the concatenation of the feature vectors of the words in the context.
- H is the matrix of hidden-layer weights, with shape h × (|context| · m), where h is a hyper-parameter and m is the dimensionality of our embedding.
- d is a bias vector of dimension h.
- U is the matrix of hidden-to-output weights, with shape |V| × h.

Once we obtain y, we normalize it using the softmax operation to obtain the probabilities:

$$P(w_t = i \mid \text{context}) = \frac{e^{y_i}}{\sum_{j=1}^{|V|} e^{y_j}}$$
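Here is a minimal sketch of this computation in PyTorch (my own illustration; the sizes are arbitrary and the parameters are random rather than learned):

```python
import torch

vocab_size = 100    # |V|
m = 3               # embedding dimensionality
context_size = 5    # number of previous words used as context
h = 64              # hidden layer size (hyper-parameter)

C = torch.randn(vocab_size, m)          # embedding matrix
H = torch.randn(h, context_size * m)    # hidden-layer weights
d = torch.randn(h)                      # hidden-layer bias
U = torch.randn(vocab_size, h)          # hidden-to-output weights

context = torch.tensor([1, 12, 7, 3, 99])   # indices of the previous words
x = C[context].view(-1)                     # concatenated feature vectors, shape (context_size * m,)

y = U @ torch.tanh(H @ x + d)   # unnormalized log-probabilities over the vocabulary
p = torch.softmax(y, dim=0)     # conditional distribution over the next word
print(p.sum())                  # ~1.0
```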
The Learning
In the Bigram section, we defined a metric that lets us know whether the model is good enough: the Negative Log-Likelihood (NLL). When using an MLP, we can use the same metric to measure how good our model is:

$$\mathrm{NLL} = -\frac{1}{N} \sum_{t=1}^{N} \log P(w_t \mid w_{t-n+1}, \dots, w_{t-1})$$

So now we just have to minimize the NLL of the model over the training corpus and we will have our model.
The Training
To minimize the NLL of our model, we:
- Do a feedforward operation, which means taking the context and obtaining our conditional probability distribution over all the words in the vocabulary, as described in The Model section above.
- Calculate the NLL of the model and perform back-propagation, which simply means differentiating the NLL with respect to each parameter, working backwards through the network, and updating each parameter using the gradient descent algorithm.
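Here is a rough sketch of that loop in PyTorch, using the same toy sizes as the sketch above (the batch is random dummy data, and torch.nn.functional.cross_entropy computes exactly the average NLL of the softmax outputs):

```python
import torch
import torch.nn.functional as F

vocab_size, m, context_size, h = 100, 3, 5, 64
C = torch.randn(vocab_size, m, requires_grad=True)        # embedding matrix
H = torch.randn(h, context_size * m, requires_grad=True)  # hidden-layer weights
d = torch.randn(h, requires_grad=True)                    # hidden-layer bias
U = torch.randn(vocab_size, h, requires_grad=True)        # hidden-to-output weights
parameters = [C, H, d, U]

# Dummy batch: 32 contexts of word indices, and the index of the word that follows each.
contexts = torch.randint(0, vocab_size, (32, context_size))
targets = torch.randint(0, vocab_size, (32,))

learning_rate = 0.1
for step in range(100):
    # Feedforward: embed the contexts and compute the unnormalized log-probabilities.
    x = C[contexts].view(32, -1)        # (32, context_size * m)
    y = torch.tanh(x @ H.T + d) @ U.T   # (32, vocab_size)
    loss = F.cross_entropy(y, targets)  # average NLL over the batch

    # Back-propagation and a gradient-descent update of every parameter.
    for p in parameters:
        p.grad = None
    loss.backward()
    with torch.no_grad():
        for p in parameters:
            p -= learning_rate * p.grad
```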
How do we get from that to Premier League game models?
Well, a league for a given team is no more than a sequence of games. The result of each game encapsulates information about the team’s performance: they either Win, Lose or Draw. What does a win mean? Maybe it means a confidence boost, maybe it means the entire squad is fit, etc. The same goes for losses and draws. And given a context of, say, 5 games, we should have some level of information to predict the result of the next game. Drawing parallels to bigrams and NLP, we should be able to compute:

$$P(\text{result}_t \mid \text{result}_{t-5}, \dots, \text{result}_{t-1})$$
If we take the Premier League from 2025 until now, we have a big enough dataset to create an okay model that computes these conditional probability distributions. And that is exactly what I did.
The Dataset
The dataset consists of three input features, plus the target:
- Game results context: an n × 2 × 8 tensor containing, for each game, the results of the past 8 games of the home and away teams. Each entry is a 2 × 8 slice (two teams, past 8 games).
- Opponents context: again an n × 2 × 8 tensor where, for each game, we have a classification of the past 8 rivals of the home and away teams.
- Type of teams: an n × 2 matrix containing, for each game, a classification of the home team and the away team.
- Y: the result of the given game.
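To make the shapes concrete, here is what a dummy version of this dataset looks like in PyTorch (my own sketch; the number of games n is arbitrary, and the number of result and classification categories follows the embedding sizes in the model code below):

```python
import torch

n = 1000             # number of games (arbitrary for this sketch)
context_window = 8   # past games used as context for each team

# Game results context: results of the past 8 games for the home and away teams.
results_context = torch.randint(0, 4, (n, 2, context_window))
# Opponents context: classification of the past 8 rivals for the home and away teams.
opponents_context = torch.randint(0, 5, (n, 2, context_window))
# Type of teams: classification of the home and away teams.
team_types = torch.randint(0, 5, (n, 2))
# Y: the result of each game.
y = torch.randint(0, 4, (n,))
```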
The model
The model has the following parts:
- Embedding for results.
- Shared Embedding for team classification.
- Linear layer for the results context.
- Linear layer for the opponents context.
- Linear layer for the type of teams.
Here is the PyTorch implementation:
```python
import torch

# Context window: number of past games used per team (8, matching the dataset above).
context_window = 8


class Embedding(torch.nn.Module):
    # Wraps torch.nn.Embedding and flattens the looked-up vectors per example.
    def __init__(self, num_embeddings, embedding_dim):
        super(Embedding, self).__init__()
        self.embedding = torch.nn.Embedding(num_embeddings, embedding_dim)

    def forward(self, x):
        x = self.embedding(x)
        x = x.view(x.shape[0], -1)  # flatten to (batch, -1)
        return x


class Linear(torch.nn.Module):
    # A linear layer followed by a tanh non-linearity and batch normalization.
    def __init__(self, input_dim, output_dim):
        super(Linear, self).__init__()
        self.sequential = torch.nn.Sequential(
            torch.nn.Linear(input_dim, output_dim),
            torch.nn.Tanh(),
            torch.nn.BatchNorm1d(output_dim)
        )

    def forward(self, x):
        x = self.sequential(x)
        return x


class Model(torch.nn.Module):
    def __init__(
        self,
        embedding_dim_results,
        embedding_clusters,
        output_clusters,
        output_dim_teams,
        output_dim_results
    ):
        super(Model, self).__init__()
        # Embedding for the result tokens (4 categories).
        self.embedding_results = Embedding(4, embedding_dim_results)
        # Shared embedding for the team-classification clusters (5 categories),
        # used for both cluster-valued inputs below.
        self.embedding_clusters = Embedding(5, embedding_clusters)
        # Processes the cluster embeddings of a (2, context_window) input:
        # the classifications of each team's past opponents.
        self.linear_teams = Linear(embedding_clusters * 2 * context_window, output_dim_teams)
        # Processes the result embeddings of a (2, context_window) input:
        # the results of each team's past games.
        self.linear_results = Linear(embedding_dim_results * 2 * context_window, output_dim_results)
        # Processes the cluster embeddings of a (2,) input:
        # the classification of the home and away teams.
        self.linear_clusters = Linear(embedding_clusters * 2, output_clusters)
        # Combines the three branches into the output logits.
        self.logits = torch.nn.Linear(output_dim_teams + output_dim_results + output_clusters, 4)

    def forward(self, x_teams, x_results, x_clusters):
        # Each branch: look up embeddings, flatten, then pass through its linear block.
        x_teams = self.embedding_clusters(x_teams)
        x_teams = self.linear_teams(x_teams)
        x_clusters = self.embedding_clusters(x_clusters)
        x_clusters = self.linear_clusters(x_clusters)
        x_results = self.embedding_results(x_results)
        x_results = self.linear_results(x_results)
        # Concatenate the three branch outputs and map them to the logits.
        x = torch.cat([x_teams, x_results, x_clusters], dim=1)
        x = self.logits(x)
        return x
```
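To sanity-check the shapes, here is a minimal forward pass through the model above (the hyper-parameter values are arbitrary choices for illustration, not the ones used for training):

```python
# Arbitrary hyper-parameters, chosen only to illustrate the expected tensor shapes.
model = Model(
    embedding_dim_results=6,
    embedding_clusters=4,
    output_clusters=8,
    output_dim_teams=16,
    output_dim_results=16,
)

n = 32                                      # a batch of 32 games
x_teams = torch.randint(0, 5, (n, 2, 8))    # classifications of the past 8 opponents for both teams
x_results = torch.randint(0, 4, (n, 2, 8))  # results of the past 8 games for both teams
x_clusters = torch.randint(0, 5, (n, 2))    # classification of the home and away teams

logits = model(x_teams, x_results, x_clusters)
print(logits.shape)  # torch.Size([32, 4])
```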
## Is it any good?
It is decent at predicting wins and losses (60% accuracy). Awful at predicting draws (23% accuracy).