Explore the Model

Click on the pitch to plot a shot and get its predicted xG value. The direction of attack is left to right.

From Raw Data to Probabilistic xG Mapping

So, What Exactly is Expected Goals (xG)?

Expected Goals, or xG, is a powerful metric in football analytics that helps us quantify the quality of a goal-scoring opportunity. Instead of just counting shots, xG assigns a probability to each attempt, representing how likely it was to result in a goal. A shot with an xG of 0.1 is expected to be scored 10% of the time, while a tap-in might have an xG of 0.9 (a 90% chance).
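Because each xG value is a probability, the values for a set of shots can be added up to give the number of goals those chances would be expected to produce. The short sketch below illustrates that arithmetic with made-up xG values; the function names are hypothetical, and the "at least one goal" figure assumes the shots are independent, which is a simplification.

```python
import math


def cumulative_xg(shot_xgs):
    """Expected number of goals: the sum of the per-shot probabilities."""
    return sum(shot_xgs)


def prob_at_least_one_goal(shot_xgs):
    """Chance of scoring at least once, assuming independent shots."""
    return 1 - math.prod(1 - p for p in shot_xgs)


# Illustrative values only: three speculative efforts and one tap-in.
shots = [0.10, 0.05, 0.30, 0.90]
total = cumulative_xg(shots)
any_goal = prob_at_least_one_goal(shots)
print(f"Cumulative xG: {total:.2f}")
print(f"P(at least one goal): {any_goal:.3f}")
```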

Working with probabilities rather than raw shot counts allows a deeper, more nuanced analysis of team and player performance, one that looks past the luck inherent in the final scoreline. A team that consistently creates high-quality chances (a high cumulative xG) is likely performing well, even if the goals aren't flowing just yet. This interactive plotter is the front-end for a bespoke xG model, built from the ground up. Let's explore how it was made.

Step 1: Gathering the Raw Ingredients

Every good model starts with good data. The foundation of this xG model is a dataset of thousands of shots, scraped from the public football analytics website Understat. A custom Python script was developed to systematically gather data across several seasons from Europe's top leagues, including the Premier League, La Liga, and the Bundesliga. The script navigates the site, extracts detailed shot-level data from each match page (everything from the coordinates on the pitch to the game situation), and organises it for the next stage.
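The extraction step might look something like the sketch below. It is not the project's scraper: it assumes the match page embeds its shot data in a script tag as a hex-escaped JSON string assigned to a variable called `shotsData`, split into home (`h`) and away (`a`) lists. The function name `extract_shots` is hypothetical, and fetching the page itself is left out.

```python
import json
import re

# Assumed page layout (not confirmed by this article): a script tag containing
#   var shotsData = JSON.parse('<hex-escaped JSON>');
SHOTS_PATTERN = re.compile(r"shotsData\s*=\s*JSON\.parse\('(.*?)'\)")


def extract_shots(html: str) -> list[dict]:
    """Return every shot record found in the page, or [] if none is embedded."""
    match = SHOTS_PATTERN.search(html)
    if match is None:
        return []
    # Turn escapes such as \x7B back into the characters they stand for.
    decoded = match.group(1).encode("utf-8").decode("unicode_escape")
    data = json.loads(decoded)
    # Assumed shape: {"h": [home shots], "a": [away shots]}
    return data.get("h", []) + data.get("a", [])
```

Each returned dictionary can then be appended to a list of rows and written out for the cleaning stage.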

Step 2: Preparing the Data

Raw data is rarely perfect. The next step was a cleansing pass to handle inconsistencies and prepare the dataset for modelling. Once the data is clean, feature engineering can begin. The raw `X` and `Y` coordinates of a shot are useful, but we can derive more powerful predictive features from them. Key engineered features include:

  • Distance to Goal: A straightforward calculation of the Euclidean distance from the shot location to the centre of the goal.
  • Angle to Goal: The angle, in radians, that the goal subtends at the shot location, i.e. the angle between the vectors from the shot location to each goalpost. A wider angle generally means a better chance of scoring.
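The two features above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: it assumes the raw coordinates are normalised to the range 0 to 1 with the attacked goal at `x = 1`, and it assumes a 105 m by 68 m pitch. The 7.32 m goal width is the standard one. The function name `shot_features` is hypothetical.

```python
import math

PITCH_LENGTH = 105.0  # metres (assumed pitch size)
PITCH_WIDTH = 68.0    # metres (assumed pitch size)
GOAL_WIDTH = 7.32     # metres (standard goal width)


def shot_features(x: float, y: float) -> tuple[float, float]:
    """Return (distance in metres, angle in radians) for normalised x, y."""
    shot_x, shot_y = x * PITCH_LENGTH, y * PITCH_WIDTH
    goal_x, goal_y = PITCH_LENGTH, PITCH_WIDTH / 2

    # Euclidean distance from the shot to the centre of the goal.
    distance = math.hypot(goal_x - shot_x, goal_y - shot_y)

    # Vectors from the shot location to each goalpost.
    ax, ay = goal_x - shot_x, (goal_y - GOAL_WIDTH / 2) - shot_y
    bx, by = goal_x - shot_x, (goal_y + GOAL_WIDTH / 2) - shot_y

    # Angle between the two vectors, from their cross and dot products.
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    angle = abs(math.atan2(cross, dot))
    return distance, angle


# A shot from the penalty spot, 11 m out and level with the centre of the goal.
penalty_distance, penalty_angle = shot_features((PITCH_LENGTH - 11) / PITCH_LENGTH, 0.5)
```

Using `atan2` on the cross and dot products avoids the division-by-zero and rounding problems that the arccosine form can run into for shots taken very close to the goal line.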

Categorical data, like the 'Situation' (e.g., Open Play, Set Piece) and 'Shot Type' (e.g., Head, Right Foot), were converted into a numerical format using one-hot encoding, allowing the model to interpret them correctly.
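As a minimal illustration of one-hot encoding, the sketch below turns a situation and a shot type into 0/1 columns using plain Python. The category lists and column names are illustrative assumptions, not the project's actual values; in practice a library helper such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` does the same job.

```python
# Illustrative category lists; the real dataset may use different labels.
SITUATIONS = ["OpenPlay", "SetPiece", "Penalty"]
SHOT_TYPES = ["Head", "RightFoot", "LeftFoot"]


def one_hot(value, categories, prefix):
    """Return one 0/1 column per category, with a 1 only for `value`."""
    if value not in categories:
        raise ValueError(f"Unknown {prefix}: {value!r}")
    return {f"{prefix}_{category}": int(category == value) for category in categories}


# A headed penalty is unlikely, but it shows both encodings side by side.
row = {
    **one_hot("Penalty", SITUATIONS, "situation"),
    **one_hot("Head", SHOT_TYPES, "shot_type"),
}
```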

Step 3: Building the Intelligence

With a fully preprocessed dataset, the next phase was to train the predictive models. This project uses Logistic Regression, a robust and highly interpretable algorithm well-suited for binary classification tasks like predicting a goal (1) or no goal (0).

Rather than a single, one-size-fits-all model, four distinct models were trained to provide more specialised predictions. The backend selects the model that matches the inputs you provide in the 'Controls' panel:

  • Basic Model: Uses only location-based features (coordinates, distance, and angle).
  • Situation Model: Incorporates the game situation (e.g., Open Play, Penalty).
  • Shot Type Model: Adds information about how the shot was taken (e.g., Head, Left Foot).
  • Advanced Model: The most comprehensive model, using all available features for the most nuanced predictions.
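One plausible way to express that selection is sketched below. The function name, argument names, and returned labels are hypothetical, and the rule itself (choose by which optional inputs are present) is an assumption about how the backend's helper behaves.

```python
def select_model(situation=None, shot_type=None):
    """Pick a model name from whichever optional inputs were supplied."""
    has_situation = bool(situation)
    has_shot_type = bool(shot_type)
    if has_situation and has_shot_type:
        return "advanced"   # all available features
    if has_situation:
        return "situation"  # location + game situation
    if has_shot_type:
        return "shot_type"  # location + how the shot was taken
    return "basic"          # location-based features only
```

For example, `select_model(situation="OpenPlay")` returns `"situation"`, while calling it with no arguments falls back to `"basic"`.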

Each model was trained and fine-tuned using `scikit-learn`, with hyperparameters optimised through randomised search and cross-validation to ensure robust performance.
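The training step for the basic model could be sketched as below. The data here is synthetic, generated only so the example runs end to end; the search range for `C`, the number of iterations, and the scoring metric are assumptions, not the project's actual settings.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic shots: goals become less likely with distance, more likely with angle.
rng = np.random.default_rng(0)
n_shots = 2000
distance = rng.uniform(1, 40, n_shots)    # metres
angle = rng.uniform(0.05, 1.5, n_shots)   # radians
true_p = 1 / (1 + np.exp(-(1.0 - 0.15 * distance + 1.0 * angle)))
goal = (rng.random(n_shots) < true_p).astype(int)
X = np.column_stack([distance, angle])

# Randomised search over the regularisation strength with 5-fold cross-validation.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=5,
    scoring="neg_log_loss",  # rewards well-calibrated probabilities
    random_state=0,
)
search.fit(X, goal)
model = search.best_estimator_

# predict_proba returns [P(no goal), P(goal)] per row; column 1 is the xG.
close_xg, far_xg = model.predict_proba([[6.0, 1.0], [30.0, 0.2]])[:, 1]
```

Scoring on log loss rather than accuracy suits xG, where the predicted probability itself is the output and most shots are misses.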

Step 4: Pre-calculating Heatmaps for Visualisation

To offer a richer understanding of the model's predictions, the 'Heatmaps' page visualises xG across the entire pitch. Calculating these values on the fly for every user request would be computationally expensive and slow. To solve this, a dedicated Python script (`generate_heatmaps.py`) runs as the final step in the data pipeline.

This script iterates over a fine grid of coordinates covering the pitch and calculates the xG value at each point for every possible combination of game situation and shot type. The results are compiled into a single, optimised JSON file (`heatmaps.json`). When you select a filter on the Heatmaps page, the application simply fetches the corresponding pre-calculated grid, ensuring a fast and seamless experience.
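The grid loop described above can be sketched as follows. This is not the contents of `generate_heatmaps.py`: the category lists, the `situation|shot_type` key format, the grid resolution, and the `placeholder_xg` function (which stands in for a trained model) are all illustrative assumptions.

```python
import json
from itertools import product

SITUATIONS = ["OpenPlay", "SetPiece"]  # illustrative subset
SHOT_TYPES = ["Head", "RightFoot"]     # illustrative subset


def placeholder_xg(x, y, situation, shot_type):
    """Stand-in for a trained model: xG rises towards the goal at x = 1."""
    return round(0.5 * x * (1 - abs(y - 0.5)), 4)


def build_heatmaps(predict, steps=5):
    """Evaluate `predict` on a steps-by-steps grid for every category pair."""
    axis = [i / (steps - 1) for i in range(steps)]  # 0 to 1 inclusive
    heatmaps = {}
    for situation, shot_type in product(SITUATIONS, SHOT_TYPES):
        key = f"{situation}|{shot_type}"
        # One row per y value, one column per x value.
        heatmaps[key] = [
            [predict(x, y, situation, shot_type) for x in axis] for y in axis
        ]
    return heatmaps


heatmaps = build_heatmaps(placeholder_xg)
serialised = json.dumps(heatmaps)  # a real script would write this to heatmaps.json
```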

Step 5: Serving the Model with a Flask API

With the models trained and heatmaps generated, they need to be made available to the frontend. This is handled by a lightweight backend server built using Flask, a popular Python web framework. The Flask application (`app.py`) creates a REST API with two key endpoints:

  • /redshaw-xg/api/predict: This endpoint receives the shot details (coordinates, situation, etc.) from the interactive plotter, uses helper functions to determine the most appropriate model, preprocesses the inputs, and returns the live xG prediction.
  • /redshaw-xg/api/predict/grid: This endpoint serves the pre-generated heatmap data from the `heatmaps.json` file, allowing the frontend to visualise the model's predictions quickly.
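A stripped-down version of such an API might look like the sketch below. Only the two route paths come from the description above; the HTTP methods, the JSON field names (`x`, `y`, `situation`, `shot_type`, `xg`), the in-memory `HEATMAPS` dictionary standing in for `heatmaps.json`, and the constant-returning `predict_xg` placeholder are all assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the contents of heatmaps.json (illustrative values).
HEATMAPS = {"OpenPlay|RightFoot": [[0.02, 0.05], [0.10, 0.30]]}


def predict_xg(x, y, situation, shot_type):
    """Placeholder for model selection, preprocessing, and prediction."""
    return 0.1


@app.route("/redshaw-xg/api/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    try:
        x = float(payload["x"])
        y = float(payload["y"])
    except (KeyError, TypeError, ValueError):
        return jsonify(error="x and y are required numbers"), 400
    xg = predict_xg(x, y, payload.get("situation"), payload.get("shot_type"))
    return jsonify(xg=xg)


@app.route("/redshaw-xg/api/predict/grid", methods=["GET"])
def grid():
    # Serve the pre-generated grids without running any model.
    return jsonify(HEATMAPS)
```

Flask's built-in test client (`app.test_client()`) can exercise both routes without starting a server, which is handy for checking the request and response shapes.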

Step 6: Deployment to the Web

To bring this all together into a publicly accessible web application, the project is deployed across two services. The backend Flask API is hosted on Render, a cloud platform designed for easily deploying and scaling web applications. The frontend (i.e. all the HTML, CSS, and JavaScript files you're interacting with right now) is served as a static site using GitHub Pages.

Step 7: Explore the Code Yourself

This entire project, from the data pipeline to the interactive frontend, is open source. If you're interested in the technical details or just want to see the code, you can find the full repository on GitHub.

View on GitHub