Click on the pitch to plot a shot and get its predicted xG value. The direction of attack is left to right.
Expected Goals, or xG, is a powerful metric in football analytics that helps us quantify the quality of a goal-scoring opportunity. Instead of just counting shots, xG assigns a probability to each attempt, representing how likely it was to result in a goal. A shot with an xG of 0.1 is expected to be scored 10% of the time, while a tap-in might have an xG of 0.9 (a 90% chance).
This allows for a deeper, more nuanced analysis of team and player performance, moving beyond the luck inherent in the final scoreline. A team that consistently creates high-quality chances (a high cumulative xG) is likely performing well, even if the goals aren't flowing just yet. This interactive plotter is the front-end for a bespoke xG model, built from the ground up. Let's explore how it was made.
Every good model starts with good data. The foundation of this xG model is a rich dataset of thousands of shots, scraped from the public football analytics website Understat. A custom Python script was developed to systematically gather data across several seasons from Europe's top leagues, including the Premier League, La Liga, and the Bundesliga. This script navigates the site, extracts detailed shot-level data from each match page (capturing everything from the coordinates on the pitch to the game situation), and organises it for the next stage.
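The scraper itself isn't reproduced here, but the sketch below shows the general approach, assuming (as Understat's match pages do at the time of writing) that the shot data is embedded in a `shotsData` JavaScript variable as a hex-escaped JSON string. The function name and parsing details are illustrative rather than lifted from the project:

```python
import json
import re

import requests


def fetch_match_shots(match_id: int) -> list[dict]:
    """Fetch shot-level data for a single Understat match.

    Assumes the match page embeds its shots in a JavaScript variable of the
    form `var shotsData = JSON.parse('...')`, where the payload is a
    hex-escaped JSON string.
    """
    url = f"https://understat.com/match/{match_id}"
    html = requests.get(url, timeout=10).text

    # Pull out the escaped JSON payload passed to JSON.parse(...)
    match = re.search(r"shotsData\s*=\s*JSON\.parse\('(.*?)'\)", html)
    if match is None:
        raise ValueError(f"No shot data found for match {match_id}")

    # The payload uses \xNN escapes, so decode it before parsing as JSON
    raw = match.group(1).encode("utf-8").decode("unicode_escape")
    data = json.loads(raw)

    # Understat groups shots by home ('h') and away ('a') team
    return data["h"] + data["a"]
```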
Raw data is rarely perfect. The next step was a rigorous cleansing process to handle inconsistencies and prepare the dataset for modelling. Once the data is clean, the real magic begins: feature engineering. The raw `X` and `Y` coordinates of a shot are useful, but more powerful predictive features can be derived from them, most notably the shot's distance from the goal and the angle the goal mouth presents from the shot's location, as sketched below.
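As a rough illustration (not the project's exact code), here is how those two features could be computed, assuming Understat-style fractional coordinates and standard pitch and goal dimensions:

```python
import numpy as np

# Assumed conventions: coordinates are fractions of the pitch, with X measured
# towards the goal being attacked. Pitch and goal dimensions are standard
# values, not taken from the project itself.
PITCH_LENGTH = 105.0   # metres
PITCH_WIDTH = 68.0     # metres
GOAL_WIDTH = 7.32      # metres


def shot_features(x: float, y: float) -> tuple[float, float]:
    """Return (distance to the goal centre, angle subtended by the goal mouth)."""
    dx = (1.0 - x) * PITCH_LENGTH   # distance from the goal line, in metres
    dy = (y - 0.5) * PITCH_WIDTH    # lateral offset from the goal centre

    distance = np.hypot(dx, dy)

    # Angle between the lines from the shot location to each goal post
    post1 = np.arctan2(dy + GOAL_WIDTH / 2, dx)
    post2 = np.arctan2(dy - GOAL_WIDTH / 2, dx)
    angle = abs(post1 - post2)

    return float(distance), float(angle)
```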
Categorical data, like the 'Situation' (e.g., Open Play, Set Piece) and 'Shot Type' (e.g., Head, Right Foot), were converted into a numerical format using one-hot encoding, allowing the model to interpret them correctly.
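In pandas, for example, this can be done with a single call to `get_dummies`; the column names and values below are illustrative stand-ins for the real dataset's fields:

```python
import pandas as pd

# Illustrative shot records; 'situation' and 'shot_type' mirror the kinds of
# categorical fields described above, not the project's exact column names.
shots = pd.DataFrame({
    "situation": ["OpenPlay", "SetPiece", "OpenPlay"],
    "shot_type": ["RightFoot", "Head", "LeftFoot"],
})

# One-hot encode each categorical column into 0/1 indicator columns
encoded = pd.get_dummies(shots, columns=["situation", "shot_type"])
print(encoded)
```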
With a fully preprocessed dataset, the next phase was to train the predictive models. This project uses Logistic Regression, a robust and highly interpretable algorithm well-suited for binary classification tasks like predicting a goal (1) or no goal (0).
Rather than a single, one-size-fits-all model, four distinct models were trained to provide more specialised predictions. The backend intelligently selects the best model based on the inputs you provide in the 'Controls' panel.
Each model was trained and fine-tuned using `scikit-learn`, with hyperparameters optimised through randomised search and cross-validation to ensure robust performance.
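A condensed sketch of what that training and tuning step might look like is shown below; the placeholder data, feature count, and search range are illustrative rather than the project's actual configuration:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# X: preprocessed feature matrix (distance, angle, one-hot situation/shot type, ...)
# y: 1 if the shot was a goal, 0 otherwise. Both are placeholders here.
rng = np.random.default_rng(42)
X = rng.random((1000, 6))
y = (rng.random(1000) < 0.1).astype(int)

# Logistic regression tuned via randomised search with cross-validation
search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=5,
    scoring="neg_log_loss",   # proper scoring rule for probability estimates
    random_state=42,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```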
To offer a richer understanding of the model's predictions, the 'Heatmaps' page visualises xG across the entire pitch. Calculating these values on the fly for every user request would be computationally expensive and slow. To solve this, a dedicated Python script (`generate_heatmaps.py`) runs as the final step in the data pipeline.
This script iterates over a fine grid of coordinates covering the pitch and calculates the xG value at each point for every possible combination of game situation and shot type. The results are compiled into a single, optimised JSON file (`heatmaps.json`). When you select a filter on the Heatmaps page, the application simply fetches the corresponding pre-calculated grid, ensuring a fast and seamless experience.
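The real `generate_heatmaps.py` isn't reproduced here, but a simplified sketch of the idea is shown below, with a stand-in prediction function and illustrative categories in place of the trained models and the project's exact combinations:

```python
import json
from itertools import product

import numpy as np

# Illustrative categories; the real script's combinations may differ.
SITUATIONS = ["OpenPlay", "SetPiece", "FromCorner"]
SHOT_TYPES = ["RightFoot", "LeftFoot", "Head"]


def predict_xg(x: float, y: float, situation: str, shot_type: str) -> float:
    """Stand-in for the trained model: xG decays with distance from the goal."""
    return float(np.exp(-5 * np.hypot(1.0 - x, y - 0.5)))


def build_heatmaps(grid_size: int = 50) -> dict:
    """Evaluate xG on a grid of pitch coordinates for every category combination."""
    xs = np.linspace(0.0, 1.0, grid_size)
    ys = np.linspace(0.0, 1.0, grid_size)

    heatmaps = {}
    for situation, shot_type in product(SITUATIONS, SHOT_TYPES):
        grid = [[predict_xg(x, y, situation, shot_type) for x in xs] for y in ys]
        heatmaps[f"{situation}|{shot_type}"] = grid
    return heatmaps


if __name__ == "__main__":
    # Compile every pre-calculated grid into a single JSON file
    with open("heatmaps.json", "w") as f:
        json.dump(build_heatmaps(), f)
```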
With the models trained and heatmaps generated, they need to be made available to the frontend. This is handled by a lightweight backend server built using Flask, a popular Python web framework. The Flask application (`app.py`) creates a REST API with two key endpoints:
/redshaw-xg/api/predict
: This endpoint receives the shot details (coordinates, situation, etc.) from the interactive plotter, uses helper functions to determine the most appropriate model, preprocesses the inputs, and returns the live xG prediction.

/redshaw-xg/api/predict/grid
: This endpoint serves the pre-generated heatmap data from the `heatmaps.json` file, allowing the frontend to visualise the model's predictions quickly.

To bring this all together into a publicly accessible web application, the project is deployed across two services. The backend Flask API is hosted on Render, a cloud platform designed for easily deploying and scaling web applications. The frontend (i.e. all the HTML, CSS, and JavaScript files you're interacting with right now) is served as a static site using GitHub Pages.
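For illustration, a stripped-down sketch of how two such endpoints could be wired up in Flask is shown below. Only the route paths come from the description above; the helper functions, dummy model, and response shapes are stand-ins rather than the project's actual code:

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Pre-generated heatmap grids produced by the data pipeline (if present)
try:
    with open("heatmaps.json") as f:
        HEATMAPS = json.load(f)
except FileNotFoundError:
    HEATMAPS = {}


class DummyModel:
    """Stand-in for one of the trained scikit-learn models."""
    def predict_proba(self, X):
        return [[0.9, 0.1] for _ in X]


def select_model(shot: dict) -> DummyModel:
    # Placeholder: the real backend chooses between the trained models
    # based on the situation and shot type provided.
    return DummyModel()


def preprocess(shot: dict) -> list[float]:
    # Placeholder: the real backend derives features (distance, angle, one-hot
    # categories) from the raw inputs before prediction.
    return [float(shot.get("x", 0.9)), float(shot.get("y", 0.5))]


@app.route("/redshaw-xg/api/predict", methods=["POST"])
def predict():
    shot = request.get_json()
    model = select_model(shot)
    features = preprocess(shot)
    xg = model.predict_proba([features])[0][1]
    return jsonify({"xg": round(float(xg), 3)})


@app.route("/redshaw-xg/api/predict/grid", methods=["GET"])
def predict_grid():
    key = f"{request.args.get('situation')}|{request.args.get('shot_type')}"
    return jsonify(HEATMAPS.get(key, []))


if __name__ == "__main__":
    app.run(debug=True)
```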
This entire project, from the data pipeline to the interactive frontend, is open-source. If you're interested in the technical details or just want to see the code, you can find the full repository on GitHub.