Unsupervised Machine Learning on attacking footballers in Europe’s top 5 leagues (24/25), featuring K-Means Clustering, Anomaly Detection, and an NMF-based similarity recommendation system.

sp-muramutsa/football

Unsupervised Machine Learning: Footballer Attacking Productivity Clustering & Similarity Recommendation System

Python pandas scikit-learn streamlit matplotlib seaborn License


📖 Project Overview

This project applies unsupervised learning techniques to analyze attacking footballers' productivity metrics from the 2024/25 season across Europe’s Big 5 leagues:

  • Clustering with K-Means: Players are grouped by similarity in goal contributions, expected goals/assists, and progressive actions to identify meaningful player archetypes.
  • Dimensionality Reduction and Similarity Search with NMF: Extracts latent feature representations from non-negative attacking stats to recommend players with similar attacking productivity profiles.

The deliverables include:

  • An exploratory data analysis (EDA) Jupyter notebook covering distributional analysis, team-level attacking efficiency, clustering, and player profiling.
  • A Streamlit web app that provides an interactive interface to recommend and visualize similar forwards using NMF-based embeddings.

🗃️ Dataset & Feature Engineering

  • Source: Detailed player performance data extracted from Europe’s top five leagues (Premier League, La Liga, Serie A, Bundesliga, Ligue 1) for the 2024/25 season.

  • Focus on forwards only (Pos includes "FW").

  • Selected features reflect attacking productivity and contributions, including:

    Metric      Description
    Gls         Goals scored
    Ast         Assists made
    G+A         Goals plus assists
    G-PK        Goals excluding penalties
    PK          Penalty goals
    PKatt       Penalty attempts
    xG          Expected goals
    npxG        Non-penalty expected goals
    xAG         Expected assisted goals
    npxG+xAG    Non-penalty expected goals plus expected assisted goals
    PrgC        Progressive carries
    PrgP        Progressive passes
    PrgR        Progressive passes received
  • Preprocessing:

    • Applied MinMaxScaler to scale each feature to the range [0, 1].
    • For clustering, standardized features with StandardScaler where appropriate.
    • Reduced dimensionality with NMF, which relies on the non-negativity of the scaled features and yields a parts-based representation.
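
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the rows are invented toy values, only three of the listed metrics are used, and the variable names (`players`, `FEATURES`, `forwards`) are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy rows with invented values; the real project uses the full 2024/25 dataset.
players = pd.DataFrame(
    {
        "Player": ["A", "B", "C", "D"],
        "Pos": ["FW", "MF,FW", "DF", "FW,MF"],
        "Gls": [20, 5, 1, 9],
        "Ast": [4, 7, 0, 3],
        "xG": [17.5, 4.2, 0.6, 8.1],
    }
)
FEATURES = ["Gls", "Ast", "xG"]  # hypothetical subset of the metrics listed above

# Keep every player whose position string includes "FW".
forwards = players[players["Pos"].str.contains("FW")].reset_index(drop=True)

# Min-max scaling maps each column to [0, 1], so the matrix stays non-negative for NMF.
minmax_scaled = MinMaxScaler().fit_transform(forwards[FEATURES])

# Standardization gives each column zero mean and unit variance for K-Means.
standardized = StandardScaler().fit_transform(forwards[FEATURES])
```

Keeping two separately scaled copies reflects the split described above: NMF requires non-negative input, while K-Means benefits from features on a comparable variance scale.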

🔧 Notebook: Exploratory Data Analysis & Clustering

  • EDA steps:

    • Univariate histograms reveal the strong right (positive) skew typical of attacking metrics, with many low-scoring players and a few high performers.
    • Aggregated team-level attacking outputs via sums of goals, assists, xG, and xAG.
    • Constructed an Efficiency Score composite metric, weighted to reflect contributions beyond raw stats.
    • Visualized relationships between actual and expected goals/assists to identify over- and under-performers.
  • K-Means clustering:

    • Performed on standardized attacking metrics to segment forwards into 4 clusters.

    • Silhouette scores and inertia were used to evaluate cluster separation and compactness.

    • Cluster centroids visualized via radar/spider plots to interpret trait differences.

    • Identified clusters roughly correspond to:

      • Elite “lethal strikers” with dominant finishing and expected goals.
      • Creative, well-rounded attackers combining assists and progressive play.
      • Average contributors with moderate stats.
      • Low-productivity or peripheral forwards.
    • Outlier analysis isolates standout players within clusters (e.g., Mbappé as an outlier in the lethal striker cluster).
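
As a rough sketch of the clustering workflow described above, the snippet below fits K-Means on standardized features, compares inertia and silhouette scores across candidate cluster counts, and flags points far from their centroid. The data is synthetic (`make_blobs`), not real player stats, and the distance-to-centroid outlier heuristic is an assumption; the notebook's actual outlier method is not specified here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the forwards' feature matrix (not real player data).
X_raw, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Compare candidate cluster counts by inertia (compactness) and silhouette (separation).
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(model.inertia_, 1), round(silhouette_score(X, model.labels_), 3))

# Fit the 4-cluster model described above.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Assumed outlier heuristic: distance of each point to its own cluster centroid.
distances = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)
outlier_idx = np.argsort(distances)[-5:]  # the 5 points farthest from their centroid
```

The rows of `kmeans.cluster_centers_` are the centroids that a radar plot would display, one polygon per cluster.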


⚙️ Streamlit App: NMF-based Similarity Recommender

  • Workflow:

    1. Scale selected features with MinMaxScaler.
    2. Fit NMF (max_iter=500) to factorize the player-feature matrix into latent components.
    3. Normalize the resulting latent vectors to unit length (L2 normalization).
    4. Compute cosine similarity between the target player and all others in latent space.
    5. Return top-N similar players ranked by similarity score.
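
The five steps above could look something like the following. This is a hedged sketch rather than the app's actual code: the function name `recommend_similar`, the `n_components` value, the `init` choice, and the toy players with invented stats are all assumptions made for illustration; only `max_iter=500` comes from the description above.

```python
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler, normalize


def recommend_similar(stats: pd.DataFrame, player: str, top_n: int = 3,
                      n_components: int = 2) -> pd.Series:
    """Rank the other players by cosine similarity in NMF latent space."""
    # Step 1: scale every feature to [0, 1] (keeps the matrix non-negative).
    scaled = MinMaxScaler().fit_transform(stats)
    # Step 2: factorize the player-feature matrix into latent components.
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500, random_state=0)
    latent = nmf.fit_transform(scaled)
    # Step 3: rescale each player's latent vector to unit L2 norm.
    latent = normalize(latent)
    # Step 4: cosine similarity between the target player and everyone else.
    target = latent[stats.index.get_loc(player)].reshape(1, -1)
    scores = cosine_similarity(target, latent).ravel()
    # Step 5: drop the target itself and return the top-N most similar players.
    ranked = pd.Series(scores, index=stats.index).drop(player)
    return ranked.sort_values(ascending=False).head(top_n)


# Hypothetical players with invented values, indexed by name.
stats = pd.DataFrame(
    {
        "Gls": [25, 22, 6, 4, 12],
        "xAG": [3.0, 4.1, 9.5, 8.0, 6.2],
        "PrgC": [40, 55, 120, 110, 80],
    },
    index=["Striker A", "Striker B", "Winger C", "Winger D", "Forward E"],
)
print(recommend_similar(stats, "Striker A", top_n=2))
```

Once the latent vectors have unit length, cosine similarity reduces to a dot product, so step 3 mainly matters when similarities are computed by plain matrix multiplication.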

🎯 Applications

  • Player scouting: Find hidden gems or comparable alternatives by productivity profile, useful when positional data is unavailable or ambiguous.
  • Transfer market: Data-driven similarity can inform recruitment strategy and mitigate risk.
  • Player development: Track progression in latent productivity traits over time.
  • Football analytics: Extends typical metrics to non-negative latent factors that capture attacking style nuances.

🚀 How to Run

  1. Clone the repo.

  2. Install dependencies:

    pip install -r requirements.txt
  3. Open and explore the Jupyter Notebook for detailed EDA and clustering:

    jupyter notebook football_analysis.ipynb
  4. Launch the Streamlit app for real-time recommendations:

    streamlit run football.py

📚 Libraries & Tools

  • Python, pandas, scikit-learn, Streamlit, matplotlib, seaborn


⚖️ License

This project is licensed under the MIT License.
