Data Dives

An Exploratory Data Analysis of my curated EPL data set

April 05, 2024 | 7 Minute Read


Visualizing The Beautiful Game

This post will visually explore the English Premier League data that I curated in my last blog post. For more information on the scraping and curation process, feel free to read that post!

Cover image source: Vecteezy

Introduction

As I discussed in my last blog post, I love “The Beautiful Game,” as soccer (or “football”) is often called. The best soccer league in the world is the English Premier League (EPL), and it boasts many of the top players. One of the biggest events each year in the EPL isn’t actually a game; rather, it is the transfer window (in other sports, this may be known as the trade deadline or trade window). One of the biggest talking points throughout the season is which players will get transferred to which squads.

The 20 clubs in the EPL this year
Image Source: Premier League/Facebook

Using data from the FootballTransfers website, I curated a data set on the top 250 players in the EPL, based on their Estimated Transfer Value (ETV). In this post, I will show many different visuals that will help us understand the relationships between the different variables.

Understanding the Data

The data set I used for this is actually quite small. I could’ve scraped more to curate a larger data set, but I was curious to see how much I could do even with little data. The variables included in my data set are the following:

  • Name
    • the player’s name
  • Position
    • the specific position(s) of the player
  • Skill
    • the current skill level of the player
  • Potential
    • the player’s potential level, based on how similar players have developed in the past
  • ETV
    • the player’s Estimated Transfer Value, which is roughly how much a club would have to pay for them if they purchased them right now
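
Before plotting anything, it can help to confirm that the data set really has these five variables and that the numeric ones were read as numbers. The sketch below is only an illustration: the rows are placeholders rather than real players, and the lowercase column names `name` and `potential` are assumptions (the code later in this post only confirms `position`, `skill`, and `etv`).

```python
import pandas as pd

# Column names assumed for this sketch (lowercase, one per variable listed above)
EXPECTED_COLUMNS = ["name", "position", "skill", "potential", "etv"]

def missing_columns(df):
    """Return the expected columns that are not present in df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Placeholder rows for illustration only, not real players or values
sample = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "position": ["GK", "F (C)", "D (C)"],
    "skill": [80.0, 85.5, 78.0],
    "potential": [82.0, 90.0, 84.5],
    "etv": [30.0, 75.0, 40.0],
})

print(missing_columns(sample))   # an empty list means every variable is present
print(sample.dtypes)             # skill, potential, and etv should be numeric
print(sample.describe())         # quick summary of the numeric variables
```

With the real data set, the same three calls would be run on `df` right after reading the csv file.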

Modifying the Data

Looking at the FootballTransfers website, most of the variables already come in a form that is ready for analysis and visualization. However, we may have issues with the Position variable. Each player is listed with very specific positions, which leaves too many distinct values to compare groups of players easily.

The position variable may cause issues because of how FootballTransfers lists the positions

We will quickly create a new variable in the data set called general_position, which groups each player into one of the more common position names: Goalkeeper, Defender, Midfielder, or Forward. To do this, we will use Python’s startswith string method. The following code chunk accomplishes this:

# Create a new variable for positions
def map_general_position(position):
    if position == 'GK':
        return 'Goalkeeper'
    elif position.startswith(('D ', 'D,')):  # startswith accepts a tuple of prefixes
        return 'Defender'
    elif position.startswith(('DM', 'AM', 'M')):
        return 'Midfielder'
    elif position.startswith('F'):
        return 'Forward'
    else:
        return 'Other'  # You can handle other cases as needed

# Apply the function to create the new 'general_position' variable
df['general_position'] = df['position'].apply(map_general_position)

We start by defining a function that looks for certain characters at the beginning of the position string in the Position column. Then, we apply that function to every row and store the result as a new column in the data set.
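
To see how the prefix matching behaves, here is a small self-contained check. It restates the mapping function in an equivalent, slightly more compact form and runs it on a handful of position strings. These strings are hypothetical examples made up for illustration, not values copied from FootballTransfers.

```python
def map_general_position(position):
    """Collapse a detailed position string into a general position group."""
    if position == 'GK':
        return 'Goalkeeper'
    elif position.startswith(('D ', 'D,')):
        return 'Defender'
    elif position.startswith(('DM', 'AM', 'M')):
        return 'Midfielder'
    elif position.startswith('F'):
        return 'Forward'
    return 'Other'

# Hypothetical position strings, made up for illustration
samples = ['GK', 'D (C)', 'D, DM (R)', 'DM (C)', 'AM (L)', 'M (C)', 'F (C)', 'D']
results = {s: map_general_position(s) for s in samples}

for detailed, general in results.items():
    print(f"{detailed!r:>12} -> {general}")
```

Note that a bare 'D' with nothing after it falls through to 'Other', because the defender check expects a space or comma after the letter. Counting the 'Other' values in the real data (for example with `df['general_position'].value_counts()`) is a quick way to spot strings the function did not anticipate.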

A screenshot of the csv file, taken from my GitHub profile

Now we have our final data set that we will use to run our data visualization.

Exploratory Data Analysis

Let’s first get a feel for our data by creating a correlation matrix of the numeric variables (Skill, Potential, and ETV). We will do this with Seaborn, a Python data visualization library built on matplotlib, using nested for loops to annotate each cell.

# Correlation Matrix of all numeric variables
import matplotlib.pyplot as plt
import seaborn as sns

# Select the numeric columns by name (assumes lowercase names, as in the other plots)
# and compute the correlation matrix once
corr = df[['skill', 'potential', 'etv']].corr()

ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap='YlGnBu')

# Add annotations to every square
for i in range(len(corr.columns)):
    for j in range(len(corr.columns)):
        ax.text(j + 0.5, i + 0.5, f"{corr.iloc[i, j]:.2f}", ha='center', va='center', color='white')

plt.title('Correlation Heatmap')
plt.show()
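
As an alternative to annotating each square by hand, Seaborn's heatmap can write the values itself through its `annot` and `fmt` parameters. The sketch below wraps that in a small helper. The helper name `plot_corr_heatmap` and the sample values are placeholders for illustration, and the column names are assumed to be lowercase.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_corr_heatmap(df, columns):
    """Draw a correlation heatmap with the coefficient printed in each cell."""
    corr = df[columns].corr()
    # annot=True writes each value into its cell; fmt controls the number format
    ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap='YlGnBu', annot=True, fmt='.2f')
    ax.set_title('Correlation Heatmap')
    return ax

# Placeholder values for illustration only
sample = pd.DataFrame({
    'skill': [70.0, 75.0, 80.0, 85.0],
    'potential': [78.0, 77.0, 86.0, 88.0],
    'etv': [10.0, 18.0, 35.0, 60.0],
})

ax = plot_corr_heatmap(sample, ['skill', 'potential', 'etv'])
```

Calling `plt.show()` afterwards displays the figure. A nice side effect of `annot=True` is that Seaborn picks a readable text color for each cell, which is hard to get right with a single fixed color.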

I also wanted to learn how the different variables are distributed across different positions. To do this, we will start simply with a pie chart of the proportions of positions of the top 250 EPL players.

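# Count the players in each general position
position_distribution = df['general_position'].value_counts()

plt.pie(position_distribution, labels=position_distribution.index, autopct='%1.1f%%', startangle=140, colors=['skyblue', 'orchid', 'seagreen', 'coral'])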
plt.title('Proportion of Top 250 Player Positions in EPL')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

Next, we will make a violin plot showing how ETV varies across positions.

sns.violinplot(data=df, x='general_position', y='etv', palette=['seagreen', 'skyblue', 'orchid', 'coral'])
plt.title('Distribution of ETV Across Positions')
plt.xlabel('Position')
plt.ylabel('Estimated Transfer Value (ETV)')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()
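
A violin plot shows the shape of each distribution, and a small summary table can put numbers next to it. The sketch below groups by `general_position` and summarizes `etv`. The helper name `etv_summary` and the sample values are placeholders for illustration, not figures from the real data set.

```python
import pandas as pd

def etv_summary(df):
    """Summarize ETV per general position, sorted by median ETV."""
    return (
        df.groupby('general_position')['etv']
          .agg(['count', 'median', 'mean', 'min', 'max'])
          .sort_values('median', ascending=False)
    )

# Placeholder values for illustration only
sample = pd.DataFrame({
    'general_position': ['Forward', 'Forward', 'Midfielder',
                         'Midfielder', 'Defender', 'Goalkeeper'],
    'etv': [60.0, 40.0, 35.0, 25.0, 20.0, 15.0],
})

summary = etv_summary(sample)
print(summary)
```

The count column is worth a glance before reading too much into a violin: a position with only a handful of players produces a shape that is mostly smoothing.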

Finally, we will create a regression plot to show us the relationship between ETV and Skill.

sns.regplot(x="skill", y="etv", data=df)
plt.title('ETV by Skill')
plt.show()
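
By default, regplot draws a straight-line fit through the points but does not report the numbers behind it. One way to get them, sketched below under the assumption that a simple least-squares line is what we want, is NumPy's `polyfit` together with pandas' correlation method. The helper name `fit_line` and the sample values are placeholders for illustration.

```python
import numpy as np
import pandas as pd

def fit_line(df, x='skill', y='etv'):
    """Return slope, intercept, and Pearson correlation for y against x."""
    # A degree-1 polynomial fit is an ordinary least-squares straight line
    slope, intercept = np.polyfit(df[x], df[y], deg=1)
    r = df[x].corr(df[y])
    return slope, intercept, r

# Placeholder values constructed so that etv = 2 * skill + 1 exactly
sample = pd.DataFrame({
    'skill': [70.0, 75.0, 80.0, 85.0],
    'etv': [141.0, 151.0, 161.0, 171.0],
})

slope, intercept, r = fit_line(sample)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

On the real data, the slope would read as the change in ETV associated with one additional point of Skill, in whatever units ETV is recorded in.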

These are just a few basic plots that show relationships between variables of interest, and the list is far from comprehensive. We could swap in different variables, try other plot types, or combine the two. Seaborn and matplotlib are excellent visualization tools for better understanding data.

Conclusion

The purpose of this blog post was not only to show how different variables are connected in the English Premier League, but also to show how simple it is to start visualizing your data! There are nearly endless types of plots and charts you can use to begin understanding your data, and this post covered only a small sample of them.

Stay tuned for my next blog post where I will be demonstrating how to create a Streamlit Web App to do many of these same things in a more aesthetic form!

If you have any questions, please feel free to contact me through any of my social media sites.