- Initial Exploration
- Positions
- Height and Weight
- Top 5 Leagues
- England vs. Germany PKs
- Southpaws
- Similarity

The FIFA videogame franchise is a worldwide phenomenon, accounting for roughly 40% of Electronic Arts' yearly revenue. FIFA 18, released in Nov. 2017, sold 5.9 million units in its first week alone, and surpassed 10 million units by the end of the year. In order to meet the demands of soccer fanatics across the globe, EA strives for accuracy, and as such, each FIFA iteration includes an enormous database of player statistics. In this notebook, we explore the FIFA 18 database (retrieved from Kaggle).

Let's begin by getting an idea of the scope of the dataset.

In [1]:

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
plt.style.use('seaborn')
fifa = pd.read_csv('complete.csv')
```

In [2]:

```
print('Players:', '\t', fifa.shape[0], '\n',
'Nationalities:', '\t', fifa.groupby('nationality')['ID'].count().size, '\n',
'Clubs:', '\t\t', fifa.groupby('club')['ID'].count().size, '\n',
'Leagues:', '\t', fifa.groupby('league')['ID'].count().size, sep='')
```

Wow, that is a lot of players and clubs. Now let's look at what data the dataframe actually contains. Note that the players in the dataframe are indexed in descending order based on the **overall** stat, which is commonly viewed as the most important/significant stat for a player.

In [3]:

```
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(fifa.head())
```

Each row represents a different player, and each player has 185 different stat fields in the dataframe. The columns contain a variety of different data types including strings, ints, and booleans. Let's examine the most common nationalities over the entire dataframe. (We also add a "Rank" column, in case our grouping later on changes the initial indexing.)

In [4]:

```
fifa['Rank']=fifa.index+1
fifa.groupby('nationality')['ID'].count().sort_values(ascending=False).head(10)
```

Out[4]:

Let's also take a look at the 30 least common nationalities.

In [5]:

```
fifa.groupby('nationality')['ID'].count().sort_values(ascending=False).tail(30)
```

Out[5]:

In later explorations, for statistical significance, we will filter out countries that have very few players.

Instead of just looking at the raw number of players from each country, let's see how many players from each country have an overall FIFA rating of 80 or above. (80 is a very high score; these players are ranked as the top 524 players in the world).

In [6]:

```
fifa[fifa.overall>=80].groupby('nationality')['ID'].count().sort_values(ascending=False).head(10)
```

Out[6]:

Comparing this filtered player count to the raw player count, we see that Spain vastly outperforms expectations in terms of great players, with 80 players out of 1020 holding an overall rating of 80 or higher. On the other hand, England seems to be underperforming, with only 33 players out of 1631 meeting the same rating requirement. Let's compare the distribution of the overall ratings.

In [7]:

```
sns.kdeplot(fifa[fifa.nationality=='Spain']['overall'], label='Spain', shade=True)
sns.kdeplot(fifa[fifa.nationality=='England']['overall'], label='England', shade=True)
plt.xlabel('Overall Rating')
plt.ylabel('Probability Density')
```

Out[7]:

This PDF comparison is illuminating, showing how many more elite-level players Spain has as compared to England.
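The raw counts quoted above (80 of 1020 Spanish players and 33 of 1631 English players rated 80 or higher) are easier to compare as shares. The following is a minimal sketch on a small made-up frame; the helper name `elite_share` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def elite_share(df, threshold=80):
    """Fraction of each nationality's players rated at or above `threshold`."""
    is_elite = df['overall'] >= threshold
    # The mean of a boolean Series is the share of True values.
    return is_elite.groupby(df['nationality']).mean().sort_values(ascending=False)

# Toy data (hypothetical), using the same column names as the notebook.
toy = pd.DataFrame({
    'nationality': ['Spain'] * 4 + ['England'] * 4,
    'overall':     [85, 81, 74, 70, 82, 75, 68, 66],
})
shares = elite_share(toy)
print(shares)

# Shares implied by the counts quoted in the text.
spain_share = 80 / 1020
england_share = 33 / 1631
print(f'Spain: {spain_share:.1%}, England: {england_share:.1%}')
```

With the quoted counts, roughly 7.8% of Spanish players clear the bar, versus about 2.0% of English players.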

Let's shift our focus towards positions. At the very basic level, soccer positions can be split into Goalkeepers, Defenders, Midfielders, and Forwards. At a more detailed level, there are more nuanced positions and roles such as the False 9, Poacher, No. 10, etc. Let's choose to take a middling approach and categorize the positions into

- GK
- Center-back
- Outside-back
- Center-mid
- Outside-mid
- Forward

The dataframe includes columns indicating which (highly specific) position the player prefers to play. Let's write some code to perform our categorization.

In [8]:

```
fifa['Position']= np.nan
fifa.loc[fifa.prefers_lm == True, ['Position']] = 'Outside-mid'
fifa.loc[fifa.prefers_rm == True, ['Position']] = 'Outside-mid'
fifa.loc[fifa.prefers_lw == True, ['Position']] = 'Outside-mid'
fifa.loc[fifa.prefers_rw == True, ['Position']] = 'Outside-mid'
fifa.loc[fifa.prefers_cam == True, ['Position']] = 'Center-mid'
fifa.loc[fifa.prefers_cm == True, ['Position']] = 'Center-mid'
fifa.loc[fifa.prefers_cdm == True, ['Position']] = 'Center-mid'
fifa.loc[fifa.prefers_rb == True, ['Position']] = 'Outside-back'
fifa.loc[fifa.prefers_lb == True, ['Position']] = 'Outside-back'
fifa.loc[fifa.prefers_cb == True, ['Position']] = 'Center-back'
fifa.loc[fifa.prefers_st == True, ['Position']] = 'Forward'
fifa.loc[fifa.prefers_cf == True, ['Position']] = 'Forward'
fifa.loc[fifa.prefers_gk == True, ['Position']] = 'GK'
pos_order = ['GK', 'Center-back', 'Outside-back', 'Center-mid', 'Outside-mid', 'Forward']
```
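The sequence of assignments above can also be driven by a lookup table, which makes the precedence explicit: when a player prefers several positions, the group listed last wins, just as the last `.loc` assignment wins in the cell above. This is only a sketch; `POSITION_GROUPS`, `categorize`, and the toy frame are hypothetical names introduced here, and the toy frame contains only a handful of the `prefers_*` columns.

```python
import numpy as np
import pandas as pd

# Order matters: later entries overwrite earlier ones, mirroring the
# order of the .loc assignments in the notebook cell.
POSITION_GROUPS = [
    ('Outside-mid',  ['prefers_lm', 'prefers_rm', 'prefers_lw', 'prefers_rw']),
    ('Center-mid',   ['prefers_cam', 'prefers_cm', 'prefers_cdm']),
    ('Outside-back', ['prefers_rb', 'prefers_lb']),
    ('Center-back',  ['prefers_cb']),
    ('Forward',      ['prefers_st', 'prefers_cf']),
    ('GK',           ['prefers_gk']),
]

def categorize(df):
    """Return a Series of coarse position labels (NaN if nothing matches)."""
    position = pd.Series(np.nan, index=df.index, dtype=object)
    for label, cols in POSITION_GROUPS:
        present = [c for c in cols if c in df.columns]
        if present:
            # A player belongs to the group if any of its flags is True.
            position.loc[df[present].any(axis=1)] = label
    return position

# Toy frame (hypothetical) with only a few of the preference flags.
toy = pd.DataFrame({
    'prefers_lm': [True, False, True, False, False],
    'prefers_cb': [False, True, False, False, False],
    'prefers_st': [False, False, True, False, False],
    'prefers_gk': [False, False, False, True, False],
})
labels = categorize(toy)
print(labels)
```

The third toy player prefers both LM and ST and ends up labeled as a forward, because the forward group comes later in the table.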

I wonder if any countries excel at producing world-class soccer players for specific positions. Let's take a look at the top 50 players per position, counting the number from each country.

In [9]:

```
top_gk = fifa[fifa.Position=='GK'].head(50)
top_gk.groupby('nationality')['ID'].count().sort_values(ascending=False).head(5)
```

Out[9]:

In [10]:

```
top_cb = fifa[fifa.Position=='Center-back'].head(50)
top_cb.groupby('nationality')['ID'].count().sort_values(ascending=False).head(5)
```

Out[10]:

In [11]:

```
top_def = fifa[fifa.Position=='Outside-back'].head(50)
top_def.groupby('nationality')['ID'].count().sort_values(ascending=False).head(5)
```

Out[11]:

In [12]:

```
top_cm = fifa[fifa.Position=='Center-mid'].head(50)
top_cm.groupby('nationality')['ID'].count().sort_values(ascending=False).head(5)
```

Out[12]:

In [13]:

```
top_om = fifa[fifa.Position=='Outside-mid'].head(50)
top_om.groupby('nationality')['ID'].count().sort_values(ascending=False).head(5)
```

Out[13]:

In [14]:

```
top_for = fifa[fifa.Position=='Forward'].head(50)
top_for.groupby('nationality')['ID'].count().sort_values(ascending=False).head(5)
```

Out[14]:
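The six cells above repeat one pattern: take the best players at a position, then count nationalities. A compact sketch of that pattern follows. It assumes columns named `Position`, `overall`, and `nationality` as in the notebook, but it sorts by `overall` explicitly rather than relying on the row order of the dataframe; the function name and toy data are made up for illustration.

```python
import pandas as pd

def top_nationalities(df, n_players=50, n_countries=5):
    """For each position, count nationalities among the top-rated players."""
    result = {}
    for position, group in df.groupby('Position', sort=False):
        # Stable sort keeps the original order among equally rated players.
        top = group.sort_values('overall', ascending=False, kind='stable').head(n_players)
        result[position] = top['nationality'].value_counts().head(n_countries)
    return result

# Toy data (hypothetical).
toy = pd.DataFrame({
    'Position':    ['GK', 'GK', 'GK', 'GK', 'Forward', 'Forward', 'Forward'],
    'nationality': ['Germany', 'Germany', 'Spain', 'England', 'Spain', 'Spain', 'England'],
    'overall':     [90, 88, 85, 70, 92, 89, 80],
})
counts_gk = top_nationalities(toy, n_players=3)['GK']
counts_fw = top_nationalities(toy, n_players=2)['Forward']
print(counts_gk)
print(counts_fw)
```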

We see that Spain tops 4 out of the 6 lists. The most impressive figure here is surely the fact that 12 of the top 50 center midfielders (including attacking and defensive) are Spanish. Let's take a look to see who they are. Soccer fans will recognize these names as some of the best in the world.

In [15]:

```
key_attr = ['name', 'overall', 'club', 'league', 'nationality']
top_cm[top_cm.nationality=='Spain'][key_attr]
```

Out[15]:

At the professional level, soccer players must be in peak physical shape to be competitive. However, players come in all shapes and sizes. Let's take a look at a KDE plot with height on the x-axis and weight on the y-axis. The marginal distributions are shown along the edges of the plot.

In [16]:

```
sns.jointplot(x='height_cm', y='weight_kg', data=fifa, kind="kde")
```

Out[16]:

Let's take a quick glance at the statistics for the height and weight metrics.

In [17]:

```
fifa.height_cm.describe()
```

Out[17]:

In [18]:

```
fifa.weight_kg.describe()
```

Out[18]:

Converting to imperial units, we see that the average height is 5'11.4" (min: 5'1", max: 6'8.7"), and the average weight is 166.2 lbs (min: 108 lbs, max: 242.5 lbs).
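For reference, the unit conversion is simple arithmetic: 1 inch is 2.54 cm and 1 kg is about 2.20462 lbs. The sketch below uses hypothetical helper names, and the inputs 205 cm and 110 kg are chosen because they are consistent with the maxima quoted above; they are not read from the dataframe here.

```python
CM_PER_INCH = 2.54
LB_PER_KG = 2.20462

def cm_to_feet_inches(cm):
    """Convert centimeters to a (feet, remaining inches) pair."""
    total_inches = cm / CM_PER_INCH
    feet = int(total_inches // 12)
    inches = total_inches - 12 * feet
    return feet, inches

def kg_to_lb(kg):
    """Convert kilograms to pounds."""
    return kg * LB_PER_KG

feet, inches = cm_to_feet_inches(205)
print(f"205 cm = {feet}'{inches:.1f}\"")
print(f'110 kg = {kg_to_lb(110):.1f} lbs')
```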

Let's see if there are any countries that produce significantly taller or shorter players than other countries. We first filter out any countries that contain fewer than 50 players in the dataframe. Then, we find the countries with the tallest and shortest average heights for their players.

In [19]:

```
fifa_sig = fifa.groupby('nationality')
fifa_sig = fifa_sig.filter(lambda x:x['nationality'].count()>=50)
fifa_sig.groupby('nationality')['height_cm'].mean().sort_values(ascending=False).head(5)
```

Out[19]:

In [20]:

```
fifa_sig.groupby('nationality')['height_cm'].mean().sort_values().head(5)
```

Out[20]:

Let's create a KDE plot comparing the heights and weights of the two countries at the extremes.

In [21]:

```
bosnia=fifa[fifa.nationality=='Bosnia Herzegovina']
saudi=fifa[fifa.nationality=='Saudi Arabia']
sns.kdeplot(bosnia.height_cm, bosnia.weight_kg, cmap='Reds')
sns.kdeplot(saudi.height_cm, saudi.weight_kg, cmap='Blues')
```

Out[21]:

The plot clearly shows that players from Bosnia Herzegovina tend to be significantly taller and heavier than the players from Saudi Arabia. This agrees with various lists of average human height worldwide, all of which list Bosnia Herzegovina either first or second in terms of average height (and Saudi Arabia near the bottom).

Let's shift our attention to the body composition per position. We use a boxplot; the box itself contains the interquartile range (IQR), i.e., from 25% to 75%. The whiskers extend above and below the box to the most extreme data points that lie within $1.5 \times IQR$ of the box edges. Points beyond the whiskers are drawn individually as outliers. The line inside the box marks the median of the data.
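To make the boxplot ingredients concrete, here is a minimal sketch that computes them by hand with NumPy. The function name `box_stats` and the sample heights are made up for illustration, and the quartiles use NumPy's default linear interpolation, which may differ slightly from other conventions.

```python
import numpy as np

def box_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = v[(v >= low_fence) & (v <= high_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3, 'iqr': iqr,
        # Whiskers stop at the most extreme points inside the fences.
        'whisker_low': inside.min(), 'whisker_high': inside.max(),
        'outliers': v[(v < low_fence) | (v > high_fence)],
    }

# Hypothetical heights in cm; 210 is far enough out to count as an outlier.
stats = box_stats([170, 175, 178, 180, 182, 185, 188, 210])
print(stats)
```

For these sample values the IQR is 8.5 cm, so the upper fence sits at 198.5 cm: the upper whisker ends at 188 cm and the 210 cm player is drawn as an outlier.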

In [22]:

```
sns.boxplot(x='Position', y='height_cm', data=fifa, order=pos_order)
```

Out[22]:

Wow, there are a few notable things to point out.

- The goalkeepers are tallest, which is to be expected since there is a high correlation between height and wingspan.
- The IQRs of the center-backs and the outside-backs do not even overlap! This makes sense and highlights why we separated the two defender positions. Center-backs tend to be tall so that they can win headers in and around the 18-yard box. Outside-backs need to man-mark speedy forwards, and thus tend to be smaller in stature.
- The outside-mids are the shortest on average. They need to be able to fly down the wings at incredible speeds to deliver crosses to the trailing forwards.

It appears that forwards have the most spread out height distribution. Let's glance at the violin plot for further details.

In [23]:

```
sns.violinplot(x='Position', y='height_cm', data=fifa, order=pos_order)
```

Out[23]:

Yes, it is clear that the forwards are the most spread out in terms of height. For numerical data, we can turn to the standard deviation.

In [24]:

```
fifa.groupby('Position')['height_cm'].std()
```

Out[24]:

Let's calculate the linear correlation between height and weight per position.

In [25]:

```
fifa.groupby('Position')[['height_cm', 'weight_kg']].corr()
```

Out[25]:

The position with the lowest correlation between height and weight is goalkeeper. Since goalkeepers aren't required to run miles on end during the game, they can come in a variety of shapes and sizes. While some goalkeepers are extremely tall and lanky, others are short and stout (with unusually long arms).
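The grouped `.corr()` call returns a full 2x2 matrix per position. If only the single height-weight coefficient per position is wanted, one option is sketched below. It assumes the same column names as the notebook; the function name and the toy numbers are hypothetical.

```python
import pandas as pd

def height_weight_corr(df):
    """Pearson correlation between height and weight, one value per position."""
    corr = {
        position: group['height_cm'].corr(group['weight_kg'])
        for position, group in df.groupby('Position')
    }
    return pd.Series(corr).sort_values()

# Toy data (hypothetical): forwards are perfectly linear, goalkeepers are not.
toy = pd.DataFrame({
    'Position':  ['GK'] * 4 + ['Forward'] * 3,
    'height_cm': [185, 190, 195, 200, 170, 180, 190],
    'weight_kg': [90, 80, 85, 88, 65, 75, 85],
})
per_position = height_weight_corr(toy)
print(per_position)
```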

It is widely acknowledged that the top 5 leagues in the world are the

- Spanish Primera División
- English Premier League
- Italian Serie A
- German Bundesliga
- French Ligue 1.

While the exact order is up for debate, I will keep them in this order, as per the current UEFA coefficients. To begin with, let's take a look at the average overall score per player per league.

In [26]:

```
leagues=['Spanish Primera División', 'English Premier League', 'Italian Serie A', 'German Bundesliga', 'French Ligue 1']
fifa_sub = fifa.loc[fifa.league.isin(leagues)]
fifa_sub.groupby('league')['overall'].mean()
```

Out[26]:

It seems that the Spanish league has the highest average quality of players and that the French league is far behind the other 4. Let's look at a strip plot, conditioned on position, to get a better sense of things.

In [27]:

```
_ = sns.stripplot(x='league', y='overall', hue='Position', data=fifa_sub, dodge=0.6, jitter=True, alpha=.5, order=leagues, hue_order=pos_order)
loc, labels = plt.xticks()
_.set_xticklabels(labels, rotation=45)
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
```

Out[27]:

It's tough to glean insight from this plot due to the sheer number of points. A boxplot will be more useful.

In [28]:

```
_=sns.boxplot(x='league', y='overall', hue='Position', data=fifa_sub, order=leagues, hue_order=pos_order)
loc, labels = plt.xticks()
_.set_xticklabels(labels, rotation=45)
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
```

Out[28]:

It is clear that the Spanish teams have a higher floor, i.e., their teams have much greater depth on the bench. While the English league seems to be well balanced by position, note that the outside-mids in the Italian league are far worse than the other positions, while the outside-mids in the German league are far better. To savvy soccer fans, this last observation makes perfect sense; the Bundesliga is known for its frenetic pace, sending wingers tearing down the sidelines, while the Serie A has a slower, more defensive, play-not-to-lose style.

Let's take a look at the rankings of the top 50 players in each league. (Since this is a ranking, the lower the number the better).

In [29]:

```
fifa_sub_top = fifa_sub.groupby('league').head(50)
_ = sns.boxplot(x='league', y='Rank', data=fifa_sub_top, order=leagues)
_ = sns.stripplot(x='league', y='Rank', data=fifa_sub_top, order=leagues)
loc, labels = plt.xticks()
_.set_xticklabels(labels, rotation=45)
```

Out[29]:

It's very interesting to see that when just looking at the top 50 players in each league, the English league appears to have the best ranking. However, with rankings, there are some nuances with boxplots that don't make them the best visual representation. Let's instead turn to a stackplot. The x-axis will represent the cumulative rankings, and the y-axis shows the percentage of players ranked at that point per league (i.e., at x=10, players ranked 1-10 will be represented). Let's first perform this visual up to rank 100, then we will expand to 1000.

In [30]:

```
top=100
fifa_top = fifa.head(top)
top_vec=[None]*(len(leagues))
for L in range(len(leagues)):
    top_vec[L] = [((fifa_top.Rank<=x+1) & (fifa_top.league==leagues[L])).sum() for x in range(top)]
#need to calculate the 'other' league
other = [None]*top
top_vec_numpy = np.array(top_vec)
for r in range(top):
    other[r] = r-np.sum(top_vec_numpy[:,r])+1
#now let's make the data frame
top_df = pd.DataFrame({leagues[0]:top_vec[0],
leagues[1]:top_vec[1],
leagues[2]:top_vec[2],
leagues[3]:top_vec[3],
leagues[4]:top_vec[4],
'Other Leagues':other}, index=range(1,top+1))
top_df_perc = top_df.divide(top_df.sum(axis=1), axis=0)
plt.stackplot(range(1,top+1), top_df_perc[leagues[0]], top_df_perc[leagues[1]],
top_df_perc[leagues[2]], top_df_perc[leagues[3]],
top_df_perc[leagues[4]], top_df_perc['Other Leagues'],
labels=[*leagues, 'Other Leagues'])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
plt.margins(0,0)
plt.title('Player Ranking Stacked Chart per League')
plt.xlabel('Player Ranking')
```

Out[30]:
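The cumulative shares behind the stack plot can also be computed without explicit loops, using one-hot encoding followed by a cumulative sum. This is a sketch under stated assumptions: the input frame is already sorted from best to worst player, it has a `league` column, and the names `cumulative_share` and `other_label` are introduced here for illustration.

```python
import pandas as pd

def cumulative_share(ranked, leagues, other_label='Other Leagues'):
    """Share of each league among the top-k players, for k = 1..len(ranked)."""
    # Anything outside the chosen leagues (including missing values) is "other".
    labels = ranked['league'].where(ranked['league'].isin(leagues), other_label)
    columns = [*leagues, other_label]
    counts = (pd.get_dummies(labels).astype(int)
                .reindex(columns=columns, fill_value=0)
                .cumsum())
    counts.index = range(1, len(counts) + 1)  # rank k instead of row position
    return counts.div(counts.sum(axis=1), axis=0)

# Toy ranking (hypothetical), best player first.
toy = pd.DataFrame({'league': ['A', 'B', 'A', None, 'C']})
shares = cumulative_share(toy, ['A', 'B'])
print(shares)
```

The resulting frame has one row per rank and one column per league, so it could be handed to `plt.stackplot(shares.index, shares.T.values, labels=shares.columns)` and reused for any cutoff (100, 1000, and so on) without copying the cell.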

It appears that the Spanish league has a high percentage of the very, very best players in the world. The French league gets an early boost from Neymar, ranked 3rd in the world, but doesn't get another player ranked until rank 30. Let's take a look at the percentages at the top 10 and the top 100.

In [31]:

```
#top-10
top_df_perc.iloc[9]
```

Out[31]:

In [32]:

```
#top-100
top_df_perc.iloc[-1]
```

Out[32]:

While the English league gets off to a slow start, they actually have more top-100 players than any other league. One thing that is remarkable is the utter dominance of the top 5 leagues. There is only 1 player ranked in the top 100 who is not in the main 5 leagues. Let's find out who it is.

In [33]:

```
fifa.loc[~fifa.league.isin(leagues)][['name', 'Rank', 'club', 'league', 'nationality', 'overall']].head(1)
```

Out[33]:

We now perform the same actions, but this time we get the stack plot for the top 1000 players instead of the top 100.

In [34]:

```
top=1000
fifa_top = fifa.head(top)
top_vec=[None]*(len(leagues))
for L in range(len(leagues)):
    top_vec[L] = [((fifa_top.Rank<=x+1) & (fifa_top.league==leagues[L])).sum() for x in range(top)]
#need to calculate the 'other' league
other = [None]*top
top_vec_numpy = np.array(top_vec)
for r in range(top):
    other[r] = r-np.sum(top_vec_numpy[:,r])+1
#now let's make the data frame
top_df = pd.DataFrame({leagues[0]:top_vec[0],
leagues[1]:top_vec[1],
leagues[2]:top_vec[2],
leagues[3]:top_vec[3],
leagues[4]:top_vec[4],
'Other Leagues':other}, index=range(1,top+1))
top_df_perc = top_df.divide(top_df.sum(axis=1), axis=0)
plt.stackplot(range(1,top+1), top_df_perc[leagues[0]], top_df_perc[leagues[1]],
top_df_perc[leagues[2]], top_df_perc[leagues[3]],
top_df_perc[leagues[4]], top_df_perc['Other Leagues'],
labels=[*leagues, 'Other Leagues'])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
plt.margins(0,0)
plt.title('Player Ranking Stacked Chart per League')
plt.xlabel('Player Ranking')
```

Out[34]:

In [35]:

```
#top-1000
top_df_perc.iloc[-1]
```

Out[35]:

These figures clearly demonstrate that the English and Spanish leagues are ahead of their competitors. Almost every starting player in these two leagues is ranked in the top 1000.

When it comes to penalty kick (PK) shootouts in international competitions, two teams clearly stand out from the norm: England and Germany. England has a woeful history with PK shootouts, losing all 3 in World Cup competitions, and winning only 1 out of 7 when the European Championship is included. Their relentless choking during PKs is well documented (1, 2, 3, 4) and may be adding to the continuing pressure in a feedback loop. At the opposite end of the spectrum, the German team is legendary for their calm demeanor in expertly executing PKs, winning all 4 of their PK shootouts during World Cups. In fact, their last PK shootout loss (in the WC or Euro) dates back more than 42 years, to Euro 1976. Few things in international soccer are as sure a bet as England losing a PK shootout and Germany winning one.

FIFA 18 includes a **Penalty** stat that describes how good a player is at taking a penalty kick. Let's investigate the differences between England and Germany.

In [36]:

```
countries = ['England', 'Germany']
fifa_pen = fifa.loc[fifa.nationality.isin(countries)]
sns.violinplot(x='nationality', y='penalties', data=fifa_pen, inner='quartile')
```

Out[36]:

This plot is surprising: it seems that English players have a slight edge over German players in the penalties stat. Let's break this down by position; it could be that players who are unlikely to take PKs in the first place are bringing down the German distribution.

In [37]:

```
sns.violinplot(x='Position', y='penalties', hue='nationality', data=fifa_pen, split=True, inner='quartile', order=pos_order)
```

Out[37]:

Wow, it seems like England really does have the edge over Germany in penalty stats for offensive players.

PKs are a two-sided coin though, and goalkeepers play a huge role in the outcome. Let's see if there are any differences in the overall quality of GKs between these two countries.

In [38]:

```
sns.violinplot(x='nationality', y='overall', data=fifa_pen[fifa_pen.Position=='GK'], inner='quartile')
```

Out[38]:

This is quite illuminating. The median overall metric for German GKs is at the same level as the 75th percentile overall metric for English GKs. Also note that England doesn't seem to really have any world-class GKs at the top of the violin plot. Since only the very best GKs per country really matter during an international tournament, let's see how the top GKs from each country compare.

In [39]:

```
fifa_pen.loc[(fifa_pen['Position']=='GK') & (fifa_pen['nationality']=='Germany')][['name', 'club', 'overall']].head(7)
```

Out[39]:

In [40]:

```
fifa_pen.loc[(fifa_pen['Position']=='GK') & (fifa_pen['nationality']=='England')][['name', 'club', 'overall']].head(7)
```

Out[40]:

These latest stats provide some deep insight. The 6 best German GKs are all ranked higher than J. Hart, the highest rated English GK. In addition to the mounting pressure the English players feel to break the PK curse, it can't help that they do not have a single choice for a world-class GK. On the other hand, the German players can keep their nerve, knowing they have a choice between some of the best GKs in the world.

Lefties make up roughly 10% of the population. However, it is well documented that in some sports, lefties make up a much larger proportion (e.g., 39% of hitters and 28% of pitchers played left-handed in the 2012 MLB season).

In soccer, we expect to see higher than 10% of players as left-footed since being left-footed is an advantage when playing on the left side of the field. This is especially true for left wingers who race down the sideline in order to whip in a cross. Let's investigate the percentages.

In [41]:

```
fifa[fifa.preferred_foot=='Left']['ID'].count()/fifa.shape[0]
```

Out[41]:

We see that 23% of the players in the dataframe are left-footed, which agrees with our intuition. Let's see if there is any significant difference in the overall quality of players based on foot preference.

In [42]:

```
fifa.groupby('preferred_foot')['overall'].mean()
```

Out[42]:

These scores are very close together, hinting that there is no discernible difference between the overall quality of left-footed players and right-footed players on average. However, in order to see if this difference of 0.5 is likely due to noise, our analysis must go one level deeper. Here, we use the bootstrapping method, in which we resample our players (with replacement). The advantage of the bootstrapping method is that it gives us intuition as to whether this 0.5 difference is statistically significant without having to assume any underlying distribution.

In [43]:

```
iterations=10000
boot=[]
for i in range(iterations):
    boot_mean = fifa.sample(frac=1, replace=True).groupby('preferred_foot')['overall'].mean()
    boot.append(boot_mean)
boot=pd.DataFrame(boot)
boot.plot.kde()
```

Out[43]:
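One way to turn the two bootstrap curves into a single statement is to bootstrap the difference in means directly and read off a percentile interval. The sketch below is a variation on the cell above, not a copy of it: it resamples within each group (so group sizes stay fixed) and runs on synthetic ratings. The function name, the seed, the group sizes, and the normal distributions are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_mean_diff(a, b, iterations=2000, seed=0):
    """Bootstrap distribution of mean(a) - mean(b), resampling each group."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(iterations)
    for i in range(iterations):
        # Resample each group with replacement, keeping its original size.
        resampled_a = rng.choice(a, size=a.size, replace=True)
        resampled_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = resampled_a.mean() - resampled_b.mean()
    return diffs

# Synthetic ratings (hypothetical), standing in for the two foot groups.
rng = np.random.default_rng(42)
left = rng.normal(66.8, 7.0, size=400)
right = rng.normal(66.2, 7.0, size=1300)

diffs = bootstrap_mean_diff(left, right)
observed = left.mean() - right.mean()
low, high = np.percentile(diffs, [2.5, 97.5])
print(f'observed difference: {observed:.2f}')
print(f'95% percentile interval: [{low:.2f}, {high:.2f}]')
```

If the interval excludes zero, the difference is unlikely to be resampling noise alone; if it straddles zero, the data do not separate the two groups at that level.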