the decision blog

song data viewed through principal component analysis


Today we use principal component analysis to explore Spotify data for songs written between 1950 and 2010. The data are in separate files (called 1950.csv, 1960.csv, etc) for each decade; we start by combining them into a single data frame.

import pandas as pd

# read each decade's file and drop its 'Number' index column
data_frames = {}
for i in range(1950,2011,10):
    data_frames[i] = pd.read_csv(str(i)+'.csv')
    data_frames[i] = data_frames[i].drop(['Number'],axis=1)

songs = pd.concat([data_frames[i]
                   for i in range(1950,2011,10)])
print(songs.head())
                                               title               artist  \
0                       Put Your Head On My Shoulder            Paul Anka   
1  Whatever Will Be Will Be (Que Sera Sera) (with...            Doris Day   
2                           Everybody Loves Somebody          Dean Martin   
3        Take Good Care Of My Baby - 1990 Remastered            Bobby Vee   
4                                 A Teenager In Love  Dion & The Belmonts

         top genre  year  bpm  nrgy  dnce  dB  live  val  dur  acous  spch  \
0  adult standards  2000  116    34    55  -9    10   47  155     75     3   
1  adult standards  1948  177    34    42 -11    72   78  123     86     4   
2  adult standards  2013   81    49    26  -9    34   40  162     81     4   
3  adult standards  2011   82    43    49 -12    12   66  151     70     6   
4  adult standards  1959   79    38    56  -9    13   62  158     67     3

   pop  
0   72  
1   62  
2   61  
3   60  
4   60

The fields with unobvious meanings are:

- bpm = beats per minute
- nrgy = energy level
- dnce = danceability
- dB = loudness
- live = liveness (a higher number means a higher likelihood that the recording is live)
- val = valence (a higher number corresponds to a more positive mood)
- dur = duration
- acous = acousticness
- spch = speechiness (higher means more spoken words)
- pop = popularity
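The next section reloads the combined data from a file called all_years.csv, so the concatenated frame was presumably written to disk at this point. A minimal sketch of the round trip, with toy frames standing in for the real decade files:

```python
import pandas as pd

# Toy frames standing in for the real decade files; the point is only
# that the concatenated frame gets written out as 'all_years.csv',
# the file the PCA section reloads.
df_a = pd.DataFrame({'title': ['song a'], 'pop': [72]})
df_b = pd.DataFrame({'title': ['song b'], 'pop': [60]})
combined = pd.concat([df_a, df_b], ignore_index=True)
combined.to_csv('all_years.csv', index=False)

reloaded = pd.read_csv('all_years.csv')
print(reloaded.shape)
```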

We extract the numerical fields and see how many principal components we need to explain various percentages of the variance in the data.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.decomposition import PCA

songs = pd.read_csv('all_years.csv')  # reload the combined data from disk
pca = PCA()

X = songs.drop(['title','artist','top genre'],axis=1)

%matplotlib inline
pca.fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components (minus 1)')
plt.ylabel('cumulative explained variance')
[figure: cumulative explained variance vs. number of components]

We see that about 70% of the variance in the data can be explained by two principal components (note that the first component is labeled "0" on the plot). Let's express the data in terms of these component directions and group the result by genre.
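Rather than reading the threshold off the plot by eye, the component count can be pulled out of the cumulative sum directly. A sketch, using random stand-in data since the Spotify file isn't bundled here:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data with correlated features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 11)) @ rng.normal(size=(11, 11))

# searchsorted finds the first index where the cumulative explained
# variance reaches the target; +1 converts the 0-based index to a count.
cum = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_needed = int(np.searchsorted(cum, 0.70)) + 1
print(n_needed, 'components reach 70% of the variance')
```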

pca = PCA(0.70)  # a fractional argument keeps enough components to explain 70% of the variance
pca.fit(X)
X_pca = pd.DataFrame(pca.transform(X))
X_pca['genre'] = songs['top genre']
genre_pca = X_pca.groupby(by=['genre']).mean()
print(genre_pca.head())
                          0          1
genre                                 
acoustic blues    65.129148 -10.982230
adult standards  -45.261659  30.869426
afrobeat         -64.351904 -26.235664
afropop          178.522942  64.016836
album rock        33.228259  -4.024576

So, for example, both acoustic blues and afropop load high on principal component zero, while of the two only afropop loads high on principal component one. We can see how these values relate to, say, danceability with a scatter plot.
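Another way to tie the component coordinates back to the original fields is to inspect the component weights themselves, which `PCA` exposes as `pca.components_`. A sketch on random stand-in data (the column names are the numeric fields from the table above, minus `year`):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in data; on the real songs frame, the rows of
# pca.components_ say how much each original feature contributes
# to pca0 and pca1.
rng = np.random.default_rng(1)
cols = ['bpm', 'nrgy', 'dnce', 'dB', 'live', 'val',
        'dur', 'acous', 'spch', 'pop']
X_demo = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)

pca = PCA(n_components=2).fit(X_demo)
loadings = pd.DataFrame(pca.components_, columns=cols,
                        index=['pca0', 'pca1'])
print(loadings.round(2))
```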

plt.scatter(X_pca.iloc[:,0],X_pca.iloc[:,1],s=10, \
            c=songs['dnce'])
plt.xlabel('pca0')
plt.ylabel('pca1')
plt.colorbar()
[figure: songs in PCA coordinates, colored by danceability]

The more danceable genres tend to load low on principal component 1. We can see which genres are the most and least danceable by querying the data:

high_dnce = (genre_pca[1]<0)
genre_pca[high_dnce].head()
                            0          1
genre                                   
acoustic blues      65.129148 -10.982230
afrobeat           -64.351904 -26.235664
album rock          33.228259  -4.024576
alternative metal    1.633728 -31.132024
alternative r&b    -15.708144 -22.355832
low_dnce = (genre_pca[1]>50.0)
genre_pca[low_dnce].head()
                           0           1
genre                                   
afropop           178.522942   64.016836
avant-garde jazz  213.712463  140.855648
bebop             160.565217  104.111861
blues             229.492590   74.519282
british folk      -18.102521   61.018127

Let's repeat the same steps for popularity, then pull out the ranges of principal component 0 where the more and less popular genres sit:

plt.scatter(X_pca.iloc[:,0],X_pca.iloc[:,1],s=10, \
            c=songs['pop'])
plt.xlabel('pca0')
plt.ylabel('pca1')
plt.colorbar()
[figure: songs in PCA coordinates, colored by popularity]

most_pop = (genre_pca[0]>-50) & (genre_pca[0]<100)
genre_pca[most_pop].head()
                              0          1
genre                                     
acoustic blues        65.129148 -10.982230
adult standards      -45.261659  30.869426
album rock            33.228259  -4.024576
alternative country  -40.116761  33.310572
alternative metal      1.633728 -31.132024

least_pop = (genre_pca[0]>-150) & (genre_pca[0]<-50)
genre_pca[least_pop].head()
                                 0          1
genre                                        
afrobeat                -64.351904 -26.235664
american folk revival   -64.653072  22.946793
appalachian folk        -70.237852  -0.487309
australian rock         -58.740034  -4.316319
australian talent show -103.017823 -35.674851

Of course, we could have found out the same things by querying the original data. The main point of these scatter plots is to build a bridge between the data expressed in PCA coordinates and the original features, making the PCA coordinates more interpretable. The main strength of principal component analysis is that it lets us filter out noise, or make an otherwise computationally prohibitive problem much more tractable, by reducing the number of features a model has to deal with.
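The noise-filtering idea can be sketched on synthetic data: project onto the top component and reconstruct with `inverse_transform`, which discards the variance in the remaining directions, most of which is noise here.

```python
import numpy as np
from sklearn.decomposition import PCA

# A rank-1 signal plus small noise: keeping only the top principal
# component should recover the signal better than the raw noisy data.
rng = np.random.default_rng(2)
signal = np.outer(np.linspace(0, 1, 100), rng.normal(size=10))
noisy = signal + 0.05 * rng.normal(size=signal.shape)

pca = PCA(n_components=1).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))
print('noisy error:   ', np.abs(noisy - signal).mean())
print('denoised error:', np.abs(denoised - signal).mean())
```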
