
Clustering

This part of the clustering analysis runs mainly on the Spotify data.

    After filtering out duplicates, our dataset contains just over 510k songs with audio features. Some clustering algorithms are very slow at this scale, so we sample the data to let the program finish in reasonable time. With a sampling fraction of 0.1, which yields about 51k data points, here are the results of clustering with different algorithms. Ten attributes are fed into each model: danceability, energy, key, loudness, speechiness, acousticness, instrumentalness, liveness, valence and tempo. Because ten dimensions cannot be shown on a single plot, the graphs above show the clustering results on a PCA projection of the attributes.
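    The sampling and projection steps can be sketched as follows. This is a minimal illustration on synthetic data, not our actual pipeline: the helper names `sample_rows` and `pca_project` are hypothetical, and standardizing the features before PCA is an assumption made here so that wide-range attributes such as tempo and loudness do not dominate the projection.

```python
import numpy as np

def sample_rows(X, fraction, seed=0):
    """Return a random subset of rows, drawn without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(X.shape[0] * fraction))
    idx = rng.choice(X.shape[0], size=n_keep, replace=False)
    return X[idx]

def pca_project(X, n_components=2):
    """Project standardized features onto the top principal components."""
    # Assumption: features are standardized first (no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

# Synthetic stand-in for the song table: 1000 rows, 10 audio features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))

X_sample = sample_rows(X, fraction=0.1)   # 100 rows remain
X_2d = pca_project(X_sample)              # 2-D coordinates for plotting
```

    The clustering itself still runs on all ten attributes; the 2-D coordinates are only used to draw the results.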

    The index values for the algorithms basically agree with what we can see from the plots: DBSCAN is not able to separate the data into clusters and only flags a few outliers, resulting in an index of only 94. DBSCAN is a density-based clustering method, so different values of eps produce different clusters. In this dataset, however, there seem to be no clusters that can be found by density. As we varied eps, the result was either one big cluster containing most of the data, or a small cluster and a huge number of outliers. This suggests that, in terms of these ten attributes, the songs are likely to be close to normally distributed across the feature space, forming one continuous mass rather than separate dense regions.
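    The effect of varying eps can be reproduced on toy data. The sketch below uses scikit-learn's `DBSCAN` on a single synthetic Gaussian blob. The helper `dbscan_summary`, the eps grid, and `min_samples=5` are illustrative assumptions, not the settings of our experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_summary(X, eps_values, min_samples=5):
    """For each eps: (eps, number of clusters, noise points, largest cluster size)."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clustered = labels[labels >= 0]          # label -1 marks noise
        n_clusters = len(set(clustered.tolist()))
        n_noise = int(np.sum(labels == -1))
        largest = int(np.bincount(clustered).max()) if clustered.size else 0
        rows.append((eps, n_clusters, n_noise, largest))
    return rows

# One 10-dimensional Gaussian blob: no density-separated groups by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

summary = dbscan_summary(X, eps_values=[0.5, 2.0, 3.0, 5.0])
for eps, n_clusters, n_noise, largest in summary:
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}, largest={largest}")
```

    On data like this, a small eps marks almost everything as noise and a large eps merges almost everything into one cluster, which is the same pattern we observed on the songs.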

    By ranking the Calinski-Harabasz index of every attribute pair within a single clustering result, we can find, for each algorithm, the attribute pairs that separate the clusters best by taking the highest scores:
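    As a sketch of this ranking, the code below computes the Calinski-Harabasz index directly from its definition (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) and scores every pair of columns against a fixed labeling. The function names, the toy data, and the three attribute names chosen for the columns are only for illustration.

```python
from itertools import combinations
import numpy as np

def calinski_harabasz(X, labels):
    """CH index: [B / (k - 1)] / [W / (n - k)] for n points in k clusters."""
    n = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += members.shape[0] * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))

def rank_attribute_pairs(X, labels, names):
    """Score each 2-D projection of X against the same labels, best first."""
    scores = []
    for i, j in combinations(range(X.shape[1]), 2):
        score = calinski_harabasz(X[:, [i, j]], labels)
        scores.append((score, names[i], names[j]))
    return sorted(scores, reverse=True)

# Toy data: two clusters that differ in the first two columns only.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 3))
X[labels == 1, 0] += 5.0
X[labels == 1, 1] += 5.0

ranking = rank_attribute_pairs(X, labels, ["energy", "loudness", "key"])
```

    Here the pair of columns that both carry the cluster shift ranks first, since a higher index means the clusters are better separated in that projection.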

    Then we draw a matrix of the hierarchical clustering results projected onto every attribute pair:

It's all fairly reasonable.

    And in the projection onto loudness and release year (note that release year is not an attribute used in clustering):

    We can see that each cluster has its own distribution across time, which suggests that musical trends differ over time.
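    One simple way to put numbers on this observation is to tabulate the share of each cluster per decade. The sketch below uses made-up years and labels. The helper `cluster_share_by_decade` and the choice of decade-sized bins are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def cluster_share_by_decade(years, labels):
    """Map each decade to the fraction of its songs falling in each cluster."""
    counts = defaultdict(Counter)
    for year, label in zip(years, labels):
        decade = (year // 10) * 10
        counts[decade][label] += 1
    shares = {}
    for decade, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[decade] = {label: n / total for label, n in sorted(counter.items())}
    return shares

# Made-up release years and cluster labels.
years = [1965, 1968, 1972, 1999, 2004, 2008, 2009]
labels = [0, 0, 0, 1, 1, 0, 1]

shares = cluster_share_by_decade(years, labels)
```

    If the shares of a cluster change noticeably from decade to decade, that cluster follows its own trend over time.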
