Classification
This analysis explores model performance on song classification. This section mainly uses data from Spotify.
We select four models, Naive Bayes, Decision Tree, Random Forest, and KNN, to classify the music data. We choose these models because we want to compare the results of approaches based on statistical assumptions, on trees, and on distance. In addition, we are curious about how well these models handle multi-class problems.
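As a rough sketch (assuming scikit-learn; the settings shown are defaults rather than the tuned values reported later), the four classifier families could be set up as follows:

```python
# Sketch: the four classifier families compared in this section.
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Naive Bayes": GaussianNB(),                # statistical assumptions
    "Decision Tree": DecisionTreeClassifier(),  # single tree
    "Random Forest": RandomForestClassifier(),  # tree ensemble
    "KNN": KNeighborsClassifier(),              # distance-based
}
```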
We first apply a K-means model to compress the number of genres. Referring to the count used in the influence data, we set the number of clusters to 10.
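A minimal sketch of this genre-compression step, assuming a hypothetical `genre_features` matrix of per-genre averaged audio features:

```python
# Sketch: compressing the genres into 10 clusters with K-means.
# `genre_features` is a hypothetical (n_genres, n_features) array.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
genre_cluster = kmeans.fit_predict(genre_features)  # cluster label per genre
```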
In the pre-processing step, we normalize the numerical features with min-max scaling so that all values fall in the range 0 to 1.
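For example, assuming the numerical features are stored in a hypothetical array `X_num`, the scaling could look like:

```python
# Sketch: min-max scaling of the numerical features to [0, 1].
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_num)  # each feature rescaled to the 0-1 range
```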
In order to obtain better models, we use grid search to optimize the hyperparameters. Since the Naive Bayes model relies on a prior distribution, its parameters are left at their defaults. The tuned parameters are displayed below:
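For illustration only, a sketch of how the grid search could be run for one of the models; the grid shown here is an assumption, not the tuned values reported in the table:

```python
# Sketch: hyperparameter tuning with grid search (illustrative grid only).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="precision_macro")
search.fit(X_scaled, y)            # y: cluster labels from the K-means step
best_rf = search.best_estimator_   # tuned Random Forest
```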
We run 10-fold cross validation for each model to check whether it is overfitted. We then draw the ROC curve for each classification model and annotate the AUC value of each target cluster in the figure.
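A sketch of this evaluation step, assuming the tuned estimator `best_rf` and the scaled features from above, with one-vs-rest ROC/AUC computed per cluster:

```python
# Sketch: 10-fold CV scores and one-vs-rest ROC/AUC for each cluster.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

cv_scores = cross_val_score(best_rf, X_scaled, y, cv=10)  # overfitting check

X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, stratify=y, random_state=0)
proba = best_rf.fit(X_tr, y_tr).predict_proba(X_te)
y_bin = label_binarize(y_te, classes=np.unique(y))
for k in range(y_bin.shape[1]):                # AUC for each target cluster
    fpr, tpr, _ = roc_curve(y_bin[:, k], proba[:, k])
    print(f"cluster{k}: AUC = {auc(fpr, tpr):.2f}")
```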
The table shows that all four models are unsatisfactory for our data set. All precision values are below 0.5 and the AUC values are all around 0.7. The best model is Random Forest, which as an ensemble method exceeds the ability of a single tree, although its advantage is not distinct. The tree-based models perform better than the statistical model. It is surprising that KNN performs similarly to Decision Tree; we speculate that because our classes come from distance-based K-means clustering, they are better suited to KNN.
Figures: ROC curves with per-cluster AUC values for the Naive Bayes, Decision Tree, Random Forest, and KNN models.
The ROC figures show a common trend: cluster 0, cluster 1, cluster 3, and cluster 6 are well predicted. Besides cluster 3 (comedy), we label the other three clusters 'metal music', 'classic music', and 'dynamic music' respectively, which suggests that these three groups have distinctive features. Still, the models differ from one another. KNN cannot classify clusters 8 and 9 well, while Random Forest predicts them better but categorizes clusters 2, 4, and 5 poorly. Decision Tree behaves similarly to Random Forest, but its results are worse.
The data are unbalanced (e.g. cluster 3 contains only comedy songs). It should be mentioned that both tree models have a class_weight parameter, but we find that weighting the classes does not improve the results. Therefore we use down-sampling to control the size of each cluster and resolve the imbalance, after which precision improves greatly.
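A sketch of the down-sampling step, assuming the data sit in a hypothetical pandas DataFrame `df` with a `cluster` label column:

```python
# Sketch: down-sampling every cluster to the size of the smallest one.
import pandas as pd

n_min = df["cluster"].value_counts().min()          # size of the rarest cluster
balanced = (df.groupby("cluster", group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=0)))
```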
When we cluster the genres into fewer groups, the results improve. We do not display those results here because we think doing so betrays the essence of the music: each genre has its own features. We want a model that predicts many different genres well, and reducing them to only a few groups defeats that purpose.