Artists data set

In the artists data set, the main issue is duplication: 28883 rows repeat an "id" that should be unique. Dropping these duplicates from the original 87278 rows leaves 58395 rows, which we save to a new file, clean_artist.csv. Note that the artists data is not an independent dataset; it is an expansion of the Spotify songs data, since storing the artist information inside the song attributes would be redundant (every artist has multiple songs).
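The deduplication step can be sketched with pandas as follows. This is a minimal illustration, not the project's actual script: the small in-memory frame stands in for the raw artists file, and only the "id" column name and the clean_artist.csv output name come from the text.

```python
import pandas as pd

# Tiny stand-in for the raw artists file; in the project this
# would come from something like pd.read_csv(...).
artists = pd.DataFrame({
    "id":   [1, 1, 2, 3, 3],
    "name": ["A", "A", "B", "C", "C"],
})

# "id" must be unique, so keep the first occurrence of each id.
clean = artists.drop_duplicates(subset="id", keep="first")

# Write the deduplicated rows to the new file named in the text.
clean.to_csv("clean_artist.csv", index=False)
print(len(clean))  # 3 unique artists remain in this toy example
```

On the real file the same call reduces 87278 rows to 58395.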

[Image: art1.PNG]

Data cleaning

Songs data set

In the songs dataset, the issues are duplicates, missing values, and wrong values (treated as noise). We apply the following steps:

  1) Remove duplicates: scan the whole dataset by id_song; when the same id_song appears more than once, keep the first row and drop the rest.

  2) Fill missing values: replace each missing entry with the mean of all values in that attribute.

  3) Remove out-of-bound values: for instance, 'tempo' must be greater than 0, and 'loudness' cannot exceed 0, or the track would not be a human-made song.

After this cleaning process, 510980 rows of data remain in the songs dataset.
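The three steps above can be sketched as one small pandas pipeline. This is a hedged illustration: "id_song", 'tempo', and 'loudness' come from the text, but the sample values are made up and the real pipeline would run over the full CSV.

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the raw songs file.
songs = pd.DataFrame({
    "id_song":  [1, 1, 2, 3, 4],
    "tempo":    [120.0, 120.0, np.nan, -5.0, 95.0],
    "loudness": [-7.2, -7.2, -5.0, -6.1, 3.0],
})

# 1) Remove duplicates: keep the first row for each id_song.
songs = songs.drop_duplicates(subset="id_song", keep="first")

# 2) Fill missing values with the mean of each attribute.
num_cols = ["tempo", "loudness"]
songs[num_cols] = songs[num_cols].fillna(songs[num_cols].mean())

# 3) Drop out-of-bound rows: tempo must be positive, and the
#    loudness of a human-made song cannot exceed 0 dB.
songs = songs[(songs["tempo"] > 0) & (songs["loudness"] <= 0)]

print(len(songs))  # 2 rows survive the three steps
```

Note that the order matters: deduplicating first keeps duplicate rows from skewing the column means used in step 2.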

song_data.PNG

The cleaned data is far larger than any other Spotify dataset we could find online. The largest dataset currently reachable, on Kaggle, contains about 160k songs, while ours holds more than three times as many.

Influence data set

For the influence dataset, we keep all the features of the original data. In addition, we add a new variable, the follower's active starting year minus the influencer's active starting year, to describe how long the influencers can influence others.
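The derived variable can be computed in one line. The column names below ("influencer_active_start", "follower_active_start") and the name of the new column are assumptions for illustration; only the subtraction itself comes from the text.

```python
import pandas as pd

# Tiny stand-in for the influence file; column names are assumed.
influence = pd.DataFrame({
    "influencer_name":         ["A", "B"],
    "follower_name":           ["X", "Y"],
    "influencer_active_start": [1970, 1990],
    "follower_active_start":   [1985, 1990],
})

# New variable: follower's starting year minus influencer's
# starting year, a rough measure of how long influence persists.
influence["influence_span"] = (
    influence["follower_active_start"]
    - influence["influencer_active_start"]
)

print(influence["influence_span"].tolist())  # [15, 0]
```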

Wikipedia data set

[Image: fa4bb777c7941209f42ebc019d765ed.png]

The main issue with the text data is inconsistency, especially in the influence section. The figure above shows the fraction of text containing tags among all valid data, so it is reasonable to remove all tags to keep the data clean. Still, some noise remains: we cannot handle every edge case, and these noisy terms would harm the text analysis.
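Tag removal of this kind is typically a regular-expression pass. The snippet below is a sketch under an assumption: the sample string and the two tag patterns (HTML-style `<...>` tags and wiki-style `[[...]]` links) are chosen to resemble typical Wikipedia markup, not taken from the actual data.

```python
import re

# Example snippet standing in for a Wikipedia influence passage.
raw = "Influenced by <ref>The Beatles</ref> and [[Bob Dylan]]."

# Strip <...> tags entirely, then unwrap [[target|label]] links,
# keeping only the link target as plain text.
text = re.sub(r"<[^>]*>", "", raw)
text = re.sub(r"\[\[([^\]|]*)(?:\|[^\]]*)?\]\]", r"\1", text)

print(text)  # Influenced by The Beatles and Bob Dylan.
```

As the surrounding text notes, a pattern list like this cannot cover every edge case, so some noisy terms survive into the analysis.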