Inspiration

https://youtu.be/pGntmcy_HX8?si=LRdT2GLWSJiytaKf

Screenshot 2024-08-29 at 5.52.56 PM.png

image.png

Screenshot 2024-08-29 at 5.55.05 PM.png

Screenshot 2024-08-29 at 5.55.43 PM.png

From the video, we can tell that Spotify uses two approaches:

  1. Collaborative filtering (based on proximity in user behavior)
    1. How often do tracks end up in the same playlists?
  2. Content-based filtering (based on similarity in release date, label, danceability, loudness, lyrical context, etc.)
    1. Do the tracks share the same holiday, lyrics, loudness, genre, etc.?

But how do we measure “similarity”? → Use cosine similarity, computed over all features.

Cosine similarity is particularly useful for comparing text-based or ratings-based feature vectors because it is not affected by the magnitude of the vectors, only by their direction.
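To make the "direction, not magnitude" point concrete, here is a minimal sketch of cosine similarity in NumPy. The three feature vectors are made-up values for illustration, not rows from the dataset, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "not similar"
    return float(np.dot(a, b) / denom)

# Hypothetical feature vectors (e.g. danceability, energy, valence).
track_a = [0.8, 0.6, 0.2]
track_b = [0.4, 0.3, 0.1]  # same direction as track_a, half the magnitude
track_c = [0.1, 0.2, 0.9]  # a different balance of features

sim_ab = cosine_similarity(track_a, track_b)
sim_ac = cosine_similarity(track_a, track_c)
print(round(sim_ab, 4), round(sim_ac, 4))
```

track_b is just track_a scaled down, so the two point in the same direction and score (numerically) 1, while track_c scores noticeably lower.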

Data Preparation

To compute cosine similarity, we need 2 vectors:

  1. Tracks Vector

    1. 20 features, 1159764 rows

      Spotify_1Million_Tracks

      Screenshot 2024-09-05 at 11.29.45 AM.png

    2. A dataset this large is better kept in cloud storage than in local storage → AWS S3 bucket

      Screenshot 2024-09-06 at 2.03.10 PM.png

  2. Playlist Vector

    1. We measure similarity between the tracks in the first Tracks Vector and a real Spotify playlist. This playlist should contain a variety of songs so that it covers a wide range of features.

    2. Go to Spotify for Developers to set up API credentials, then authenticate with Spotipy (a Python library for the Spotify Web API)

      Screenshot 2024-08-30 at 1.29.01 PM.jpeg

    3. Once authenticated, we have access to Spotify’s data. Let’s choose a big playlist with diverse songs.

      Screenshot 2024-09-05 at 11.35.35 AM.png
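The playlist step above can be sketched roughly as follows. This is not the exact code from the screenshots: `fetch_playlist_tracks` is a hypothetical helper name, and it only assumes that `sp` behaves like a `spotipy.Spotify` client (its `playlist_items` and `next` methods). The credentials and playlist ID in the usage comment are placeholders.

```python
def fetch_playlist_tracks(sp, playlist_id):
    """Collect titles, artists, and URIs for every track in a playlist.

    `sp` is assumed to behave like a spotipy.Spotify client.
    """
    titles, artists, uris = [], [], []
    page = sp.playlist_items(playlist_id, limit=100)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track:  # removed or unavailable tracks can come back empty
                continue
            titles.append(track["name"])
            artists.append(track["artists"][0]["name"])  # first listed artist only
            uris.append(track["uri"])
        # Playlist items are paginated; "next" is None on the last page.
        page = sp.next(page) if page.get("next") else None
    return titles, artists, uris

# Usage with real credentials (not run here; all values are placeholders):
#   import spotipy
#   from spotipy.oauth2 import SpotifyClientCredentials
#   sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
#       client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))
#   titles, artists, uris = fetch_playlist_tracks(sp, "YOUR_PLAYLIST_ID")
```

Paging matters for a big playlist, because a single request returns at most 100 items.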

Data Cleaning

  1. Tracks Vector

    1. Drop unnecessary columns: Unnamed: 0, key, duration_ms, time_signature
    2. All columns must be numerical to compute cosine similarity
      1. Perform multi-hot encoding on text-based columns like genre, turning them into binary columns that indicate whether or not a song belongs to a specific genre
    3. Engineer the years into periodic features to reduce dimensionality
    4. Rescale all features to a 0-100 scale

    Screenshot 2024-09-06 at 12.59.52 PM.png

  2. Playlist Vector

    1. Create a dataframe from the Wild & Free playlist that we accessed through the API, ensuring its features are consistent with the 1st Tracks Vector’s dataframe.

      1. Extract track details (track name, artist name, track URI) from the playlist and store them in 3 separate lists: titles, artists, and uri

      2. Initialize the audio features for each song and fetch them in a loop: danceability, energy, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.

        Screenshot 2024-09-05 at 11.39.20 AM.png

    2. But we don’t have one of the most important features: genre. Luckily, Spotify provides genres for each artist.

      1. Define a process_artist function to fetch the genres associated with an artist
      2. Create a task for each artist to get their genres, and speed up processing by using joblib for parallel processing (distributing the process_artist tasks across multiple CPU cores)

      Screenshot 2024-09-05 at 12.54.18 PM.png

    3. But it turns out that the genres Spotify provides for each artist do not necessarily match the genres in the Tracks Vector. Again, both vectors need identical features with comparable values to compute cosine similarity…

      1. Create updated_genre_list from the unique genres present in the Tracks Vector, which ensures we use the same genres across both datasets
      2. Perform one-hot encoding to turn the text-based columns into binary columns that indicate which genres each song belongs to

      Screenshot 2024-09-05 at 1.54.12 PM.png
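The Tracks Vector cleaning steps can be sketched with pandas as below. This is an illustration on a tiny made-up dataframe, not the code from the screenshots: `clean_tracks` is a hypothetical helper, the year step is left out because the exact transformation isn't spelled out above, and min-max rescaling is assumed as the way to reach the 0-100 scale.

```python
import pandas as pd

def clean_tracks(df, drop_cols=("Unnamed: 0", "key", "duration_ms", "time_signature")):
    # 1. Drop the unnecessary columns (only those actually present).
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # 2. Turn the text-based genre column into binary indicator columns.
    #    With one genre per row this reduces to one-hot encoding.
    if "genre" in df.columns:
        dummies = pd.get_dummies(df["genre"], prefix="genre", dtype=int)
        df = pd.concat([df.drop(columns="genre"), dummies], axis=1)

    # 3. Min-max rescale every numeric column to 0-100 (assumed method).
    num = df.select_dtypes("number")
    span = (num.max() - num.min()).replace(0, 1)  # avoid dividing by zero
    df[num.columns] = (num - num.min()) / span * 100
    return df

# Made-up rows for illustration only.
tracks = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "genre": ["pop", "rock", "pop"],
    "danceability": [0.2, 0.5, 0.8],
    "loudness": [-20.0, -10.0, 0.0],
    "key": [1, 5, 7],
})
cleaned = clean_tracks(tracks)
print(cleaned)
```

After cleaning, the genre columns hold 0 or 100, which puts them on the same footing as the rescaled audio features.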

FINALLY, both vectors have consistent features, so we can compute cosine similarity!
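One way to double-check that consistency is to encode the playlist's artist genres against updated_genre_list and then force the playlist dataframe into the Tracks Vector's column order. The sketch below uses made-up data. `encode_playlist_genres`, the `genres` column name, and the exact-match rule between artist genres and track genres are assumptions for illustration.

```python
import pandas as pd

def encode_playlist_genres(playlist_df, updated_genre_list, genres_col="genres"):
    """Add one binary column per genre in updated_genre_list."""
    encoded = playlist_df.drop(columns=[genres_col]).copy()
    for genre in updated_genre_list:
        # Assumption: an artist genre counts only on an exact string match.
        encoded[f"genre_{genre}"] = playlist_df[genres_col].apply(
            lambda artist_genres, g=genre: int(g in artist_genres)
        )
    return encoded

# Made-up playlist rows; "genres" holds the list fetched per artist.
playlist = pd.DataFrame({
    "danceability": [70.0, 40.0],
    "genres": [["pop", "dance"], ["k-pop"]],
})
updated_genre_list = ["pop", "rock", "dance"]  # genres found in the Tracks Vector
track_columns = ["danceability", "genre_pop", "genre_rock", "genre_dance"]

encoded = encode_playlist_genres(playlist, updated_genre_list)
# Reorder to the Tracks Vector's columns; anything missing is filled with 0.
playlist_vector = encoded.reindex(columns=track_columns, fill_value=0)

assert list(playlist_vector.columns) == track_columns
print(playlist_vector)
```

Genres that exist only on the Spotify side (here "k-pop") are dropped, since they have no matching column in the Tracks Vector.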

Final Track Vector

Final Playlist Vector

Modeling: Cosine Similarity-Based Recommendation System

If the user changes the 2nd Playlist Vector, for example by supplying a different song or playlist through the API, it acts as the reference vector against which the 1st Tracks Vector is compared, which personalizes the recommendations.

In cosine similarity, you are comparing the "angle" between two vectors. By using the average vector, you're comparing each individual track's vector to this reference point, allowing you to measure how similar each track is to the overall "profile" of the user's favorite songs.

  1. Create a reference vector from the 2nd Playlist Vector by calculating the average of each column

    Screenshot 2024-09-06 at 1.26.35 PM.png

    Ex. If the average danceability is 0.55 and the average energy is 0.54, a track with a similar balance of values (e.g., danceability = 0.57, energy = 0.52) will get a higher similarity score than a track whose values are significantly different.

  2. Get similarity scores
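Both steps can be sketched together: average the playlist columns into a reference vector, then score every track against it. The dataframes below are made-up stand-ins for the two final vectors, and `rank_tracks` is a hypothetical helper rather than the project's actual function.

```python
import numpy as np
import pandas as pd

def rank_tracks(tracks_df, playlist_df, top_n=5):
    """Rank tracks by cosine similarity to the playlist's average vector."""
    # Step 1: reference vector = per-column mean of the playlist.
    ref = playlist_df.mean(axis=0).to_numpy(dtype=float)

    # Step 2: cosine similarity between each track row and the reference.
    X = tracks_df[playlist_df.columns].to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(ref)
    scores = np.divide(X @ ref, norms, out=np.zeros(len(X)), where=norms > 0)

    ranked = pd.Series(scores, index=tracks_df.index, name="similarity")
    return ranked.sort_values(ascending=False).head(top_n)

# Made-up values on the 0-100 scale.
playlist = pd.DataFrame({"danceability": [50.0, 60.0], "energy": [60.0, 48.0]})
tracks = pd.DataFrame(
    {"danceability": [57.0, 90.0, 10.0], "energy": [52.0, 10.0, 90.0]},
    index=["t1", "t2", "t3"],
)

top = rank_tracks(tracks, playlist, top_n=3)
print(top)
```

The playlist averages to danceability 55 and energy 54, so t1, whose balance of the two features is closest to that profile, ranks first.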