Spotify AI-Driven Song Recommendation System

Inspiration

https://youtu.be/pGntmcy_HX8?si=LRdT2GLWSJiytaKf

From the video, we can tell that Spotify uses two approaches:

Collaborative filtering (based on proximity in user behavior)
1. Which tracks are playlisted together very often?
Content-based filtering (based on similarity in release date, label, danceability, loudness, lyrical context, etc.)
1. Do the tracks share the same holiday, lyrics, loudness, genre, etc.?

But how do we measure “similarity”? → Use cosine similarity, computed across all features

Cosine similarity is particularly useful for comparing text- or ratings-based feature vectors because it is not affected by the magnitude of the vectors, only their direction.
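To make the "direction, not magnitude" point concrete, here is a minimal sketch of cosine similarity in NumPy. The function name, the zero-vector convention, and the three sample vectors are illustrative assumptions, not values from the project.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return float(np.dot(a, b) / denom)

# Hypothetical feature vectors (made-up numbers)
track_a = [0.6, 0.8, 0.1]
track_b = [6.0, 8.0, 1.0]   # same direction as track_a, 10x the magnitude
track_c = [0.1, 0.2, 0.9]   # points in a different direction

sim_ab = cosine_similarity(track_a, track_b)  # effectively 1: scaling does not matter
sim_ac = cosine_similarity(track_a, track_c)  # clearly lower: the directions differ
```

Because only the direction counts, track_b scores as a perfect match for track_a even though every value is ten times larger.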

Data Preparation

To compute cosine similarity, we need 2 vectors:

Tracks Vector
1. 20 features, 1,159,764 rows (dataset: Spotify_1Million_Tracks)
2. A dataset this large calls for efficient storage rather than local storage → AWS S3 bucket
Playlist Vector
1. We measure similarity between the tracks in the Tracks Vector and a real Spotify playlist. This playlist should contain a variety of songs to cover a wide range of features
2. Go to Spotify for Developers to set up API credentials and authenticate with Spotipy (a Python library for the Spotify Web API)
3. Once authenticated, we have access to Spotify’s data, so let’s choose a big playlist with diverse songs

Data Cleaning

Tracks Vector
1. Drop unnecessary columns: Unnamed: 0, key, duration_ms, time_signature
2. All columns must be numerical to compute cosine similarity
  1. Perform multi-hot encoding to turn text-based columns like genre into binary columns → each column indicates whether or not a song belongs to a specific genre
3. Feature-engineer the year column into periodic features and reduce dimensionality
4. Scale all features to a 0-100 range
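The cleaning steps above can be sketched with pandas as follows. This is a rough illustration, not the project's actual code: the function name clean_tracks is hypothetical, it assumes each track carries a single genre label (so the multi-hot encoding reduces to one-hot), it uses min-max scaling to reach the 0-100 range, and it skips the year feature engineering because the notes do not spell out that step.

```python
import pandas as pd

def clean_tracks(df, drop_cols=("Unnamed: 0", "key", "duration_ms", "time_signature")):
    """Hypothetical helper: drop columns, encode genre, scale numerics to 0-100."""
    # Drop the unneeded columns; ignore any that are not present
    df = df.drop(columns=list(drop_cols), errors="ignore")

    # Turn the text genre column into binary genre_<name> columns
    # (assumption: one genre label per track)
    genre_dummies = pd.get_dummies(df["genre"], prefix="genre", dtype=int)
    df = pd.concat([df.drop(columns=["genre"]), genre_dummies], axis=1)

    # Min-max scale every numeric column to the 0-100 range
    num_cols = df.select_dtypes(include="number").columns
    col_min = df[num_cols].min()
    col_range = (df[num_cols].max() - col_min).replace(0, 1)  # avoid dividing by zero
    df[num_cols] = (df[num_cols] - col_min) / col_range * 100
    return df

# Tiny made-up sample to show the shape of the result
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "genre": ["rock", "jazz", "rock"],
    "danceability": [0.2, 0.5, 0.8],
    "key": [1, 2, 3],
})
cleaned = clean_tracks(sample)
```

After cleaning, the sample keeps danceability plus genre_jazz and genre_rock, all on the same 0-100 scale.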
Playlist Vector
1. Create a dataframe from the Wild & Free playlist that we accessed through the API, keeping its features consistent with the Tracks Vector’s dataframe.
  1. Extract track details (track name, artist name, track URI) from the playlist and store them in 3 separate lists: titles, artists, and uri
  2. Initialize the audio features for each song and fetch them in a loop: danceability, energy, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, and tempo.
2. But we are missing one of the most important features: genre. Luckily, Spotify provides genres for each artist
  1. Define a process_artist function to fetch the genres associated with an artist
  2. Create a task for each artist to get their genres, and speed up processing by using joblib for parallel processing (distributing process_artist tasks across multiple CPU cores)
3. But it turns out that Spotify provides its own unique genre labels for each artist, which do not line up with the genres in the Tracks Vector. Again, we need identical features with comparable values to compute cosine similarity…
  1. Create updated_genre_list from the unique genres present in the Tracks Vector, which ensures we use the same genres across both datasets
  2. Perform one-hot encoding to turn the text-based genre column into binary columns indicating which genres each song belongs to
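One way to sketch the genre alignment described above is shown below. The notes do not say how raw Spotify artist genres are mapped onto updated_genre_list, so the matching rule here (lowercase, replace spaces with hyphens, then check whether the dataset genre appears inside the raw label) is purely an assumption, as are the function name and the sample values.

```python
import pandas as pd

def encode_playlist_genres(artist_genres, updated_genre_list):
    """Hypothetical helper: one binary genre_<name> column per dataset genre.

    artist_genres: one list of raw artist genre strings per playlist song.
    updated_genre_list: the genre names taken from the Tracks Vector.
    """
    columns = [f"genre_{g}" for g in updated_genre_list]
    rows = []
    for raw_genres in artist_genres:
        # Assumed normalization: "Black Metal" -> "black-metal"
        normalized = [r.lower().replace(" ", "-") for r in raw_genres]
        # Assumed rule: flag a dataset genre if it appears within any raw label
        rows.append({
            f"genre_{g}": int(any(g in r for r in normalized))
            for g in updated_genre_list
        })
    return pd.DataFrame(rows, columns=columns)

# Made-up example input
updated_genre_list = ["ambient", "black-metal", "gospel"]
artist_genres = [
    ["atmospheric black metal"],  # matches black-metal
    ["Ambient", "drone"],         # matches ambient
    [],                           # artist with no genres listed
]
playlist_genres = encode_playlist_genres(artist_genres, updated_genre_list)
```

Building the columns from updated_genre_list guarantees the playlist dataframe has exactly the same genre_ columns as the Tracks Vector, even for genres that no playlist song matches.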

FINALLY, both vectors have consistent features, so we can compute cosine similarity!

Final Track Vector

Final Playlist Vector

Modeling: Cosine Similarity-Based Recommendation System

If the user changes the Playlist Vector by supplying a different song, playlist, etc. through the API, it acts as the reference vector against which the Tracks Vector is compared to personalize the recommendations

In cosine similarity, you compare the "angle" between two vectors. By using the average vector, you compare each individual track's vector to this reference point, which measures how similar each track is to the overall "profile" of the user's favorite songs.

Create the reference vector from the Playlist Vector by calculating the average of each column

Ex. If the average for danceability is 0.55 and the average for energy is 0.54, a track with similar values (e.g., danceability = 0.57, energy = 0.52) will have a higher similarity score than a track that is significantly different.
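The example above can be reproduced in a short sketch. The two playlist rows are made up so that the column averages come out to 0.55 and 0.54; the "different" track and all variable names are likewise illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up playlist whose column averages match the example (0.55 and 0.54)
playlist = pd.DataFrame({
    "danceability": [0.50, 0.60],
    "energy": [0.50, 0.58],
})

# Reference vector: the average of each column
reference = playlist.mean()

def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

similar_track = [0.57, 0.52]    # close to the playlist profile
different_track = [0.10, 0.90]  # very different balance of features

sim_similar = cosine(reference.to_numpy(), similar_track)
sim_different = cosine(reference.to_numpy(), different_track)
```

The similar track scores higher than the different one. Keep in mind that cosine similarity only looks at the balance between features: a track at danceability = 0.11 and energy = 0.108 would score just as high as one sitting exactly on the averages, because it points in the same direction.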
Get similarity scores
- Updated Process for Cosine Similarity:
  1. Ensure the columns of feat_vec (the set of songs you are comparing to the reference vector) and averages_cosine_sim (the reference vector) are aligned by sorting features alphabetically and dropping non-numerical features
  2. Calculate cosine similarity and assign the similarity scores to feat_vec
  3. Sort the dataset by similarity score and remove any songs already in the user's playlist
- Genre based filtering: Top 3 Genre Recommendations
  - Get top 3 genres from the genre_ columns and filter the song recommendations based on these genres
  - Retrieve 45, 30, and 15 song recommendations for the first, second, and third most popular genres, respectively
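The scoring and genre-filtering steps above might look roughly like the sketch below. feat_vec and averages_cosine_sim are the names used in these notes; the function names, the track_id column used to detect songs already in the playlist, and the sample data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def score_tracks(feat_vec, averages_cosine_sim, playlist_ids, id_col="track_id"):
    """Score every track against the reference vector (a pandas Series)."""
    # Keep only numeric features present on both sides, sorted alphabetically
    numeric = feat_vec.select_dtypes(include="number")
    cols = sorted(set(numeric.columns) & set(averages_cosine_sim.index))
    X = numeric[cols].to_numpy(dtype=float)
    ref = averages_cosine_sim[cols].to_numpy(dtype=float)

    # Row-wise cosine similarity; all-zero rows get a score of 0
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(ref)
    sims = np.divide(X @ ref, norms, out=np.zeros(len(X)), where=norms > 0)

    scored = feat_vec.assign(similarity=sims)
    # Remove songs the user already has (assumes a shared id column)
    scored = scored[~scored[id_col].isin(playlist_ids)]
    return scored.sort_values("similarity", ascending=False)

def top_genre_recommendations(scored, playlist_df, quotas=(45, 30, 15)):
    """Take the best-scoring tracks from the playlist's most common genres."""
    genre_cols = [c for c in playlist_df.columns if c.startswith("genre_")]
    top_genres = playlist_df[genre_cols].sum().nlargest(len(quotas)).index
    picks = [scored[scored[g] > 0].head(n) for g, n in zip(top_genres, quotas)]
    return pd.concat(picks)

# Made-up data; quotas are shrunk to (1, 1) because the sample is tiny
feat_vec = pd.DataFrame({
    "track_id": ["a", "b", "c", "d"],
    "danceability": [0.57, 0.10, 0.55, 0.90],
    "energy": [0.52, 0.90, 0.54, 0.10],
    "genre_ambient": [1, 0, 1, 0],
    "genre_gospel": [0, 1, 0, 1],
})
playlist_df = pd.DataFrame({
    "track_id": ["c", "x"],
    "danceability": [0.50, 0.60],
    "energy": [0.50, 0.58],
    "genre_ambient": [1, 1],
    "genre_gospel": [0, 1],
})
averages_cosine_sim = playlist_df.drop(columns=["track_id"]).mean()
scored = score_tracks(feat_vec, averages_cosine_sim, playlist_df["track_id"])
recs = top_genre_recommendations(scored, playlist_df, quotas=(1, 1))
```

Filtering by genre after scoring, rather than before, means the similarity ranking is computed once and then sliced per genre; with the real data the default quotas of 45, 30, and 15 apply.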
  Findings:
  - Most of the top results belong to the black-metal genre, with artists like Thy Catafalque, Oak Pantheon, Abgott, Cough, and Lord Agheros appearing prominently in the top similarities
  - Songs in the black-metal genre generally have low danceability and energy scores. For example, Thy Catafalque's "Fehérvasárnap" has very low danceability (0.0593) and energy (0.00755), while Cough's "Still They Pray" shows a slightly higher energy level (0.319)
  - Ambient songs such as Boards of Canada's "Diving Station", Mokadelic's "Tragic Vodka", and Max Richter's "Spring 2 - 2022" also rank high in the list, showing relatively low energy and danceability, but higher acousticness and instrumentalness
  - Despite being one of the top 3 genres by aggregate count, there are no direct results from the gospel genre in the top song list, possibly indicating fewer matching tracks in the user's preferences.