Details for spotify-archive-analyses-extended.ipynb

Published by gedankenstuecke

Description

The Spotify integration now also collects further metadata on the tracks you listen to, such as the key & mode of the songs and predictions whether songs are live recordings, instrumental, etc. Explore those new data with this notebook!

Tags & Data Sources

music Spotify integration

Notebook
Last updated 2 months, 4 weeks ago

Analyze your Spotify listening history

This Notebook requires you to have data from the Spotify integration in your Open Humans account.

With this notebook we want to look into

  • Which artists do you listen to?
  • Which tracks do you listen to?
  • How much music do you listen to on a given day?
  • How popular is your music taste on Spotify?
  • Do you listen to the same songs for long stretches?

To get started we import the libraries we need and then access your Spotify data:

In [46]:
from ohapi import api
import os
import requests
import json
import pandas as pd
import datetime

member = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
for f in member['data']:
    if f['source'] == 'direct-sharing-176' and f['basename'] == 'spotify-listening-archive.json':
        sp_songs = requests.get(f['download_url'])
    if f['source'] == 'direct-sharing-176' and f['basename'] == 'spotify-track-metadata.json':
        sp_meta = requests.get(f['download_url'])
        
sp_data = json.loads(sp_songs.content)
sp_metadata = json.loads(sp_meta.content)
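
If one of the two files is missing from your account, the loop above leaves `sp_songs` or `sp_meta` undefined and the next lines fail with a `NameError`. A minimal sketch of a more defensive lookup could look like the following. The helper name `find_download_url` and the example file list are made up for illustration; real entries in `member['data']` carry more fields.

```python
def find_download_url(files, source, basename):
    """Return the download URL of the first matching file, or None."""
    for f in files:
        if f.get('source') == source and f.get('basename') == basename:
            return f.get('download_url')
    return None


# Hypothetical stand-in for member['data']; the URL is a placeholder.
example_files = [
    {'source': 'direct-sharing-176',
     'basename': 'spotify-listening-archive.json',
     'download_url': 'https://example.org/archive.json'},
]

archive_url = find_download_url(
    example_files, 'direct-sharing-176', 'spotify-listening-archive.json')
metadata_url = find_download_url(
    example_files, 'direct-sharing-176', 'spotify-track-metadata.json')

if metadata_url is None:
    print('No track metadata file found; has the Spotify integration run yet?')
```

With the real data you would pass `member['data']` instead of `example_files` and only call `requests.get` on URLs that are not `None`.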

Now that we have all of your data we want to transform the rather complex Spotify JSON format into something that is easier to read: a simple table, also called a dataframe. The lines below do this:

In [78]:
track_title = []
artist_name = []
album_name = []
played_at = []
popularity = []
duration_ms = []
explicit = []
track_id = []

danceability = []
energy = [] 
key = [] 
loudness = [] 
mode = []
speechiness = []
acousticness = []
instrumentalness = []
liveness = []
valence = []

# Audio features taken from the track metadata:
# ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence']


key_translation = {
    '0': "C",
    "1": "C#",
    "2": "D",
    "3": "D#",
    "4": "E", 
    "5": "F",
    "6": "F#",
    "7": "G",
    "8": "G#",
    "9": "A",
    "10": "A#",
    "11": "B"
}

mode_translation = {'1':'major', '0': 'minor'}

for sp in sp_data:
    track_title.append(sp['track']['name'])
    artist_name.append(sp['track']['artists'][0]['name'])
    album_name.append(sp['track']['album']['name'])
    played_at.append(sp['played_at'])
    popularity.append(sp['track']['popularity'])
    duration_ms.append(sp['track']['duration_ms'])
    explicit.append(sp['track']['explicit'])
    track_id.append(sp['track']['id'])
    danceability.append(sp_metadata[sp['track']['id']]['danceability'])
    energy.append(sp_metadata[sp['track']['id']]['energy'])
    key.append(key_translation[str(sp_metadata[sp['track']['id']]['key'])])
    loudness.append(sp_metadata[sp['track']['id']]['loudness'])
    mode.append(mode_translation[str(sp_metadata[sp['track']['id']]['mode'])])
    speechiness.append(sp_metadata[sp['track']['id']]['speechiness'])
    acousticness.append(sp_metadata[sp['track']['id']]['acousticness'])
    instrumentalness.append(sp_metadata[sp['track']['id']]['instrumentalness'])
    liveness.append(sp_metadata[sp['track']['id']]['liveness'])
    valence.append(sp_metadata[sp['track']['id']]['valence'])
    
def parse_timestamp(lst):
    timestamps = []
    for item in lst:
        try:
            timestamp = datetime.datetime.strptime(
                            item,
                            '%Y-%m-%dT%H:%M:%S.%fZ')
        except ValueError:
            timestamp = datetime.datetime.strptime(
                    item,
                    '%Y-%m-%dT%H:%M:%SZ')
        timestamps.append(timestamp)
    return timestamps
    
played_at = parse_timestamp(played_at)

dataframe = pd.DataFrame(data={
    'track_id': track_id,
    'track': track_title,
    'artist': artist_name,
    'album': album_name,
    'popularity': popularity,
    'duration_ms': duration_ms,
    'explicit': explicit,
    'played_at': played_at,
    'danceability': danceability,
    'energy': energy,
    'key': key,
    'loudness': loudness,
    'mode': mode,
    'speechiness': speechiness,
    'acousticness': acousticness,
    'instrumentalness': instrumentalness,
    'liveness': liveness,
    'valence': valence

})
dataframe = dataframe.set_index(dataframe['played_at'])
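
The loop above assumes that every track ID in the listening archive also shows up in the metadata file. If that is not the case for your data, the lookup raises a `KeyError`. As a rough sketch of an alternative, the function below builds one flat row per play and fills missing features with `None`. The function name `flatten_play` and the sample records are invented for this example and only mimic the structure used above.

```python
import pandas as pd

FEATURES = ['danceability', 'energy', 'key', 'loudness', 'mode',
            'speechiness', 'acousticness', 'instrumentalness',
            'liveness', 'valence']


def flatten_play(play, metadata):
    """Turn one play record into a flat dict; missing features become None."""
    track = play['track']
    features = metadata.get(track['id']) or {}
    row = {
        'track_id': track['id'],
        'track': track['name'],
        'artist': track['artists'][0]['name'],
        'played_at': play['played_at'],
    }
    for name in FEATURES:
        row[name] = features.get(name)
    return row


# Made-up sample records, not real archive data.
sample_plays = [
    {'played_at': '2018-10-17T22:28:10.219Z',
     'track': {'id': 'a1', 'name': 'Song A', 'artists': [{'name': 'Artist A'}]}},
    {'played_at': '2018-10-17T22:31:45Z',
     'track': {'id': 'b2', 'name': 'Song B', 'artists': [{'name': 'Artist B'}]}},
]
sample_metadata = {'a1': {'danceability': 0.5, 'key': 9, 'mode': 1}}

rows = [flatten_play(p, sample_metadata) for p in sample_plays]
sample_df = pd.DataFrame(rows)
```

Tracks without metadata then show up as missing values in the feature columns instead of stopping the whole cell.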

We can now look at some example rows of the dataframe:

In [3]:
dataframe.head()
Out[3]:
played_at                track_id                track                     artist         album        popularity  duration_ms  explicit  played_at
2018-10-17 22:28:10.219  2YxLFAW82UprL82e9brViV  Ordinary Man              Christy Moore  On the Road  46          235346       False     2018-10-17 22:28:10.219
2018-10-17 22:31:45.194  37xwNDTrrOxUGiREjPV8os  Ride On                   Christy Moore  On the Road  34          215080       False     2018-10-17 22:31:45.194
2018-10-17 22:36:10.358  7r8uwyBaQG3k7ISC9GHX8e  Joxer Goes to Stuttgart   Christy Moore  On the Road  35          265146       False     2018-10-17 22:36:10.358
2018-10-17 22:39:26.908  2xsMuhrWv7FBEQczFAZFkY  Black Is the Colour       Christy Moore  On the Road  34          196653       False     2018-10-17 22:39:26.908
2018-10-17 22:44:28.672  1G23TVzQG1rMOXzMLqGJbE  Don't Forget Your Shovel  Christy Moore  On the Road  36          301706       False     2018-10-17 22:44:28.672

We have the time at which we listened to a song, the title of the song and the album on which it was released, along with the artist who made the recording. Furthermore we can see whether the song features explicit lyrics, its popularity and a unique track ID (which we can use to identify whether we listened to the same song more than once).

Plotting and Analyzing our Spotify archive

Average song length

We start out by having a look at how long the songs we listen to are. For that we plot the distribution of song lengths found in the archive - along with the average song length:

In [60]:
%load_ext rpy2.ipython
In [5]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(duration_ms/1000/60)) + 
    geom_histogram(binwidth=0.3) + 
    scale_x_continuous('song length in minutes') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$duration_ms/1000/60),color='red') + ggtitle('red bar is average length')

We see that in my case songs on average clock in at a bit over 4 minutes in length. There is, though, an interesting peak in the data below that. We will look into that later on!
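
If you prefer numbers over reading them off the plot, the same summary can be sketched in pandas. The durations below are made-up stand-ins for `dataframe['duration_ms']`. Comparing the mean with the median can hint at whether a few very long or very often played songs skew the picture.

```python
import pandas as pd

# Made-up durations in milliseconds, standing in for dataframe['duration_ms'].
durations_ms = pd.Series([235346, 215080, 265146, 196653, 301706])

# Convert milliseconds to minutes, as in the plot above.
minutes = durations_ms / 1000 / 60
mean_minutes = minutes.mean()
median_minutes = minutes.median()
print('mean: {:.2f} min, median: {:.2f} min'.format(mean_minutes, median_minutes))
```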

When do we listen to music?

Let us have a look at when we listen to music. Unfortunately the data from Spotify isn't timezone aware and reports the time in UTC by default. Outside of daylight saving time (DST) this is the same time as in London. During DST it is one hour behind London time.

Due to this the best we can do is to approximate the timezone. To do this we can adjust the UTC_OFFSET variable at the top of the next cell. In my case the predominant timezone for the archive is California, so I set it to UTC_OFFSET <- -7. If you are in a different timezone adjust this accordingly.

In [6]:
%%R -w 10 -h 2 --units in -r 200

UTC_OFFSET <- -7

UTC_OFFSET <- UTC_OFFSET*60*60
dataframe$time <- as.numeric(format(dataframe$played_at+UTC_OFFSET, format = "%H"))

ggplot(dataframe,aes(time)) + 
    geom_histogram(binwidth=1) +
    scale_x_continuous(limit=c(0,24),'hour') + 
    theme_minimal() +
    ggtitle('When do I listen to music?')
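
A fixed offset like the one above ignores daylight saving time. If most of your listening happened in one place, a named timezone may be a closer approximation. The sketch below uses the pandas `dt.tz_localize` and `dt.tz_convert` accessors on made-up timestamps. The zone `America/Los_Angeles` is just an example, so replace it with your own.

```python
import pandas as pd

# Made-up naive UTC timestamps, standing in for dataframe['played_at'].
played_at = pd.Series(pd.to_datetime([
    '2018-07-01 19:00:00',   # summer: Pacific Daylight Time, UTC-7
    '2018-12-01 19:00:00',   # winter: Pacific Standard Time, UTC-8
]))

# Mark the timestamps as UTC, then convert them to a named zone.
local = played_at.dt.tz_localize('UTC').dt.tz_convert('America/Los_Angeles')
local_hours = local.dt.hour.tolist()
```

This still cannot account for travel, as the archive does not record where a song was played.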

In my case the data contains lots of listening done in Europe, which is why there are songs played in the middle of the night from midnight to 5am. But besides this there are two peaks, from around 9am to noon and from 1pm to 4pm, which is when I'm usually in the office and listening to music.

We have seen that the Spotify archive contains some details on how popular individual tracks are. Let's have a look at whether the songs I listen to have become more or less popular over time:

In [12]:
%%R -i dataframe
library(ggplot2)
df2 <- aggregate(dataframe$popularity,by=list(as.Date(dataframe$played_at)),FUN=mean)

ggplot(df2,aes(Group.1, x)) + 
    geom_point() + stat_smooth(method='glm') + theme_minimal() + scale_x_date('date') + scale_y_continuous('average popularity')
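
For those who would rather stay in Python, a rough pandas counterpart of the daily aggregation is sketched below. The sample values are invented. With the real data you would use `dataframe['popularity']`, which is already indexed by `played_at`.

```python
import pandas as pd

# Made-up popularity scores indexed by the time of each play.
sample = pd.DataFrame(
    {'popularity': [46, 34, 35, 60]},
    index=pd.to_datetime(['2018-10-17 22:28:10', '2018-10-17 22:31:45',
                          '2018-10-18 08:00:00', '2018-10-18 09:00:00']))

# Average popularity per calendar day (UTC); days without plays are dropped.
daily_popularity = sample['popularity'].resample('D').mean().dropna()
```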

While there is some small downward trend, this seems more an artifact of the last few days where the average is a bit lower and not some systematic drop in popularity.

But this raises the question: which artists do I listen to most? Let's have a look at the artists most played in my own Spotify data:

In [13]:
%%R -w 5 -h 2 --units in -r 200

dataframe <- within(dataframe, 
                   artist <- factor(artist, 
                                      levels=names(sort(table(artist), 
                                                        decreasing=TRUE))))

filter_list = as.data.frame(summary(dataframe$artist, max=12))

# theme_minimal() goes first; applied last it would reset the axis text settings
ggplot(subset(dataframe,dataframe$artist %in% rownames(filter_list)),aes(artist)) + 
    geom_bar() + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) + 
    coord_flip()

My own artist top list is dominated by Typhoon, which Spotify classifies as a Portland, OR-based Indie Rock band. From there on the listening counts drop sharply for the other artists and flatten out.

Most played Tracks

Let's now look into the most played individual tracks:

In [14]:
%%R -w 5 -h 2 --units in -r 200

dataframe <- within(dataframe, 
                   track <- factor(track, 
                                      levels=names(sort(table(track), 
                                                        decreasing=TRUE))))

filter_list = as.data.frame(summary(dataframe$track, max=10))

ggplot(subset(dataframe,dataframe$track %in% rownames(filter_list)),aes(track)) + 
    geom_bar() + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) + 
    coord_flip()

That picture looks just as skewed, with the track Paper Forest by Emmy The Great topping the list. And if you look closely you'll see that this song alone explains all the plays for this artist as shown in the figure above! Similarly, the song Long Way From Home by The Lumineers accounts for most of the plays of this artist.
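
How much of an artist's play count comes from a single song can also be computed directly. The following is a small sketch on invented data. With the real data you would use the `artist` and `track` columns of the Python dataframe.

```python
import pandas as pd

# Made-up plays: artist A has one song on repeat, artist B has two songs.
plays = pd.DataFrame({
    'artist': ['A', 'A', 'A', 'B', 'B', 'B'],
    'track':  ['x', 'x', 'x', 'y', 'y', 'z'],
})

# Plays per (artist, track), then the most played track of each artist.
plays_per_track = plays.groupby(['artist', 'track']).size()
top_track_plays = plays_per_track.groupby(level='artist').max()

# Share of each artist's plays that the top track accounts for.
artist_plays = plays.groupby('artist').size()
top_track_share = top_track_plays / artist_plays
```

A share close to 1 means that a single song explains nearly all plays of that artist.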

Now we can wonder: could the song Paper Forest explain the bump in the song-length distribution further above? Let's have a look:

In [10]:
paper_forest = dataframe[dataframe['track'] == 'Paper Forest (In the Afterglow of Rapture)']
if len(paper_forest):
    # .iloc[0] picks the first match by position, as the index holds timestamps
    print("The song has a length of {} minutes".format(
        paper_forest['duration_ms'].iloc[0]/1000/60))

Yep, that fits the peak at less than 4 minutes in the song-length distribution pretty well!

My listening behaviour over time

Let's now investigate whether I listen more or less to music over time by plotting the number of songs played on a given day:

In [11]:
%%R -w 3 -h 3 --units in -r 200

dataframe$date <- as.Date(dataframe$played_at)

ggplot(dataframe,aes(date)) + 
    geom_histogram(binwidth=1) + theme_minimal()

There are two things going on: First of all there is not much data for much of September, which makes sense, as I was traveling a lot during that time and didn't have much time to listen to music during conferences etc.

But there is also a pattern to be seen at the beginning of September, with days with lots of plays being interrupted by days with very little music consumption. Could this be an effect of weekdays vs. weekends?

In [12]:
%%R -w 4 -h 2 --units in -r 200
df2 <- aggregate(dataframe$duration_ms/1000/60/60,by=list(dataframe$date),FUN=sum)
library(lubridate)

df2$weekday <- wday(df2$Group.1, label=TRUE)
df2$weekend <- df2$weekday %in% c('Sun','Sat')


ggplot(df2,aes(x=weekend,y=x)) + 
    geom_violin() + theme_minimal() + scale_x_discrete('is it a weekend day?') + scale_y_continuous('hours of music played')

There seems to be at least a small effect that I play more music during weekdays (FALSE in the plot above) compared to weekends.
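
To put a number on that impression, the comparison can be sketched in pandas as well. The sample below is invented. Days without any plays are left out, which roughly matches the aggregation in the R cell above.

```python
import numpy as np
import pandas as pd

# Made-up plays, standing in for dataframe['duration_ms'] indexed by played_at.
sample = pd.DataFrame(
    {'duration_ms': [3600000, 1800000, 7200000]},
    index=pd.to_datetime(['2018-10-19 10:00:00',    # a Friday
                          '2018-10-20 10:00:00',    # a Saturday
                          '2018-10-22 10:00:00']))  # a Monday

# Hours of music per day, keeping only days with at least one play.
hours_per_day = (sample['duration_ms'] / 1000 / 60 / 60).resample('D').sum()
hours_per_day = hours_per_day[hours_per_day > 0]

# dayofweek counts from Monday=0, so 5 and 6 are Saturday and Sunday.
day_type = np.where(hours_per_day.index.dayofweek >= 5, 'weekend', 'weekday')
median_hours = hours_per_day.groupby(day_type).median()
```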

What's up with all the songs that are played so often?

Let's try to see whether repeat one can explain why the two songs we saw above have been played so much more often than the other ones.

For that we use the unique track ID that Spotify gives to each song and check whether the song played just before had the same track ID. If so we store this repetition and can then plot the data later on.

Below we create the repeat table of songs that have been played more than once in a row, along with the times and dates this happened:

In [13]:
ids = []
played_at = []
artist_track = []

last_id = None
for i in sp_data:
    # keep a play only if the track played just before had the same ID
    if last_id == i['track']['id']:
        played_at.append(i['played_at'])
        ids.append(i['track']['id'])
        artist_track.append(i['track']['artists'][0]['name'] + ' - ' + i['track']['name'])
    last_id = i['track']['id']
    
# reuse parse_timestamp from above, which also handles timestamps without fractional seconds
played_at = parse_timestamp(played_at)

repeats = pd.DataFrame(data={
    'track_id': ids,
    'artist_title': artist_track,
    'played_at': played_at}
)
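
The same idea can be expressed without an explicit loop by comparing each track ID with the one in the row before it. The sketch below runs on made-up IDs and assumes the plays are in chronological order. It also counts how long each run of identical songs is, which the loop above does not do.

```python
import pandas as pd

# Made-up sequence of track IDs in the order they were played.
plays = pd.DataFrame({
    'track_id': ['a', 'b', 'b', 'b', 'c', 'a'],
})

# True where a play has the same track ID as the play directly before it.
plays['is_repeat'] = plays['track_id'].eq(plays['track_id'].shift())

# Every non-repeat starts a new run; the cumulative sum labels the runs.
run_id = (~plays['is_repeat']).cumsum()
plays['run_length'] = plays.groupby(run_id)['track_id'].transform('count')
longest_run = plays['run_length'].max()
```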

Now we can plot this data. On the X axis we keep the date/time of when the song was played and on the Y axis we have the different songs that were repeated at least once. The plot then shows us when these repeats happened:

In [14]:
%%R -i repeats -w 15 -h 6 --units in -r 500
library(ggplot2)
head(repeats$played_at)
ggplot(repeats, aes(x=played_at,y=artist_title)) + 
    #geom_line() + 
    geom_point() + 
    theme_minimal() +
    theme(axis.text=element_text(size=13)) + 
    scale_x_datetime('played at') + 
    scale_y_discrete('artist - tracktitle')

And indeed, we can see that both The Lumineers' song Long Way From Home and Emmy The Great's Paper Forest have ramped up their counts by being played on Repeat One.

And at least for The Lumineers my calendar provides an easy explanation: I fell asleep on my flight from San Francisco to Frankfurt without ever turning off Spotify. 😂

Characteristics of the songs I listen to

Spotify not only gives you metadata such as how long songs are, but also provides some automatic classifications of those songs. Alongside more traditional musical dimensions, like which key a song is in, whether it is in major or minor mode and how loud the recording is, Spotify also provides some further characteristics. These include the danceability of a song, its energy and valence (basically a mood score), as well as scores for whether the song is instrumental, a live recording or an acoustic track.

We can dig into those characteristics. Let's start with the classical measures: key, mode and volume.

Key

In [87]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(key)) + 
    geom_bar() + 
    scale_x_discrete('key') + 
    theme_minimal()

Mode

In [90]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(mode)) + 
    geom_bar() + 
    scale_x_discrete('mode') + 
    theme_minimal()
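
Key and mode can also be looked at together, for example to see which keys show up mostly in minor. The sketch below uses `pd.crosstab` on made-up values. With the real data you would pass the `key` and `mode` columns of the Python dataframe.

```python
import pandas as pd

# Made-up key and mode labels, in the translated form used above.
sample = pd.DataFrame({
    'key':  ['C', 'C', 'A', 'G', 'A'],
    'mode': ['major', 'major', 'minor', 'major', 'minor'],
})

# Rows are keys, columns are modes, cells count the plays.
key_mode_counts = pd.crosstab(sample['key'], sample['mode'])
```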

Volume / Loudness

This measures the average volume of a track in decibels (dB). Values typically range between -60 and 0 dB.

In [92]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(loudness)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('loudness') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$loudness),color='red') + ggtitle('red bar is average')

Other metrics

Danceability

The score ranges from 0-1, with 0 being least danceable and 1 being most danceable.

In [93]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(danceability)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('danceability') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$danceability),color='red') + ggtitle('red bar is average')

Acousticness

The score ranges from 0-1, with 0 being least likely to be an acoustic track and 1 being most likely.

In [94]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(acousticness)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('acousticness') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$acousticness),color='red') + ggtitle('red bar is average')

Liveness

This tries to predict whether a song is a live recording or not. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 indicates a strong likelihood that the track is live.

In [95]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(liveness)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('liveness') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$liveness),color='red') + ggtitle('red bar is average')
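
The 0.8 threshold mentioned above can be turned into a quick estimate of how much of your listening is probably live recordings. This is a small sketch on invented scores. With the real data you would use `dataframe['liveness']`.

```python
import pandas as pd

# Made-up liveness scores, one per play.
liveness = pd.Series([0.05, 0.12, 0.85, 0.30, 0.93])

# Plays above the 0.8 threshold are treated as likely live recordings.
likely_live = liveness > 0.8
share_live = likely_live.mean()
```

Since the scores are predictions, the resulting share is an estimate and not a count of confirmed live tracks.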