spotify-archive-analyses.ipynb
This notebook performs some analyses on a member's Spotify archive that has been compiled over time. Among other things it looks into the top artists, their popularity, and when the member listened to music.
And indeed, we can see that both The Lumineers' song Long Way From Home and Emmy The Great's Paper Forest have ramped up their play counts by being played on Repeat One.
And at least for The Lumineers my calendar provides an easy explanation: I fell asleep on my flight from San Francisco to Frankfurt without ever turning off Spotify. 😂
What did you find in your own data?
This Notebook requires you to have data from the Spotify integration in your Open Humans account.
With the notebook we want to look into
To get started we import the libraries we need and then access your Spotify data:
from ohapi import api
import os
import requests
import json
import pandas as pd
import datetime
member = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
for f in member['data']:
    if f['source'] == 'direct-sharing-176':
        sp = requests.get(f['download_url'])
sp_data = json.loads(sp.content)
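The file selection above can be tried out without network access. The following is a minimal sketch, assuming each file entry is a dict with `source` and `download_url` keys as in the loop above. The helper name `find_download_url` and the example URLs are made up for illustration. Like the loop above, it keeps the last matching file, but it returns `None` instead of failing later when no Spotify file is present.

```python
def find_download_url(files, source):
    """Return the download URL of the last file whose 'source' matches, or None."""
    url = None
    for f in files:
        if f.get('source') == source:
            url = f.get('download_url')
    return url


# Made-up file entries that mimic the fields used in the loop above.
example_files = [
    {'source': 'direct-sharing-1', 'download_url': 'https://example.org/other.json'},
    {'source': 'direct-sharing-176', 'download_url': 'https://example.org/spotify.json'},
]

spotify_url = find_download_url(example_files, 'direct-sharing-176')
missing_url = find_download_url(example_files, 'direct-sharing-999')
print(spotify_url)
print(missing_url)
```

Checking for `None` before calling `requests.get` would give a clearer message to members who have not connected the Spotify integration yet.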
Now that we have all of your data, we want to transform the rather complex Spotify JSON format into something that is easier to read: a simple table, also called a dataframe. The lines below do this:
track_title = []
artist_name = []
album_name = []
played_at = []
popularity = []
duration_ms = []
explicit = []
track_id = []
for sp in sp_data:
    track_title.append(sp['track']['name'])
    artist_name.append(sp['track']['artists'][0]['name'])
    album_name.append(sp['track']['album']['name'])
    played_at.append(sp['played_at'])
    popularity.append(sp['track']['popularity'])
    duration_ms.append(sp['track']['duration_ms'])
    explicit.append(sp['track']['explicit'])
    track_id.append(sp['track']['id'])
def parse_timestamp(lst):
    timestamps = []
    for item in lst:
        try:
            timestamp = datetime.datetime.strptime(
                item,
                '%Y-%m-%dT%H:%M:%S.%fZ')
        except ValueError:
            timestamp = datetime.datetime.strptime(
                item,
                '%Y-%m-%dT%H:%M:%SZ')
        timestamps.append(timestamp)
    return timestamps
played_at = parse_timestamp(played_at)
dataframe = pd.DataFrame(data={
    'track_id': track_id,
    'track': track_title,
    'artist': artist_name,
    'album': album_name,
    'popularity': popularity,
    'duration_ms': duration_ms,
    'explicit': explicit,
    'played_at': played_at})
dataframe = dataframe.set_index(dataframe['played_at'])
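To make the nesting of the JSON format more concrete, here is a small sketch that flattens a single play record into one table row. The record structure is inferred from the keys accessed in the loop above. The helper `flatten_play` and all example values are made up for illustration and are not part of the archive.

```python
def flatten_play(play):
    """Turn one nested play record into a flat dict (one future table row)."""
    track = play['track']
    return {
        'track_id': track['id'],
        'track': track['name'],
        'artist': track['artists'][0]['name'],  # first listed artist only
        'album': track['album']['name'],
        'popularity': track['popularity'],
        'duration_ms': track['duration_ms'],
        'explicit': track['explicit'],
        'played_at': play['played_at'],
    }


# Made-up record with the same nesting as the fields used above.
example_play = {
    'played_at': '2018-09-01T12:34:56.789Z',
    'track': {
        'id': 'abc123',
        'name': 'Example Song',
        'artists': [{'name': 'Example Artist'}, {'name': 'Guest Artist'}],
        'album': {'name': 'Example Album'},
        'popularity': 42,
        'duration_ms': 240000,
        'explicit': False,
    },
}

row = flatten_play(example_play)
print(row['artist'], row['duration_ms'] / 1000 / 60)
```

Note that only the first artist of a track is kept, so songs with guest artists are attributed to the first listed artist alone.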
We can now look at the dataframe and at some of the example data we have:
dataframe.head()
We have the time at which we listened to a song, the title of the song, the album on which it was released, and the artist that made the recording. Furthermore, we can see whether the song features explicit lyrics, its popularity, and a unique track ID (which we can use to identify whether we listened to the same song more than once).
We start out by having a look at how long the songs we listen to are. For that we plot the distribution of song lengths found in the archive - along with the average song length:
%load_ext rpy2.ipython
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(duration_ms/1000/60)) +
  geom_histogram(binwidth=0.3) +
  scale_x_continuous('song length in minutes') +
  theme_minimal() +
  # use the full column name instead of relying on partial matching of $duration
  geom_vline(xintercept=mean(dataframe$duration_ms/1000/60),color='red') +
  ggtitle('red bar is average length')
We see that in my case songs clock in at a bit over 4 minutes on average. Though there is an interesting peak in the data below that. We will look into that later on!
Let us have a look at when we listen to music. Unfortunately the data from Spotify isn't timezone aware and reports the time in UTC by default. Outside of daylight saving time (DST) this is the same time as in London. During DST it is one hour behind London time.
Due to this the best we can do is to approximate the timezone. To do this we can adjust the UTC_OFFSET variable at the top of the next cell. In my case the predominant timezone for the archive is California, so I set it to UTC_OFFSET <- -7. If you are in a different timezone, adjust this accordingly.
%%R -w 10 -h 2 --units in -r 200
UTC_OFFSET <- -7
UTC_OFFSET <- UTC_OFFSET*60*60
dataframe$time <- as.numeric(format(dataframe$played_at+UTC_OFFSET, format = "%H"))
ggplot(dataframe,aes(time)) +
  geom_histogram(binwidth=1) +
  scale_x_continuous(limit=c(0,24),'hour') +
  theme_minimal() +
  ggtitle('When do I listen to music? (not time zone aware, estimated correct time zone)')
In my case the data contains lots of listening done in Europe, which is why there are songs played in the middle of the night from midnight to 5am. But besides this there are two peaks, from around 9am to noon and from 1pm to 4pm, which is when I'm usually in the office and listening to music.
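For readers who prefer to stay in Python, the fixed-offset approach used in the cell above can be sketched as follows. This is only an illustration under the same assumption as the R code: a single constant offset for the whole archive, which ignores DST and any travel between timezones. The function name `plays_per_local_hour` and the example timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

UTC_OFFSET_HOURS = -7  # assumption: same fixed offset as in the R cell


def plays_per_local_hour(utc_timestamps, offset_hours):
    """Count plays per hour of day after shifting naive UTC times by a fixed offset."""
    shift = timedelta(hours=offset_hours)
    return Counter((ts + shift).hour for ts in utc_timestamps)


# Made-up naive UTC timestamps, like the parsed played_at values.
example_times = [
    datetime(2018, 9, 1, 16, 5),   # 09:05 at UTC-7
    datetime(2018, 9, 1, 16, 50),  # 09:50 at UTC-7
    datetime(2018, 9, 2, 3, 15),   # 20:15 on the previous day at UTC-7
]

hour_counts = plays_per_local_hour(example_times, UTC_OFFSET_HOURS)
print(sorted(hour_counts.items()))
```

Adding a `timedelta` to the full timestamp, rather than to the hour alone, keeps the wrap-around past midnight correct.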
We have seen that the Spotify archive contains some details on how popular individual tracks are. Let's have a look at whether the songs I listen to become more or less popular over time:
%%R -i dataframe
df2 <- aggregate(dataframe$popularity,by=list(as.Date(dataframe$played_at)),FUN=mean)
ggplot(df2,aes(Group.1, x)) +
geom_point() + stat_smooth(method='glm') + theme_minimal() + scale_x_date('date') + scale_y_continuous('average popularity')
While there is some small downward trend, this seems more an artifact of the last few days where the average is a bit lower and not some systematic drop in popularity.
But this raises the question: which artists do I listen to most? Let's have a look at the most played artists in my own Spotify data:
%%R -w 5 -h 2 --units in -r 200
dataframe <- within(dataframe,
artist <- factor(artist,
levels=names(sort(table(artist),
decreasing=TRUE))))
filter_list = as.data.frame(summary(dataframe$artist, max=12))
ggplot(subset(dataframe,dataframe$artist %in% rownames(filter_list)),aes(artist)) +
  geom_bar() +
  theme_minimal() +
  # custom text settings go after theme_minimal() so they are not overridden
  theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) +
  coord_flip()
My own artist top list is dominated by Typhoon, which Spotify classifies as a Portland, OR-based Indie Rock band. From there on the listening counts drop sharply for the other artists and flatten out.
Let's now look into the most played individual tracks:
%%R -w 5 -h 2 --units in -r 200
dataframe <- within(dataframe,
track <- factor(track,
levels=names(sort(table(track),
decreasing=TRUE))))
filter_list = as.data.frame(summary(dataframe$track, max=10))
ggplot(subset(dataframe,dataframe$track %in% rownames(filter_list)),aes(track)) +
geom_bar() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) +
coord_flip()
That picture looks just as skewed, with the track Paper Forest by Emmy The Great topping the list. And if you look closely you'll see that this song alone explains all the plays for this artist shown in the figure above! Similarly, the song Long Way From Home by The Lumineers accounts for most of this artist's plays.
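The observation that a single track can account for most of an artist's plays can also be checked numerically. The sketch below uses made-up `(artist, track)` pairs in place of the real dataframe columns and counts plays with `collections.Counter`. It then computes the share of the top artist's plays that come from the top track.

```python
from collections import Counter

# Made-up (artist, track) plays; the real values would come from the dataframe.
example_plays = [
    ('Artist A', 'Song 1'), ('Artist A', 'Song 1'), ('Artist A', 'Song 1'),
    ('Artist A', 'Song 2'), ('Artist B', 'Song 3'),
]

track_counts = Counter(example_plays)
artist_counts = Counter(artist for artist, _ in example_plays)

# Most played track and the fraction of its artist's plays it accounts for.
(top_artist, top_track), top_plays = track_counts.most_common(1)[0]
share = top_plays / artist_counts[top_artist]
print(top_artist, top_track, top_plays, share)
```

A share close to 1 means the artist appears in the top list almost entirely because of one song, as is the case for Emmy The Great in my data.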
Now we can wonder: can the song Paper Forest explain the bump in the song-length distribution further above? Let's have a look:
paper_forest = dataframe[dataframe['track'] == 'Paper Forest (In the Afterglow of Rapture)']
if len(paper_forest):
    # .iloc[0] picks the first matching play by position (the index holds timestamps)
    print("The song has a length of {} minutes".format(
        paper_forest['duration_ms'].iloc[0]/1000/60))
Yep, that fits the peak at less than 4 minutes in the song-length distribution pretty well!
Let's now investigate whether I listen more or less to music over time by plotting the number of songs played on a given day:
%%R -w 3 -h 3 --units in -r 200
dataframe$date <- as.Date(dataframe$played_at)
ggplot(dataframe,aes(date)) +
geom_histogram(binwidth=1) + theme_minimal()
There are two things going on: First of all, there is not much data for much of September, which makes sense, as I was traveling a lot during that time and didn't have much time to listen to music during conferences etc.
But there is also some trend to be seen at the beginning of September, with days with lots of plays being interrupted by days with very little music consumption. Could this be an effect of weekdays vs. weekends?
%%R -w 4 -h 2 --units in -r 200
df2 <- aggregate(dataframe$duration_ms/1000/60/60,by=list(dataframe$date),FUN=sum)
library(lubridate)
df2$weekday <- wday(df2$Group.1, label=TRUE)
df2$weekend <- df2$weekday %in% c('Sun','Sat')
ggplot(df2,aes(x=weekend,y=x)) +
  geom_violin() + theme_minimal() + scale_x_discrete('is it a weekend day?') + scale_y_continuous('hours of music played')
There seems to be at least a small effect that I play more music during weekdays (FALSE in the plot above) compared to weekends.
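The weekday versus weekend comparison can be sketched in plain Python as well. As in the R cell, this assumes that every track was played for its full `duration_ms`, which the archive does not confirm. The function names and the example plays are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hours_per_day(plays):
    """Sum played hours per calendar date from (played_at, duration_ms) pairs."""
    totals = defaultdict(float)
    for played_at, duration_ms in plays:
        totals[played_at.date()] += duration_ms / 1000 / 60 / 60
    return dict(totals)


def split_by_weekend(daily_hours):
    """Group daily totals by whether the date falls on a Saturday or Sunday."""
    groups = {True: [], False: []}
    for day, hours in daily_hours.items():
        groups[day.weekday() >= 5].append(hours)  # Monday is 0, Sunday is 6
    return groups


# Made-up plays: two on a Monday, one on a Saturday.
example_plays = [
    (datetime(2018, 9, 3, 10, 0), 3_600_000),  # Monday, one hour
    (datetime(2018, 9, 3, 11, 0), 1_800_000),  # Monday, half an hour
    (datetime(2018, 9, 8, 15, 0), 900_000),    # Saturday, a quarter of an hour
]

groups = split_by_weekend(hours_per_day(example_plays))
print(groups)
```

The two lists in `groups` correspond to the two violins in the plot: `False` for weekdays and `True` for weekend days.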
Let's try to see whether Repeat One can explain why the two songs we saw above have been played so much more often than the other ones.
For that we use the unique track ID that Spotify gives to each song and check whether the song played just before had the same track ID. If so we store this repetition and can then plot the data later on.
Below we create the repeats table of songs that have been played more than once in a row, along with the times and dates this happened:
ids = []
played_at = []
artist_track = []
last_id = ''
for i in sp_data:
    # keep a play if it has the same track ID as the play just before it
    # (the very first play is also kept, because last_id starts out empty)
    if last_id == i['track']['id'] or last_id == '':
        played_at.append(i['played_at'])
        ids.append(i['track']['id'])
        artist_track.append(i['track']['artists'][0]['name'] + ' - ' + i['track']['name'])
    last_id = i['track']['id']
# reuse parse_timestamp from above, which also handles timestamps without fractional seconds
played_at = parse_timestamp(played_at)
repeats = pd.DataFrame(data={
    'track_id': ids,
    'artist_title': artist_track,
    'played_at': played_at}
)
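As an alternative view on the same idea, consecutive plays of the same track ID can be grouped into runs with `itertools.groupby`, which gives the length of each repeat streak directly. This is a sketch on made-up track IDs, not the method used in the cell above, which stores each repeated play individually. The function name `consecutive_repeats` is hypothetical.

```python
from itertools import groupby


def consecutive_repeats(track_ids):
    """Return (track_id, run_length) for every run of two or more identical IDs in a row."""
    runs = []
    for track_id, group in groupby(track_ids):
        length = sum(1 for _ in group)
        if length > 1:
            runs.append((track_id, length))
    return runs


# Made-up play history in order: 'b' is played three times in a row, 'a' twice at the end.
example_ids = ['a', 'b', 'b', 'b', 'c', 'a', 'a']

repeat_runs = consecutive_repeats(example_ids)
print(repeat_runs)
```

Because `groupby` only merges neighbouring items, the first single play of `'a'` is not counted as part of the later run of `'a'`.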
Now we can plot this data. On the X axis we keep the date/time of when the song was played and on the Y axis we have the different songs that were repeated at least once. The plot then shows us when these repeats happened:
%%R -i repeats -w 15 -h 4 --units in -r 500
library(ggplot2)
head(repeats$played_at)
ggplot(repeats, aes(x=played_at,y=artist_title)) +
#geom_line() +
geom_point() +
theme_minimal() +
theme(axis.text=element_text(size=13)) +
scale_x_datetime('played at') +
scale_y_discrete('artist - tracktitle')