Details for spotify-archive-analyses.ipynb

Published by gedankenstuecke


This notebook performs some analyses on a member's Spotify archive that has been compiled over time. Among other things it looks into the top artists, their popularity, and when the member listened to music.


Tags & Data Sources

music listening behaviour Spotify integration



Last updated 1 year, 4 months ago

Analyze your Spotify listening history

This Notebook requires you to have data from the Spotify integration in your Open Humans account.

With the notebook we want to look into

  • Which artists do you listen to?
  • Which tracks do you listen to?
  • How much do you listen to music on a given day?
  • How popular is your music taste on Spotify?
  • Do you listen to the same songs for long stretches?

To get started we import some libraries we need and then access your Spotify data:

In [1]:
from ohapi import api
import os
import requests
import json
import pandas as pd
import datetime

member = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
for f in member['data']:
    if f['source'] == 'direct-sharing-176':
        sp = requests.get(f['download_url'])
sp_data = json.loads(sp.content)

Now that we have all of your data, we want to transform the rather complex Spotify JSON format into something that is easier to read: a simple table, also called a dataframe. The lines below do this:

In [2]:
track_title = []
artist_name = []
album_name = []
played_at = []
popularity = []
duration_ms = []
explicit = []
track_id = []
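for sp in sp_data:
    # Field paths assume Spotify's play-history layout: a 'played_at' string
    # plus a nested 'track' object (the repeat-detection cell further below
    # reads the same keys).
    track_title.append(sp['track']['name'])
    artist_name.append(sp['track']['artists'][0]['name'])
    album_name.append(sp['track']['album']['name'])
    played_at.append(sp['played_at'])
    popularity.append(sp['track']['popularity'])
    duration_ms.append(sp['track']['duration_ms'])
    explicit.append(sp['track']['explicit'])
    track_id.append(sp['track']['id'])

def parse_timestamp(lst):
    timestamps = []
    for item in lst:
        try:
            timestamp = datetime.datetime.strptime(
                item, '%Y-%m-%dT%H:%M:%S.%fZ')
        except ValueError:
            # Assumed fallback: timestamps without fractional seconds
            timestamp = datetime.datetime.strptime(
                item, '%Y-%m-%dT%H:%M:%SZ')
        timestamps.append(timestamp)
    return timestamps
played_at = parse_timestamp(played_at)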

dataframe = pd.DataFrame(data={
    'track_id': track_id,
    'track': track_title,
    'artist': artist_name,
    'album': album_name,
    'popularity': popularity,
    'duration_ms': duration_ms,
    'explicit': explicit,
    'played_at': played_at})
dataframe = dataframe.set_index(dataframe['played_at'])
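
As an aside, pandas can also flatten nested JSON records directly with pd.json_normalize (a top-level function in pandas 1.0 and later). The sketch below is not what this notebook does; the single record is an illustrative example that assumes the archive entries consist of a played_at string plus a nested track object.

```python
import pandas as pd

# Illustrative record; the key layout is an assumption about the archive format.
records = [{
    'played_at': '2018-08-28T16:34:26.159Z',
    'track': {
        'id': '23fNGTRCTHQA3hywjKPVug',
        'name': 'Not Those Kind of People',
        'popularity': 25,
        'duration_ms': 194666,
        'explicit': False,
        'album': {'name': 'Fences'},
        'artists': [{'name': 'Bombadil'}],
    },
}]

# Nested dictionaries become dotted column names such as 'track.album.name'.
flat = pd.json_normalize(records)

# Lists are not expanded, so pick the first artist's name by hand.
flat['artist'] = flat['track.artists'].apply(lambda artists: artists[0]['name'])

print(flat[['track.name', 'artist', 'track.album.name', 'track.popularity']])
```

The manual loop above gives more control over column names, while json_normalize saves typing when many fields are needed.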

We can now take a look at some example rows of the dataframe:

In [3]:
dataframe.head()

played_at (index) | album | artist | duration_ms | explicit | played_at | popularity | track | track_id
2018-08-28 16:34:26.159 | Fences | Bombadil | 194666 | False | 2018-08-28 16:34:26.159 | 25 | Not Those Kind of People | 23fNGTRCTHQA3hywjKPVug
2018-08-28 16:36:30.082 | Appalachia (On the Back Porch) | Josiah and the Bonnevilles | 233814 | False | 2018-08-28 16:36:30.082 | 43 | Appalachia (On the Back Porch) | 2xpwxgIgOMuEuSLOdDZN02
2018-08-28 20:09:53.786 | Appalachia (On the Back Porch) | Josiah and the Bonnevilles | 233814 | False | 2018-08-28 20:09:53.786 | 43 | Appalachia (On the Back Porch) | 2xpwxgIgOMuEuSLOdDZN02
2018-08-28 20:15:21.790 | My Bubba & Elsa Sjunger Visor // Sing Swedish ... | My Bubba | 222106 | False | 2018-08-28 20:15:21.790 | 17 | Visa i molom | 2vzCa4sJ7ktUlEtfzVy74l
2018-08-28 20:33:31.199 | In The Magic Hour | Aoife O'Donovan | 212333 | False | 2018-08-28 20:33:31.199 | 31 | Detour Sign | 4dE4cTWfXbTUUzhvcHXwt6

We have the time at which we listened to a song, the title of the song, and the album on which it was released, along with the artist that made the recording. Furthermore we can see whether the song features explicit lyrics, its popularity, and a unique track ID (which we can use to identify whether we listened to the same song more than once).
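
As a small illustration of how the track ID can be used, the sketch below flags every play of a track that occurs more than once. It uses only the five example rows shown above, typed in by hand; the column name played_more_than_once is made up for this example.

```python
import pandas as pd

# The five example rows from the table above (track title and track ID only).
plays = pd.DataFrame({
    'track': [
        'Not Those Kind of People',
        'Appalachia (On the Back Porch)',
        'Appalachia (On the Back Porch)',
        'Visa i molom',
        'Detour Sign',
    ],
    'track_id': [
        '23fNGTRCTHQA3hywjKPVug',
        '2xpwxgIgOMuEuSLOdDZN02',
        '2xpwxgIgOMuEuSLOdDZN02',
        '2vzCa4sJ7ktUlEtfzVy74l',
        '4dE4cTWfXbTUUzhvcHXwt6',
    ],
})

# keep=False marks every occurrence of an ID that appears more than once.
plays['played_more_than_once'] = plays['track_id'].duplicated(keep=False)

print(plays[plays['played_more_than_once']])
```

Comparing IDs rather than titles avoids mixing up different songs that happen to share a name.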

Plotting and Analyzing our Spotify archive

Average song length

We start out by having a look at how long the songs we listen to are. For that we plot the distribution of song lengths found in the archive - along with the average song length:

In [4]:
%load_ext rpy2.ipython
In [5]:
%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)

# Convert milliseconds to minutes for both the histogram and the mean line
ggplot(dataframe,aes(duration_ms/1000/60)) + 
    geom_histogram(binwidth=0.3) + 
    scale_x_continuous('song length in minutes') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$duration_ms/1000/60),color='red') + 
    ggtitle('red bar is average length')

We see that in my case songs on average clock in at a bit over 4 minutes in length. There is, however, an interesting peak in the data below that. We will look into that later on!
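
The conversion from milliseconds to minutes used in the plot can also be done in pandas. The sketch below uses only the durations of the five example rows shown earlier, so its numbers differ from the average of the full archive.

```python
import pandas as pd

# Durations (in milliseconds) of the five example rows shown earlier.
duration_ms = pd.Series([194666, 233814, 233814, 222106, 212333])

# 1000 ms per second, 60 seconds per minute.
duration_min = duration_ms / 1000 / 60

mean_minutes = duration_min.mean()
median_minutes = duration_min.median()
print("mean: {:.2f} min, median: {:.2f} min".format(mean_minutes, median_minutes))
```

Comparing mean and median is a quick way to see whether a few very long or very short tracks pull the average around.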

When do we listen to music?

Let us have a look at when we listen to music. Unfortunately the data from Spotify isn't timezone aware and reports all times in UTC. Outside of daylight saving time (DST) UTC is the same as London time. During DST it is one hour behind London time.

Because of this, the best we can do is to approximate the timezone. To do this we can adjust the UTC_OFFSET variable at the top of the next cell. In my case the predominant timezone for the archive is California, so I set it to UTC_OFFSET <- -7. If you are in a different timezone, adjust this accordingly.

In [6]:
%%R -w 10 -h 2 --units in -r 200
UTC_OFFSET <- -7

# Date-time arithmetic in R works in seconds, so convert the offset from hours
dataframe$time <- as.numeric(format(dataframe$played_at + UTC_OFFSET*60*60, format = "%H"))

ggplot(dataframe,aes(time)) + 
    geom_histogram(binwidth=1) +
    scale_x_continuous('hour', limits=c(0,24)) + 
    theme_minimal() +
    ggtitle('When do I listen to music? (not time zone aware, estimated correct time zone)')

In my case the data contains lots of listening done in Europe, which is why there are songs played in the middle of the night, from midnight to 5am. Besides this there are two peaks, around 9am to 12pm and 1pm to 4pm, which are the times I'm usually in the office and listening to music.
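
A fixed offset cannot follow DST changes. If most of the listening happened in one place, a named timezone is an alternative. The following is a minimal pandas sketch, assuming the timestamps are UTC and that 'America/Los_Angeles' is the zone of interest; the second timestamp is made up to show the winter case.

```python
import pandas as pd

# Naive UTC timestamps: the first is from the example rows, the second is made up.
played_at = pd.Series(pd.to_datetime([
    '2018-08-28 16:34:26.159',  # during DST in California
    '2018-12-01 16:34:26.159',  # outside DST (hypothetical play)
]))

# Declare the timestamps as UTC, then convert to the local zone.
local = played_at.dt.tz_localize('UTC').dt.tz_convert('America/Los_Angeles')

# Hour of day in local time, ready for a histogram.
hours = local.dt.hour
print(hours.tolist())
```

The same UTC time of day maps to different local hours in summer and winter, which a single UTC_OFFSET value cannot express.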

We have seen that the Spotify archive contains some details on how popular individual tracks are. Let's have a look at whether the songs I listen to become more or less popular over time:

In [7]:
%%R -i dataframe
df2 <- aggregate(dataframe$popularity,by=list(as.Date(dataframe$played_at)),FUN=mean)

ggplot(df2,aes(Group.1, x)) + 
    geom_point() + stat_smooth(method='glm') + theme_minimal() + scale_x_date('date') + scale_y_continuous('average popularity')

While there is some small downward trend, this seems more an artifact of the last few days where the average is a bit lower and not some systematic drop in popularity.
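
The per-day averaging behind this plot can also be written in pandas. This sketch uses the five example rows from earlier plus two made-up plays on the following day, so the values are for illustration only.

```python
import pandas as pd

plays = pd.DataFrame({
    'played_at': pd.to_datetime([
        '2018-08-28 16:34:26', '2018-08-28 16:36:30', '2018-08-28 20:09:53',
        '2018-08-28 20:15:21', '2018-08-28 20:33:31',
        '2018-08-29 09:00:00', '2018-08-29 09:05:00',  # hypothetical plays
    ]),
    'popularity': [25, 43, 43, 17, 31, 50, 30],
})

# Group the plays by calendar day and average the popularity within each day.
daily_popularity = plays.groupby(plays['played_at'].dt.date)['popularity'].mean()
print(daily_popularity)
```

Days with only a handful of plays get the same weight as busy days in such a plot, which is one reason a few recent days can tilt the trend line.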

This raises the question: which artists do I listen to most? Let's have a look at the most played artists in my own Spotify data:

In [8]:
%%R -w 5 -h 2 --units in -r 200

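# Assumed ordering: sort the artist levels by play count so the bars are ordered
dataframe <- within(dataframe, 
                   artist <- factor(artist, 
                                    levels=names(sort(table(artist), decreasing=FALSE))))

# Play counts of the most played artists (summary() lumps the rest into "(Other)")
filter_list = as.data.frame(summary(dataframe$artist, maxsum=12))

# theme() comes after theme_minimal() so the text settings are not overwritten
ggplot(subset(dataframe,dataframe$artist %in% rownames(filter_list)),aes(artist)) + 
    geom_bar() + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) + 
    coord_flip()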

My own artist top list is dominated by Typhoon, which Spotify classifies as a Portland, OR-based Indie Rock band. From there on the listening counts drop sharply for the other artists and flatten out.
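
For readers who prefer to stay in Python, a top list like this can be computed with value_counts. The play counts in this sketch are made up and do not reflect the archive analysed here.

```python
import pandas as pd

# Made-up plays, one entry per play; only the artist column matters here.
artists = pd.Series([
    'Typhoon', 'Typhoon', 'Typhoon',
    'Bombadil', 'Bombadil',
    'My Bubba',
])

# value_counts sorts by count in descending order; head(n) keeps the top n.
top_artists = artists.value_counts().head(2)
print(top_artists)
```

The resulting series can be passed to top_artists.plot.barh() for a quick bar chart.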

Most played Tracks

Let's now look into the most played individual tracks:

In [9]:
%%R -w 5 -h 2 --units in -r 200

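# Assumed ordering: sort the track levels by play count so the bars are ordered
dataframe <- within(dataframe, 
                   track <- factor(track, 
                                   levels=names(sort(table(track), decreasing=FALSE))))

# Play counts of the most played tracks (summary() lumps the rest into "(Other)")
filter_list = as.data.frame(summary(dataframe$track, maxsum=10))

ggplot(subset(dataframe,dataframe$track %in% rownames(filter_list)),aes(track)) + 
    geom_bar() + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) + 
    coord_flip()  # assumed, mirroring the artist plot above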

That picture looks just as skewed, with the track Paper Forest by Emmy The Great topping the list. And if you look closely you'll see that this song alone explains all the plays for this artist as shown in the figure above! Similarly, the song Long Way From Home by the Lumineers accounts for most of the plays of this artist.

Now we can wonder: can the song Paper Forest explain the bump in the song-length distribution further above? Let's have a look:

In [10]:
# Select all plays of the song once, then report the length of the first one
paper_forest = dataframe[dataframe['track'] == 'Paper Forest (In the Afterglow of Rapture)']
if len(paper_forest):
    print("The song has a length of {} minutes".format(
        paper_forest['duration_ms'].iloc[0]/1000/60))
The song has a length of 3.6942166666666667 minutes

Yep, that fits the peak at less than 4 minutes in the song-length distribution pretty well!

My listening behaviour over time

Let's now investigate whether I listen more or less to music over time by plotting the number of songs played on a given day:

In [11]:
%%R -w 3 -h 3 --units in -r 200

dataframe$date <- as.Date(dataframe$played_at)

ggplot(dataframe,aes(date)) + 
    geom_histogram(binwidth=1) + theme_minimal()

There are two things going on. First of all, there is not much data for much of September, which makes sense, as I was traveling a lot during that time and didn't have much time to listen to music during conferences etc.

But there is also a pattern to be seen at the beginning of September, with days with lots of plays being interrupted by days with very little music consumption. Could this be an effect of weekdays vs. weekends?

In [12]:
%%R -w 4 -h 2 --units in -r 200
library(lubridate)

# Hours of music played per day
df2 <- aggregate(dataframe$duration_ms/1000/60/60,by=list(dataframe$date),FUN=sum)

df2$weekday <- wday(df2$Group.1, label=TRUE)
df2$weekend <- df2$weekday %in% c('Sun','Sat')

ggplot(df2,aes(x=weekend,y=x)) + 
    geom_violin() + theme_minimal() + scale_x_discrete('is it a weekend day?') + scale_y_continuous('hours of music played')

There seems to be at least a small effect that I play more music during weekdays (FALSE in the plot above) compared to weekends.
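
The weekday versus weekend comparison can be sketched in pandas as well. All plays below are made up, each lasting exactly half an hour, so that the arithmetic is easy to follow: 28 August 2018 was a Tuesday and 1 September 2018 a Saturday.

```python
import numpy as np
import pandas as pd

# Made-up plays of 30 minutes (1,800,000 ms) each.
plays = pd.DataFrame({
    'played_at': pd.to_datetime([
        '2018-08-28 16:00:00',  # Tuesday
        '2018-08-28 17:00:00',  # Tuesday
        '2018-09-01 10:00:00',  # Saturday
    ]),
    'duration_ms': [1_800_000, 1_800_000, 1_800_000],
})

# Sum the listening time per calendar day and convert milliseconds to hours.
day = plays['played_at'].dt.normalize()
daily_hours = plays.groupby(day)['duration_ms'].sum() / 1000 / 60 / 60

# dayofweek counts Monday as 0, so 5 and 6 are Saturday and Sunday.
kind = np.where(daily_hours.index.dayofweek >= 5, 'weekend', 'weekday')
mean_hours = daily_hours.groupby(kind).mean()
print(mean_hours)
```

Unlike weekday labels, the numeric dayofweek value does not depend on the locale of the machine running the notebook.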

What's up with all the songs that are played so often?

Let's try to see whether the Repeat One mode can explain why the two songs we saw above have been played so much more often than the other ones.

For that we use the unique track ID that Spotify gives to each song and check whether the song played just before had the same track ID. If so, we store this repetition and can then plot the data later on.

Below we create the repeat table of songs that have been played more than once in a row, along with the times and dates at which this happened:

In [13]:
ids = []
played_at = []
artist_track = []

last_id = ''
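for i in sp_data:
    # Keep a play if it has the same track ID as the play right before it
    # (the very first play is also kept, since last_id starts out empty)
    if last_id == i['track']['id'] or last_id == '':
        ids.append(i['track']['id'])
        played_at.append(i['played_at'])
        artist_track.append(i['track']['artists'][0]['name'] + ' - ' + i['track']['name'])
    last_id = i['track']['id']
played_at = [datetime.datetime.strptime(i,'%Y-%m-%dT%H:%M:%S.%fZ') for i in played_at]

repeats = pd.DataFrame(data={
    'track_id': ids,
    'artist_title': artist_track,
    'played_at': played_at})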
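
The same comparison with the previous play can also be written without an explicit loop by shifting the track ID column by one row. This is a sketch on made-up IDs and assumes the rows are sorted by play time. Unlike the loop above, it does not keep the very first play.

```python
import pandas as pd

# Made-up track IDs in the order in which they were played.
plays = pd.DataFrame({'track_id': ['a', 'b', 'b', 'b', 'c', 'a']})

# shift() moves every ID down one row, so each row is compared with the
# play right before it; the first row is compared with a missing value.
plays['is_repeat'] = plays['track_id'] == plays['track_id'].shift()

repeat_rows = plays[plays['is_repeat']]
print(repeat_rows)
```

Only the second and third play of 'b' are flagged here. The later play of 'a' is not, because another track was played in between.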

Now we can plot this data. On the X axis we keep the date/time of when the song was played and on the Y axis we have the different songs that were repeated at least once. The plot then shows us when these repeats happened:

In [14]:
%%R -i repeats -w 15 -h 4 --units in -r 500
ggplot(repeats, aes(x=played_at,y=artist_title)) + 
    #geom_line() + 
    geom_point() + 
    theme_minimal() +
    theme(axis.text=element_text(size=13)) + 
    scale_x_datetime('played at') + 
    scale_y_discrete('artist - tracktitle')

And indeed, we can see that both The Lumineers' song Long Way From Home and Emmy The Great's Paper Forest have ramped up their counts by being played on Repeat One.

And at least for The Lumineers my calendar provides an easy explanation: I fell asleep on my flight from San Francisco to Frankfurt without ever turning off Spotify. 😂

What did you find in your own data?

In [ ]: