Description

The Spotify integration now also collects further metadata on the tracks you listen to, such as the key and mode of each song and predictions of whether a song is a live recording, instrumental, etc. Explore this new data with this notebook!

Tags & Data Sources

music Spotify integration

Notebook
Last updated 5 years, 2 months ago

Analyze your Spotify listening history¶

This Notebook requires you to have data from the Spotify integration in your Open Humans account.

With the notebook we want to look into

Which artists do you listen to?
Which tracks do you listen to?
How much do you listen to music on a given day?
How popular is your music taste on Spotify?
Do you listen to the same songs for long stretches?

To get started we import some libraries we need and then access your Spotify data.

Now that we have all of your data, we want to transform the rather complex Spotify JSON format into something that is easier to read: a simple table, also called a dataframe. The lines below do this:

We can now look at the dataframe and some of the example data it holds:

Out[3]:

	track_id	track	artist	album	popularity	duration_ms	explicit	played_at
played_at
2018-10-17 22:28:10.219	2YxLFAW82UprL82e9brViV	Ordinary Man	Christy Moore	On the Road	46	235346	False	2018-10-17 22:28:10.219
2018-10-17 22:31:45.194	37xwNDTrrOxUGiREjPV8os	Ride On	Christy Moore	On the Road	34	215080	False	2018-10-17 22:31:45.194
2018-10-17 22:36:10.358	7r8uwyBaQG3k7ISC9GHX8e	Joxer Goes to Stuttgart	Christy Moore	On the Road	35	265146	False	2018-10-17 22:36:10.358
2018-10-17 22:39:26.908	2xsMuhrWv7FBEQczFAZFkY	Black Is the Colour	Christy Moore	On the Road	34	196653	False	2018-10-17 22:39:26.908
2018-10-17 22:44:28.672	1G23TVzQG1rMOXzMLqGJbE	Don't Forget Your Shovel	Christy Moore	On the Road	36	301706	False	2018-10-17 22:44:28.672

We get the time at which we listened to a song, the title of the song and the album on which it was released, along with the artist that made the recording. Furthermore, we can see whether the song features explicit lyrics, its popularity, its duration in milliseconds and a unique track ID (which we can use to identify whether we listened to the same song more than once).

Plotting and Analyzing our Spotify archive¶

Average song length¶

We start out by having a look at how long the songs we listen to are. For that we plot the distribution of song lengths found in the archive - along with the average song length:

We see that in my case songs on average clock in at a bit over 4 minutes. Though there is an interesting peak in the data below that. We will look into that later on!

When do we listen to music?¶

Let us have a look at when we listen to music. Unfortunately the data from Spotify isn't timezone aware and reports all times in UTC. Outside of daylight saving time (DST) this matches London time. During DST, UTC is one hour behind London time.

Due to this the best we can do is to approximate the timezone. To do this we can adjust the UTC_OFFSET variable at the top of the next cell. In my case the predominant timezone for the archive is California, so I set it to UTC_OFFSET <- -7. If you are in a different timezone, adjust this accordingly.

In my case the data contains lots of listening done in Europe, which is why there are songs played in the middle of the night from midnight to 5am. But besides this there are two peaks around 9am to 12pm and 1pm to 4pm, which is the time I'm usually in the office and listening to music.
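A fixed offset like the one described above cannot follow daylight saving time. As a minimal sketch of an alternative, assuming a reasonably recent pandas and that a single named timezone fits most of the archive, the UTC timestamps can be converted with a timezone name instead. The sample timestamps and the choice of `America/Los_Angeles` are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample of naive UTC timestamps, standing in for dataframe['played_at'].
played_at = pd.Series(pd.to_datetime([
    "2018-10-17 22:28:10.219",
    "2018-10-18 03:05:00.000",
    "2018-12-01 17:30:00.000",
]))

# Mark the naive timestamps as UTC, then convert to a named timezone.
# 'America/Los_Angeles' is an assumption; pick the zone that fits your archive.
local = played_at.dt.tz_localize("UTC").dt.tz_convert("America/Los_Angeles")

# Hour of day (0-23) in local time, ready for a histogram.
hours = local.dt.hour
print(hours.tolist())
```

The first two timestamps fall into DST (UTC-7) and the last one does not (UTC-8), which a single fixed offset would get wrong. Listening done while traveling in other timezones is still misplaced, just as with the fixed offset.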

How popular is my music choice over time?¶

We have seen that the Spotify archive contains some details on how popular individual tracks are. Let's have a look at whether the songs I listen to become more or less popular over time:

While there is some small downward trend, this seems more an artifact of the last few days where the average is a bit lower and not some systematic drop in popularity.

Most popular artists¶

But this raises the question: which are the artists I listen to most? Let's have a look at the artists most played in my own Spotify data:

My own artist top list is dominated by Typhoon, which Spotify classifies as a Portland, OR-based Indie Rock band. From there on the listening counts drop sharply for the other artists and flatten out.

Most played Tracks¶

Let's now look into the most played individual tracks:

That picture looks just as skewed, with the track Paper Forest by Emmy The Great topping the list. And if you look closely you'll see that this song alone explains all the plays for this artist shown in the figure above! Similarly, the song Long Way From Home by The Lumineers accounts for most of the plays of that artist.
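To put a number on how much a single track contributes to an artist's play count, one could compute each track's share of its artist's plays. The following is a rough sketch with made-up placeholder data; only the column names `artist` and `track` are taken from the dataframe used in this notebook:

```python
import pandas as pd

# Hypothetical miniature listening table with the same column names as `dataframe`.
plays = pd.DataFrame({
    "artist": ["Artist A"] * 3 + ["Artist B"] * 4 + ["Artist C"] * 2,
    "track": ["Song 1"] * 3 + ["Song 2"] * 3 + ["Song 3"] + ["Song 4", "Song 5"],
})

# Plays per (artist, track) pair and total plays per artist.
track_counts = plays.groupby(["artist", "track"]).size().rename("plays").reset_index()
artist_totals = plays.groupby("artist").size()

# Share of the artist's plays that each track accounts for.
track_counts["share_of_artist"] = (
    track_counts["plays"] / track_counts["artist"].map(artist_totals)
)

print(track_counts.sort_values("plays", ascending=False))
```

A share of 1.0 means a single track explains all plays of that artist, which is the situation described for Paper Forest above.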

Now we can wonder: can the song Paper Forest maybe explain the bump in the song-length distribution further above? Let's have a look:

Yep, that fits the peak at less than 4 minutes in the song-length distribution pretty well!

My listening behaviour over time¶

Let's now investigate whether I listen more or less to music over time by plotting the number of songs played on a given day:

There are two things going on. First of all, there is not much data for much of September, which makes sense, as I was traveling a lot during that time and didn't have much time to listen to music during conferences etc.

But there is also a pattern to be seen at the beginning of September, with days with lots of plays being interrupted by days with very little music consumption. Could this be an effect of weekdays vs. weekends?

There seems to be at least a small effect that I play more music during weekdays (FALSE in the plot above) compared to weekends.

What's up with all the songs that are played so often?¶

Let's try to see whether Repeat One can explain why the two songs we saw above have been played so much more often than the other ones.

For that we use the unique track ID that Spotify gives to each song and check whether the song played just before had the same track ID. If so we store this repetition and can then plot the data later on.

Below we create the repeat table of songs that have been played more than once after each other along with the times and dates this happened:

Now we can plot this data. On the X axis we keep the date/time of when the song was played and on the Y axis we have the different songs that were repeated at least once. The plot then shows us when these repeats happened:

And indeed, we can see that both The Lumineers' song Long Way From Home and Emmy The Great's Paper Forest have ramped up their counts through being played on Repeat One.

And at least for The Lumineers my calendar provides an easy explanation: I fell asleep on my flight from San Francisco to Frankfurt without ever turning off Spotify. 😂

Characteristics of the songs I listen to¶

Spotify not only gives you the metadata about how long songs are, but also provides some automatic classifications of those songs. Besides more traditional musical dimensions, like which key a song is in, whether it is in major or minor mode and how loud the recording is, there are some further characteristics. These include things like the danceability of a song, its energy and valence (basically a mood score) as well as scores for whether the song is instrumental, a live recording or an acoustic track.

We can dig into those characteristics. Let's start out by looking at the classical measures: Key, Mode & Volume:

Key¶

Mode¶

Volume / Loudness¶

This measures the average volume of a track in decibels (dB). Values typically range between -60 and 0 dB.

Other metrics¶

Danceability¶

The score ranges from 0-1, with 0 being least danceable and 1 being most danceable.

Acousticness¶

The score ranges from 0-1, with 0 being least likely to be an acoustic track and 1 being most likely.

Liveness¶

This tries to predict whether a song is a live recording or not. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
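The 0.8 cut-off mentioned above can be applied directly to the liveness column. This is a small sketch with invented scores; only the column name `liveness` and the threshold come from the text:

```python
import pandas as pd

# Hypothetical liveness scores, standing in for dataframe['liveness'].
tracks = pd.DataFrame({
    "track": ["Song 1", "Song 2", "Song 3", "Song 4"],
    "liveness": [0.12, 0.85, 0.34, 0.91],
})

LIVE_THRESHOLD = 0.8  # cut-off quoted in the text above

# Flag tracks whose score is above the threshold.
tracks["probably_live"] = tracks["liveness"] > LIVE_THRESHOLD
live_tracks = tracks.loc[tracks["probably_live"], "track"].tolist()
share_live = tracks["probably_live"].mean()

print(live_tracks)
print("{:.0%} of tracks are probably live recordings".format(share_live))
```

Since the score is a prediction, a flagged track is only likely to be live, not guaranteed to be.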

Valence¶

According to Spotify: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Energy¶

According to Spotify: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

Analyze your Spotify listening history¶

This Notebook requires you to have data from the Spotify integration in your Open Humans account.

With the notebook we want to look into

Which artists do you listen to?
Which tracks do you listen to?
How much do you listen to music on a given day?
How popular is your music taste on Spotify?
Do you listen to the same songs for long stretches?

To get started we import some libraries we need and then access your Spotify data.

In [46]:

from ohapi import api
import os
import requests
import json
import pandas as pd
import datetime

member = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
for f in member['data']:
    if f['source'] == 'direct-sharing-176' and f['basename'] == 'spotify-listening-archive.json':
        sp_songs = requests.get(f['download_url'])
    if f['source'] == 'direct-sharing-176' and f['basename'] == 'spotify-track-metadata.json':
        sp_meta = requests.get(f['download_url'])
        
sp_data = json.loads(sp_songs.content)
sp_metadata = json.loads(sp_meta.content)

In [78]:

track_title = []
artist_name = []
album_name = []
played_at = []
popularity = []
duration_ms = []
explicit = []
track_id = []

danceability = []
energy = [] 
key = [] 
loudness = [] 
mode = []
speechiness = []
acousticness = []
instrumentalness = []
liveness = []
valence = []

key_translation = {
    '0': "C",
    "1": "C#",
    "2": "D",
    "3": "D#",
    "4": "E", 
    "5": "F",
    "6": "F#",
    "7": "G",
    "8": "G#",
    "9": "A",
    "10": "A#",
    "11": "B"
}

mode_translation = {'1':'major', '0': 'minor'}

for sp in sp_data:
    track_title.append(sp['track']['name'])
    artist_name.append(sp['track']['artists'][0]['name'])
    album_name.append(sp['track']['album']['name'])
    played_at.append(sp['played_at'])
    popularity.append(sp['track']['popularity'])
    duration_ms.append(sp['track']['duration_ms'])
    explicit.append(sp['track']['explicit'])
    track_id.append(sp['track']['id'])
    danceability.append(sp_metadata[sp['track']['id']]['danceability'])
    energy.append(sp_metadata[sp['track']['id']]['energy'])
    key.append(key_translation[str(sp_metadata[sp['track']['id']]['key'])])
    loudness.append(sp_metadata[sp['track']['id']]['loudness'])
    mode.append(mode_translation[str(sp_metadata[sp['track']['id']]['mode'])])
    speechiness.append(sp_metadata[sp['track']['id']]['speechiness'])
    acousticness.append(sp_metadata[sp['track']['id']]['acousticness'])
    instrumentalness.append(sp_metadata[sp['track']['id']]['instrumentalness'])
    liveness.append(sp_metadata[sp['track']['id']]['liveness'])
    valence.append(sp_metadata[sp['track']['id']]['valence'])
    
def parse_timestamp(lst):
    timestamps = []
    for item in lst:
        try:
            timestamp = datetime.datetime.strptime(
                            item,
                            '%Y-%m-%dT%H:%M:%S.%fZ')
        except ValueError:
            timestamp = datetime.datetime.strptime(
                    item,
                    '%Y-%m-%dT%H:%M:%SZ')
        timestamps.append(timestamp)
    return timestamps
    
played_at = parse_timestamp(played_at)

dataframe = pd.DataFrame(data={
    'track_id': track_id,
    'track': track_title,
    'artist': artist_name,
    'album': album_name,
    'popularity': popularity,
    'duration_ms': duration_ms,
    'explicit': explicit,
    'played_at': played_at,
    'danceability': danceability,
    'energy': energy,
    'key': key,
    'loudness': loudness,
    'mode': mode,
    'speechiness': speechiness,
    'acousticness': acousticness,
    'instrumentalness': instrumentalness,
    'liveness': liveness,
    'valence': valence

})
dataframe = dataframe.set_index(dataframe['played_at'])
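As a possible shortcut for the per-field lists above, newer pandas versions (1.0 and later, which is an assumption about your environment) offer `pd.json_normalize` to flatten nested JSON records. The sketch below uses one hand-written record that mimics the structure read by the loop above; it is not the method used in this notebook and it leaves out the merge with the track metadata:

```python
import pandas as pd

# Hypothetical record mimicking the structure the loop above reads from sp_data.
sample = [{
    "played_at": "2018-10-17T22:28:10.219Z",
    "track": {
        "id": "2YxLFAW82UprL82e9brViV",
        "name": "Ordinary Man",
        "popularity": 46,
        "duration_ms": 235346,
        "explicit": False,
        "album": {"name": "On the Road"},
        "artists": [{"name": "Christy Moore"}],
    },
}]

# Flatten nested dicts into dotted column names such as 'track.album.name'.
flat = pd.json_normalize(sample)

# Lists are not flattened, so pull the first artist's name out by hand.
flat["artist"] = flat["track.artists"].apply(lambda artists: artists[0]["name"])
flat["played_at"] = pd.to_datetime(flat["played_at"])

print(flat[["track.name", "artist", "track.album.name", "track.duration_ms"]])
```

The explicit lists in the cell above are longer but make it obvious which fields end up in the table, and they work on older pandas versions too.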

We can now look at the dataframe and some of the example data it holds:

In [3]:

dataframe.head()

Out[3]:

	track_id	track	artist	album	popularity	duration_ms	explicit	played_at
played_at
2018-10-17 22:28:10.219	2YxLFAW82UprL82e9brViV	Ordinary Man	Christy Moore	On the Road	46	235346	False	2018-10-17 22:28:10.219
2018-10-17 22:31:45.194	37xwNDTrrOxUGiREjPV8os	Ride On	Christy Moore	On the Road	34	215080	False	2018-10-17 22:31:45.194
2018-10-17 22:36:10.358	7r8uwyBaQG3k7ISC9GHX8e	Joxer Goes to Stuttgart	Christy Moore	On the Road	35	265146	False	2018-10-17 22:36:10.358
2018-10-17 22:39:26.908	2xsMuhrWv7FBEQczFAZFkY	Black Is the Colour	Christy Moore	On the Road	34	196653	False	2018-10-17 22:39:26.908
2018-10-17 22:44:28.672	1G23TVzQG1rMOXzMLqGJbE	Don't Forget Your Shovel	Christy Moore	On the Road	36	301706	False	2018-10-17 22:44:28.672

Plotting and Analyzing our Spotify archive¶

Average song length¶

We start out by having a look at how long the songs we listen to are. For that we plot the distribution of song lengths found in the archive - along with the average song length:

In [60]:

%load_ext rpy2.ipython

In [5]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(duration_ms/1000/60)) + 
    geom_histogram(binwidth=0.3) + 
    scale_x_continuous('song length in minutes') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$duration_ms/1000/60),color='red') + ggtitle('red bar is average length')

We see that in my case songs on average clock in at a bit over 4 minutes. Though there is an interesting peak in the data below that. We will look into that later on!

When do we listen to music?¶

In [6]:

%%R -w 10 -h 2 --units in -r 200

UTC_OFFSET <- -7

UTC_OFFSET <- UTC_OFFSET*60*60
dataframe$time <- as.numeric(format(dataframe$played_at+UTC_OFFSET, format = "%H"))

ggplot(dataframe,aes(time)) + 
    geom_histogram(binwidth=1) +
    scale_x_continuous(limit=c(0,24),'hour') + 
    theme_minimal() +
    ggtitle('When do I listen to music?')

How popular is my music choice over time?¶

We have seen that the Spotify archive contains some details on how popular individual tracks are. Let's have a look at whether the songs I listen to become more or less popular over time:

In [12]:

%%R -i dataframe
library(ggplot2)
df2 <- aggregate(dataframe$popularity,by=list(as.Date(dataframe$played_at)),FUN=mean)

ggplot(df2,aes(Group.1, x)) + 
    geom_point() + stat_smooth(method='glm') + theme_minimal() + scale_x_date('date') + scale_y_continuous('average popularity')

While there is some small downward trend, this seems more an artifact of the last few days where the average is a bit lower and not some systematic drop in popularity.
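To judge the trend with a number rather than by eye, one could fit a straight line through the daily averages. The sketch below does this with `numpy.polyfit` on invented data; it is an illustration in Python of an ordinary least-squares line, not a re-implementation of the R plot above:

```python
import numpy as np
import pandas as pd

# Hypothetical plays, standing in for dataframe[['played_at', 'popularity']].
plays = pd.DataFrame({
    "played_at": pd.to_datetime([
        "2018-10-01 09:00", "2018-10-01 15:00",
        "2018-10-02 10:00",
        "2018-10-03 11:00", "2018-10-03 18:00",
    ]),
    "popularity": [50, 40, 43, 42, 40],
})

# Mean popularity per calendar day.
daily = plays.groupby(plays["played_at"].dt.date)["popularity"].mean()

# Days elapsed since the first day, so gaps between dates are respected.
dates = pd.to_datetime(daily.index)
days = (dates - dates[0]).days.to_numpy()

# Least-squares line through the daily means.
slope, intercept = np.polyfit(days, daily.to_numpy(), deg=1)
print("change in average popularity per day: {:.2f}".format(slope))
```

A slope close to zero relative to the day-to-day scatter would support reading the downward trend as an artifact of a few days.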

Most popular artists¶

But this raises the question: which are the artists I listen to most? Let's have a look at the artists most played in my own Spotify data:

In [13]:

%%R -w 5 -h 2 --units in -r 200

dataframe <- within(dataframe, 
                   artist <- factor(artist, 
                                      levels=names(sort(table(artist), 
                                                        decreasing=TRUE))))

filter_list = as.data.frame(summary(dataframe$artist, max=12))

# theme_minimal() comes first so that it does not reset the axis text settings
ggplot(subset(dataframe,dataframe$artist %in% rownames(filter_list)),aes(artist)) + 
    geom_bar() + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) + 
    coord_flip()

Most played Tracks¶

Let's now look into the most played individual tracks:

In [14]:

%%R -w 5 -h 2 --units in -r 200

dataframe <- within(dataframe, 
                   track <- factor(track, 
                                      levels=names(sort(table(track), 
                                                        decreasing=TRUE))))

filter_list = as.data.frame(summary(dataframe$track, max=10))

ggplot(subset(dataframe,dataframe$track %in% rownames(filter_list)),aes(track)) + 
    geom_bar() + 
    theme_minimal() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1,size=5)) + 
    coord_flip()

Now we can wonder: can the song Paper Forest maybe explain the bump in the song-length distribution further above? Let's have a look:

In [10]:

paper_forest = dataframe[dataframe['track'] == 'Paper Forest (In the Afterglow of Rapture)']
if len(paper_forest):
    # every play of the track has the same duration, so the first row is enough
    print("The song has a length of {} minutes".format(
        paper_forest['duration_ms'].iloc[0]/1000/60))

Yep, that fits the peak at less than 4 minutes in the song-length distribution pretty well!

My listening behaviour over time¶

Let's now investigate whether I listen more or less to music over time by plotting the number of songs played on a given day:

In [11]:

%%R -w 3 -h 3 --units in -r 200

dataframe$date <- as.Date(dataframe$played_at)

ggplot(dataframe,aes(date)) + 
    geom_histogram(binwidth=1) + theme_minimal()

In [12]:

%%R -w 4 -h 2 --units in -r 200
df2 <- aggregate(dataframe$duration_ms/1000/60/60,by=list(dataframe$date),FUN=sum)
library(lubridate)

df2$weekday <- wday(df2$Group.1, label=TRUE)
df2$weekend <- df2$weekday %in% c('Sun','Sat')


ggplot(df2,aes(x=weekend,y=x)) + 
    geom_violin() + theme_minimal() + scale_x_discrete('is it a weekend day?') + scale_y_continuous('hours of music played')

/opt/conda/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: 
Attaching package: ‘lubridate’


  warnings.warn(x, RRuntimeWarning)
/opt/conda/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:186: RRuntimeWarning: The following object is masked from ‘package:base’:

    date


  warnings.warn(x, RRuntimeWarning)

There seems to be at least a small effect that I play more music during weekdays (FALSE in the plot above) compared to weekends.
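The same weekday versus weekend comparison can be summarised as two averages. This is a rough sketch on invented data. Like the R cell above, it sums track durations per day, which assumes every track was played to the end:

```python
import numpy as np
import pandas as pd

# Hypothetical plays, standing in for dataframe[['played_at', 'duration_ms']].
plays = pd.DataFrame({
    "played_at": pd.to_datetime([
        "2018-10-12 09:00",  # Friday
        "2018-10-12 10:00",  # Friday
        "2018-10-13 11:00",  # Saturday
        "2018-10-15 14:00",  # Monday
    ]),
    "duration_ms": [3600000, 1800000, 1800000, 7200000],
})

# Hours of music per calendar day.
hours_per_day = (
    plays.groupby(plays["played_at"].dt.normalize())["duration_ms"].sum() / 3600000
)

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
labels = np.where(hours_per_day.index.dayofweek >= 5, "weekend", "weekday")

# Average hours per day for each group.
summary = hours_per_day.groupby(labels).mean()
print(summary)
```

Days without any plays do not appear in the table at all, so both averages only describe days on which music was played.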

What's up with all the songs that are played so often?¶

Let's try to see whether Repeat One can explain why the two songs we saw above have been played so much more often than the other ones.

Below we create the repeat table of songs that have been played more than once after each other along with the times and dates this happened:

In [13]:

ids = []
played_at = []
artist_track = []

# Walk through the plays in order and keep every play whose track ID
# matches the track played immediately before it.
last_id = None
for i in sp_data:
    if last_id == i['track']['id']:
        played_at.append(i['played_at'])
        ids.append(i['track']['id'])
        artist_track.append(i['track']['artists'][0]['name'] + ' - ' + i['track']['name'])
    last_id = i['track']['id']

# parse_timestamp (defined above) also handles timestamps without fractional seconds
played_at = parse_timestamp(played_at)

repeats = pd.DataFrame(data={
    'track_id': ids,
    'artist_title': artist_track,
    'played_at': played_at}
)
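The check for back-to-back plays of the same track ID can also be written without an explicit loop. The following sketch uses `Series.shift` on invented data and assumes the rows can be sorted by `played_at`; the column names follow the dataframe built earlier in this notebook:

```python
import pandas as pd

# Hypothetical plays, standing in for `dataframe`.
plays = pd.DataFrame({
    "played_at": pd.to_datetime([
        "2018-10-17 22:00", "2018-10-17 22:04", "2018-10-17 22:08",
        "2018-10-17 22:12", "2018-10-17 22:16",
    ]),
    "track_id": ["a", "b", "b", "b", "c"],
    "artist": ["Artist A", "Artist B", "Artist B", "Artist B", "Artist C"],
    "track": ["Song 1", "Song 2", "Song 2", "Song 2", "Song 3"],
})

plays = plays.sort_values("played_at")

# A play is a repeat when the previous play had the same track ID.
# shift() yields a missing value for the first row, so the first play never counts.
is_repeat = plays["track_id"] == plays["track_id"].shift()

repeated = plays[is_repeat]
repeats_table = pd.DataFrame({
    "track_id": repeated["track_id"],
    "artist_title": repeated["artist"] + " - " + repeated["track"],
    "played_at": repeated["played_at"],
})
print(repeats_table)
```

Three plays in a row show up as two repeats here, because the first play of a streak has a different predecessor.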

In [14]:

%%R -i repeats -w 15 -h 6 --units in -r 500
library(ggplot2)
head(repeats$played_at)
ggplot(repeats, aes(x=played_at,y=artist_title)) + 
    #geom_line() + 
    geom_point() + 
    theme_minimal() +
    theme(axis.text=element_text(size=13)) + 
    scale_x_datetime('played at') + 
    scale_y_discrete('artist - tracktitle')

And indeed, we can see that both The Lumineers' song Long Way From Home and Emmy The Great's Paper Forest have ramped up their counts through being played on Repeat One.

And at least for The Lumineers my calendar provides an easy explanation: I fell asleep on my flight from San Francisco to Frankfurt without ever turning off Spotify. 😂

Characteristics of the songs I listen to¶

We can dig into those characteristics. Let's start out by looking at the classical measures: Key, Mode & Volume:

Key¶

In [87]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(key)) + 
    geom_bar() + 
    scale_x_discrete('key') + 
    theme_minimal()

Mode¶

In [90]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(mode)) + 
    geom_bar() + 
    scale_x_discrete('mode') + 
    theme_minimal()

Volume / Loudness¶

This measures the average volume of a track in decibels (dB). Values typically range between -60 and 0 dB.

In [92]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(loudness)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('loudness') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$loudness),color='red') + ggtitle('red bar is average')

Other metrics¶

Danceability¶

The score ranges from 0-1, with 0 being least danceable and 1 being most danceable.

In [93]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(danceability)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('danceability') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$danceability),color='red') + ggtitle('red bar is average')

Acousticness¶

The score ranges from 0-1, with 0 being least likely to be an acoustic track and 1 being most likely.

In [94]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(acousticness)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('acousticness') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$acousticness),color='red') + ggtitle('red bar is average')

Liveness¶

In [95]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(liveness)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('liveness') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$liveness),color='red') + ggtitle('red bar is average')

Valence¶

In [81]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(valence)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('valence') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$valence),color='red') + ggtitle('red bar is average')

Energy¶

In [85]:

%%R -i dataframe -w 4 -h 2 --units in -r 200
library(ggplot2)
ggplot(dataframe,aes(energy)) + 
    geom_histogram(binwidth=0.1) + 
    scale_x_continuous('energy') + 
    theme_minimal() + 
    geom_vline(xintercept=mean(dataframe$energy),color='red') + ggtitle('red bar is average')

Details for spotify-archive-analyses-extended.ipynb

Published by gedankenstuecke
