Details for Google Fit exploration.ipynb

Published by carolinux


Google Fit Exploration

Google Fit is an Android app that stores your step count and other activity-related metrics. It can also interface with other devices and pull data from them. At Open Humans, we have built an integration that queries the Google Fit API daily to fetch step count, distance, calories expended, and active minutes at per-minute granularity. Go here to connect your Google Fit data with your Open Humans account if you haven't done so already.

An important thing to know about Google Fit data is that a given metric can be computed from different sources. For example, distance covered can be computed from raw location data or derived from the step counter. Google Fit stores the same metric multiple times, once per source, and each entry is tagged with the name of the source that produced the data. We will see concrete examples of this as we go along.
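As a rough, hypothetical sketch (the metric and source names below are illustrative, not taken from a real export), the nested layout looks like one dictionary of metrics, each mapping to one sub-dictionary per source:

```python
# Illustrative sketch of the export layout: metric -> source -> per-day data.
# The source strings here are shortened placeholders, not real identifiers.
datasets = {
    "com.google.distance.delta": {
        "derived:...:from_high_accuracy_location": {"2018-12-05": {"bucket": []}},
        "derived:...:from_steps": {"2018-12-05": {"bucket": []}},
    }
}

# The same metric appears once per source:
sources = sorted(datasets["com.google.distance.delta"].keys())
print(sources)
```

This is why the loading code below iterates over both metric names and source names when building the dataframe.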

This notebook will show how to load the data that has been uploaded to Open Humans and ask questions across all your stored activity history.

In [1]:
%matplotlib inline
import json
import pandas as pd
from datetime import datetime
import pytz
import sys
import requests
from functools import reduce
import os

Loading the data

The Google Fit data is stored on Open Humans in monthly JSON files. We are going to download all of them.

Make sure to set your own timezone. (List of timezone names)
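As a quick sanity check (a minimal sketch, assuming pytz is available, as the imports above suggest), you can verify a timezone string before using it; pytz raises UnknownTimeZoneError for names not in the tz database:

```python
import pytz

# Validate the timezone name before it is used for conversions below.
TIMEZONE = 'Europe/London'
assert TIMEZONE in pytz.all_timezones
tz = pytz.timezone(TIMEZONE)
print(tz)
```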

In [2]:
token = os.environ.get('OH_ACCESS_TOKEN')
response = requests.get("https://www.openhumans.org/api/direct-sharing/project/exchange-member/?access_token={}".format(token))

# MONTH = '2018-12'
TIMEZONE = 'Europe/London'

user = json.loads(response.content.decode("utf-8"))
month_data = []
for dset in sorted(user['data'], key=lambda x:x['id']):
    if 'Google Fit' in dset['metadata']['tags']: #and dset['metadata']['month'] == MONTH:
        raw_data = requests.get(dset['download_url']).content
        month_data.append(json.loads(raw_data.decode("utf-8")))

We've written some functions to parse this JSON and load it into a pandas dataframe, accounting for data irregularities such as metrics and sources appearing and disappearing over time, or metric/source pairs having no data. The end result is a dataframe with columns step_count.delta.01, step_count.delta.02, etc., i.e., each distinct source for a metric is suffixed with a number.

In [3]:
def get_all_metrics(data):
    return sorted(data['datasets'].keys())

def get_all_data_sources_for_metric(data, metric):
    return sorted(data['datasets'][metric].keys())

def get_all_metrics_and_sources_pairs(data):
    """ Get all data types and sources present in the data """
    res = []
    for metric in get_all_metrics(data):
        sources = get_all_data_sources_for_metric(data, metric)
        for src in sources:
            res.append((metric, src))
    return res


def get_dataframe(dataset, dt_timezone, col_name):
    ts = []
    for day in dataset.keys():
        data = dataset[day].get('bucket', [])
        for datum in data:
            if datum['dataset'][0]['point'] == []:
                value = 0
            else:
                try:
                    value = datum['dataset'][0]['point'][0]['value'][0]['intVal']
                except KeyError:
                    # some metrics (e.g. distance, calories) store floats
                    value = datum['dataset'][0]['point'][0]['value'][0]['fpVal']

            start_ms = datum['startTimeMillis']
            start_sec = int(start_ms) / 1000
            # build a timezone-aware UTC datetime, then convert to local time
            dt = datetime.fromtimestamp(start_sec, tz=pytz.utc)
            dt_local = dt.astimezone(pytz.timezone(dt_timezone))
            ts.append((dt_local, value))

    if len(ts) == 0:
        return None
    df = pd.DataFrame(ts)
    df.columns = ['time', col_name]
    df = df.set_index('time')
    return df


def generate_unique_column_name(metric, existing_names):
    # turn com.google.step_count.delta into step_count.delta
    metric_formatted = '.'.join(metric.split('.')[2:])
    i = 1
    while True:
        name = metric_formatted + '.' + str(i).zfill(2)
        if name in existing_names:
            i += 1
        else:
            return name


def get_dataframe_with_metrics_and_sources(dts_pairs, col_names, timezone, data):
    dfs = []
    for (metric, dsource), col_name in zip(dts_pairs, col_names):
        dts_data = data['datasets'].get(metric, {}).get(dsource, {})
        df = get_dataframe(dts_data, timezone, col_name=col_name)
        if df is not None:
            dfs.append(df)
    if len(dfs) > 0:
        df = reduce(lambda acc, df: acc.join(df, how='outer'), dfs)
        df = df.fillna(0)
    else:
        df = None
    return df


def get_renames(columns):
    renames = {}
    columns = sorted(columns)
    new_columns = []
    for col in columns:
        new_col = get_renamed_column(col, new_columns)
        new_columns.append(new_col)
        renames[col] = new_col
    return renames

def get_renamed_column(col, new_columns):
    num = int(col.split('.')[-1])
    if num == 1:
        return col
    while (num > 1):
        current_possible_name = '.'.join(col.split('.')[:-1])+ '.' + str(num).zfill(2)
        previous_possible_name = '.'.join(col.split('.')[:-1])+ '.' + str(num-1).zfill(2)
        if previous_possible_name not in new_columns:
            num = num - 1
        else:
            return current_possible_name
    return previous_possible_name
        
    

def googlefit_jsons_to_df(data, timezone):
    dts_pairs = set()
    mapping = {}
    dfs = []
    
    for month_data in data:
        current_dts_pairs = set(get_all_metrics_and_sources_pairs(month_data))
        dts_pairs = dts_pairs | current_dts_pairs
    dts_pairs = sorted(list(dts_pairs))
    col_names = []
    for dtype, dsource in dts_pairs:
        col_name = generate_unique_column_name(dtype, col_names)
        col_names.append(col_name)
        mapping[col_name] = dsource

    for month_data in data:
        df = get_dataframe_with_metrics_and_sources(dts_pairs, col_names, timezone, month_data)
        if df is not None:
            dfs.append(df)
    if len(dfs) == 0:
        return None, mapping
    result = pd.concat(dfs).fillna(0)
    column_renames = get_renames(result.columns)
    for col, source in mapping.copy().items():
        new_col = column_renames.get(col)
        if new_col is None:
            continue
        mapping[new_col] = source
    return result.rename(columns=column_renames), mapping
In [4]:
df, source_mapper = googlefit_jsons_to_df(month_data, TIMEZONE)

We now have a dataframe with columns for every metric/source pair and rows for every minute. We also returned a dictionary to map each column name to the associated source. The reason we do that is that the full source name tends to be a very large string. See below:

In [5]:
print("The data source for step_count.delta.01 is {}".format(source_mapper['step_count.delta.01']))
print("The data source for step_count.delta.02 is {}".format(source_mapper['step_count.delta.02']))
The data source for step_count.delta.01 is derived:com.google.step_count.delta:com.google.android.gms:estimated_steps
The data source for step_count.delta.02 is derived:com.google.step_count.delta:com.google.android.gms:merge_step_deltas

Below is what the dataframe looks like. We have four distinct metrics, active_minutes, calories.expended, distance.delta and step_count.delta, each with multiple data sources.

In [6]:
df.head()
Out[6]:
active_minutes.01 active_minutes.02 active_minutes.03 calories.expended.01 calories.expended.02 calories.expended.03 calories.expended.04 distance.delta.01 distance.delta.02 distance.delta.03 distance.delta.04 distance.delta.05 distance.delta.06 step_count.delta.01 step_count.delta.02
time
2018-12-05 00:00:00+00:00 0.0 0.0 0.0 0.0 0.966667 0.966667 0.966667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-12-05 00:01:00+00:00 0.0 0.0 0.0 0.0 0.966667 0.966667 0.966667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-12-05 00:02:00+00:00 0.0 0.0 0.0 0.0 0.966667 0.966667 0.966667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-12-05 00:03:00+00:00 0.0 0.0 0.0 0.0 0.966667 0.966667 0.966667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2018-12-05 00:04:00+00:00 0.0 0.0 0.0 0.0 0.966667 0.966667 0.966667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Step counts

We can do a simple aggregation to get the total steps per day and plot it; the Google Fit app itself doesn't show these exact numbers.

In [7]:
df['step_count.delta.01'].resample('D').sum()
Out[7]:
time
2018-12-05 00:00:00+00:00     6007.0
2018-12-06 00:00:00+00:00     7615.0
2018-12-07 00:00:00+00:00     3634.0
2018-12-08 00:00:00+00:00    13357.0
2018-12-09 00:00:00+00:00     2895.0
2018-12-10 00:00:00+00:00    16589.0
2018-12-11 00:00:00+00:00    11628.0
2018-12-12 00:00:00+00:00     2071.0
2018-12-13 00:00:00+00:00     8138.0
2018-12-14 00:00:00+00:00      546.0
2018-12-15 00:00:00+00:00     1691.0
2018-12-16 00:00:00+00:00     6314.0
2018-12-17 00:00:00+00:00      132.0
2018-12-18 00:00:00+00:00      193.0
2018-12-19 00:00:00+00:00     8408.0
2018-12-20 00:00:00+00:00    17113.0
2018-12-21 00:00:00+00:00     6871.0
2018-12-22 00:00:00+00:00    27658.0
2018-12-23 00:00:00+00:00     5134.0
2018-12-24 00:00:00+00:00      380.0
2018-12-25 00:00:00+00:00      291.0
2018-12-26 00:00:00+00:00    12685.0
2018-12-27 00:00:00+00:00       58.0
2018-12-28 00:00:00+00:00     2506.0
2018-12-29 00:00:00+00:00      144.0
2018-12-30 00:00:00+00:00      113.0
2018-12-31 00:00:00+00:00     3024.0
2019-01-01 00:00:00+00:00     5664.0
2019-01-02 00:00:00+00:00        1.0
Freq: D, Name: step_count.delta.01, dtype: float64
In [14]:
ax = df[['step_count.delta.01', 'step_count.delta.02']].resample('D').sum().plot()
ax.grid(True, which='minor', axis='x')
ax.grid(True, which='major', axis='x')

Distance plots

We can compare the distance from two different sources. In my dataset (yours may differ) distance.delta.01 uses the GPS and distance.delta.02 is derived from the step counter (which relies on the phone's motion sensors). It is expected that these two sources produce different results; in fact, a more accurate distance estimate could be computed with a Kalman filter that combines both data sources.

To check out the exact sources of your distance.delta columns, look into the source_mapper dictionary.

In [15]:
ax = df[['distance.delta.01', 'distance.delta.02']].resample('D').sum().plot()
ax.grid(True, which='minor', axis='x')
ax.grid(True, which='major', axis='x')
print("The data source for distance.delta.01 is {}".format(source_mapper['distance.delta.01']))
print("The data source for distance.delta.02 is {}".format(source_mapper['distance.delta.02']))
The data source for distance.delta.01 is derived:com.google.distance.delta:com.google.android.gms:from_high_accuracy_location<-derived:com.google.location.sample:com.google.android.gms:merge_high_fidelity
The data source for distance.delta.02 is derived:com.google.distance.delta:com.google.android.gms:from_steps<-merge_step_deltas
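To give a flavour of the idea mentioned above, here is a deliberately simplified blend of two daily distance series: a fixed-weight combination (a crude stand-in for a proper Kalman filter, which would adapt the weights from estimated noise). The series values and the 0.7 weight are arbitrary assumptions for illustration; in the notebook you would use the daily-resampled distance.delta.01 and distance.delta.02 columns.

```python
import pandas as pd

# Hypothetical daily distances in metres from a GPS-based and a
# step-derived source (illustrative numbers, not real data).
gps = pd.Series([1200.0, 3400.0, 0.0], name="gps")
steps = pd.Series([1000.0, 3000.0, 500.0], name="steps")

W_GPS = 0.7  # arbitrary weight: trust GPS slightly more
blended = W_GPS * gps + (1 - W_GPS) * steps
print(blended.tolist())
```

A real Kalman filter would replace the fixed W_GPS with a gain updated at each step from the variances of the two measurements, so days where GPS reports nothing (like the third value here) would lean more heavily on the step-derived estimate.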

Now we're going to ask some more specific questions of the data, and answer them by using custom pandas functions.

Are you more active on weekdays or on weekends?

In [10]:
weekdays = df[df.index.dayofweek < 5]
weekends = df[df.index.dayofweek >= 5]
avg_weekdays = weekdays['active_minutes.02'].resample('D').sum().mean()
avg_weekends = weekends['active_minutes.02'].resample('D').sum().mean()
print("Average active minutes per day on weekdays: {}".format(int(avg_weekdays)))
print("Average active minutes per day on weekends: {}".format(int(avg_weekends)))
Average active minutes per day on weekdays: 63
Average active minutes per day on weekends: 83

High score days

In [11]:
print("Day with most steps walked ({}) is {}".format(
    int(df['step_count.delta.01'].resample('D').sum().max()),
    df['step_count.delta.01'].resample('D').sum().idxmax().strftime("%Y-%m-%d"),

))
print("Day with most active minutes ({}) is {}".format(
    int(df['active_minutes.02'].resample('D').sum().max()),
    df['active_minutes.02'].resample('D').sum().idxmax().strftime("%Y-%m-%d"),

))
Day with most steps walked (27658) is 2018-12-22
Day with most active minutes (318) is 2018-12-22
In [12]:
df['week_no_year'] = df.index.map(lambda x: str(x.date().isocalendar()[0]) + "_" + str(x.date().isocalendar()[1]))

print("Day with most steps per week")
print(df.groupby('week_no_year')['step_count.delta.01'].agg(
    lambda x: (x.resample('D').sum().idxmax().strftime("%a %d/%m"), int(x.resample('D').sum().max()))
     ).sort_index())
Day with most steps per week
week_no_year
2018_49    (Sat 08/12, 13357)
2018_50    (Mon 10/12, 16589)
2018_51    (Sat 22/12, 27658)
2018_52    (Wed 26/12, 12685)
2019_1      (Tue 01/01, 5664)
Name: step_count.delta.01, dtype: object

For how many days did you reach the step count goal?

Using the commonly cited goal of 10,000 steps per day.

In [13]:
GOAL = 10000
daily_steps = df['step_count.delta.01'].resample('D').sum()
print("Reached {} steps for {} days out of {}".format(GOAL, (daily_steps>GOAL).sum(), len(daily_steps)))
Reached 10000 steps for 6 days out of 29