Details for Github Exploration v.2.ipynb

Published by carolinux

Description

Explore your GitHub data!

Tags & Data Sources

github programming programmer commits github

Notebook
Last updated 4 weeks, 1 day ago

In [47]:
import sys
!{sys.executable} -m pip install wordcloud==1.5.0
!{sys.executable} -m pip install calmap
Requirement already satisfied: wordcloud==1.5.0 in /opt/conda/lib/python3.6/site-packages
Requirement already satisfied: pillow in /opt/conda/lib/python3.6/site-packages (from wordcloud==1.5.0)
Requirement already satisfied: numpy>=1.6.1 in /opt/conda/lib/python3.6/site-packages (from wordcloud==1.5.0)
You are using pip version 9.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting calmap
  Downloading https://files.pythonhosted.org/packages/60/7a/3340f348c4826fad190a265290ade1b7fbfbb311c84e27d82fb43e12d579/calmap-0.0.7-py2.py3-none-any.whl
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from calmap)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from calmap)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.6/site-packages (from calmap)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas->calmap)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas->calmap)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib->calmap)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages/cycler-0.10.0-py3.6.egg (from matplotlib->calmap)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib->calmap)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->calmap)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib->calmap)
Installing collected packages: calmap
Successfully installed calmap-0.0.7
You are using pip version 9.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [19]:
from datetime import datetime
import json
import numpy as np
import pandas as pd
import os
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import pickle
import requests
import tempfile
import urllib.request
import ohapi

import matplotlib.pyplot as plt
%matplotlib inline

Loading data from Open Humans

In [20]:
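# Load the commits from the Open Humans API
token = os.environ.get('OH_ACCESS_TOKEN')
user = ohapi.api.exchange_oauth2_member(token)

# Use the first dataset (lowest id) tagged with both 'Github' and 'commits'
commit_data = None
for dset in sorted(user['data'], key=lambda x: x['id']):
    tags = dset['metadata']['tags']
    if 'Github' in tags and 'commits' in tags:
        raw_data = requests.get(dset['download_url']).content
        commit_data = json.loads(raw_data.decode("utf-8"))
        break

if commit_data is None:
    raise RuntimeError("No dataset tagged 'Github' and 'commits' was found")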

messages = []
timestamps = []
repos = []
for repo in commit_data['repo_data']:
    repo_data = commit_data['repo_data'][repo]
    for commit in repo_data['commits']:
        messages.append(commit['commit']['message'].lower())
        timestamps.append(datetime.strptime(commit['commit']['committer']['date'], '%Y-%m-%dT%H:%M:%SZ'))
        repos.append(repo)
                        
# Turn into a neat dataframe
df = pd.DataFrame({'repo': repos, 'message': messages, 'datetime': timestamps},
                  columns=['repo', 'message', 'datetime'])
df.head()
Out[20]:
repo message datetime
0 crowdsense/bonobo_trans added readme with my recommendations 2019-05-02 15:43:51
1 crowdsense/bonobo_trans create a working example with db input, and a ... 2019-05-02 15:31:55
2 carolinux/resiroop-cms-service put status ok/failed in config and use consist... 2018-04-03 15:15:16
3 carolinux/resiroop-cms-service add category toppings to dev s3 fake 2018-04-03 11:30:03
4 carolinux/resiroop-cms-service adjust return format of import/category_toppin... 2018-04-03 11:23:53
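
The nested structure that the loading cell walks through can be tried out without an Open Humans token. The following is a minimal sketch on a hand-made dictionary: the key names mirror the ones read above, but the repository names, messages and dates are invented for illustration, and `flatten_commits` is a hypothetical helper, not part of `ohapi`.

```python
from datetime import datetime

# Hypothetical sample mirroring the keys the loading cell reads; all values are invented.
sample = {
    "repo_data": {
        "example/repo-a": {
            "commits": [
                {"commit": {"message": "Add README",
                            "committer": {"date": "2019-05-02T15:43:51Z"}}},
                {"commit": {"message": "Fix typo",
                            "committer": {"date": "2019-05-02T15:50:00Z"}}},
            ]
        },
        "example/repo-b": {"commits": []},
    }
}


def flatten_commits(commit_data):
    """Return one (repo, lowercased message, datetime) tuple per commit."""
    rows = []
    for repo, repo_data in commit_data["repo_data"].items():
        for commit in repo_data["commits"]:
            info = commit["commit"]
            when = datetime.strptime(info["committer"]["date"], "%Y-%m-%dT%H:%M:%SZ")
            rows.append((repo, info["message"].lower(), when))
    return rows


rows = flatten_commits(sample)
for row in rows:
    print(row)
```

Repositories without commits simply contribute no rows, which matches the behaviour of the nested loops in the cell above.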

Commit Cloud

Visualize the most commonly used words in your commit messages

In [21]:
mask_url = 'http://oh-github.herokuapp.com/static/github.png'
mask_path = tempfile.NamedTemporaryFile(delete=False).name
urllib.request.urlretrieve(mask_url, mask_path)


mask = np.array(Image.open(mask_path))

# Transform your mask into a new one that will work with the function:
# fully opaque pixels (alpha == 255) map to 255, all other pixels to 0.
# Assumes the image has an alpha channel (RGBA).
transformed_mask = np.where(mask[:, :, 3] == 255, 255, 0).astype(np.int32)

# Create and generate a word cloud image:
text = ' '.join(df.message.values)
wordcloud = WordCloud(mask=transformed_mask, background_color='white',
                      width=len(mask[0]), height=len(mask)).generate(text)

# Display the generated image:
plt.rcParams['figure.figsize'] = [20, 10]
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
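
To see the numbers behind the cloud, word frequencies can also be counted directly. This is a rough sketch using only the standard library; the sample messages and the tiny stopword set are made up for illustration, and the tokenisation is a simplifying assumption that will not match WordCloud's own preprocessing exactly.

```python
from collections import Counter
import re

# Hypothetical commit messages and stopword list, for illustration only.
sample_messages = ["fix tests", "add readme", "fix the build", "update readme and tests"]
stopwords = {"the", "and", "a", "to"}


def top_words(messages, stopwords, n=3):
    """Count lowercase word occurrences across messages, skipping stopwords."""
    counts = Counter()
    for message in messages:
        for word in re.findall(r"[a-z0-9']+", message.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(n)


top = top_words(sample_messages, stopwords)
print(top)
```

With the real data, `top_words(df.message.values, STOPWORDS, n=20)` would be one way to list the most frequent words as text.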

GitHub activity over the years

See the commit activity per repo per year. A trip down memory lane!

In [22]:
import random
import matplotlib
from matplotlib import cm

df['date'] = df.datetime.apply(lambda x: x.date())
df['year'] = df.datetime.apply(lambda x: x.year)

MAX_NUM_REPOS = 7 # maximum number of repos to show in a plot


for year in sorted(df.year.unique()):
    
    df_curr = df[df.year==year]
    df_curr = df_curr.sample(frac=1)

    fig, ax = plt.subplots()

    repos = 0
    have_data = False
    # sort=False keeps the shuffled order, so a different set of repos can be picked on each run
    grouped_data = df_curr.groupby('repo', sort=False)
    
    for (key, grp) in grouped_data:

        df_temp = grp.groupby("date")['message'].count().reset_index(name='num_commits')
        if df_temp.num_commits.sum() < 10:
            continue
        have_data = True
        ax.plot_date( x=df_temp['date'].values, y=df_temp['num_commits'].values,
                     label=key, markersize=14)
        repos+=1
        if repos == MAX_NUM_REPOS:
            break

    if not have_data:
        plt.close(fig)  # discard the empty figure for years without enough commits
        continue
    plt.xlabel("Time")
    plt.ylabel("Number of commits")
    plt.title("Repo activity in {}".format(year))
    plt.legend(loc='best')
    plt.show()

Busiest times

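Check the busiest days of the week and times of day. If a repo is busiest on the weekend, could it be a side project? The timestamps are parsed as UTC, so the parts of day below are not adjusted to your local time zone.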

In [23]:
def get_part_of_day(hour):
    return (
        "morning" if 6 <= hour <= 11
        else
        "afternoon" if 12 <= hour <= 17
        else
        "evening" if 18 <= hour <= 22
        else
        "night"
    )


df['day_of_week'] = df.datetime.apply(lambda x: datetime.strftime(x, '%A'))
df['hour_of_day'] = df.datetime.apply(lambda x: x.hour)
df['part_of_day'] = df.hour_of_day.apply(get_part_of_day)
df['day_and_hour'] = df.day_of_week + " " + df.part_of_day
print("Busiest days")
print(df['day_of_week'].value_counts())
print('\n')
print("Busiest times")
print(df['part_of_day'].value_counts())
print('\n')
print("Busiest days + times")
print(df['day_and_hour'].value_counts())
print('\n')
Busiest days
Friday       357
Thursday     340
Wednesday    270
Tuesday      241
Monday       218
Saturday     134
Sunday       120
Name: day_of_week, dtype: int64


Busiest times
afternoon    890
morning      384
evening      296
night        110
Name: part_of_day, dtype: int64


Busiest days + times
Thursday afternoon     184
Friday afternoon       182
Wednesday afternoon    152
Monday afternoon       124
Tuesday afternoon      122
Friday morning          83
Thursday morning        74
Friday evening          70
Monday morning          67
Tuesday morning         65
Saturday afternoon      63
Sunday afternoon        63
Wednesday evening       56
Thursday evening        49
Wednesday morning       48
Tuesday evening         35
Thursday night          33
Saturday evening        32
Saturday morning        31
Sunday evening          30
Monday evening          24
Friday night            22
Tuesday night           19
Sunday morning          16
Wednesday night         14
Sunday night            11
Saturday night           8
Monday night             3
Name: day_and_hour, dtype: int64


In [24]:
print("Side Project detection")
repos_with_dow = df.groupby('repo')['day_of_week'].agg(lambda x: (x.value_counts().index[0], len(x)))

for repo, (day, commit_count) in repos_with_dow.items():
    if commit_count >= 5 and day in ("Saturday", "Sunday"):
        print("Project {} has most commits on {}s.".format(repo, day))
Side Project detection
Project carolinux/Pattern2Scala has most commits on Saturdays.
Project carolinux/articles has most commits on Saturdays.
Project carolinux/flask_ansible has most commits on Saturdays.
Project carolinux/fractals has most commits on Sundays.
Project carolinux/grid_helper has most commits on Saturdays.
Project carolinux/londonwald has most commits on Saturdays.
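
The check above only looks at the single most frequent weekday. A possible variation, sketched here with the standard library, is to compute the share of weekend commits per repo instead. The repository names and timestamps are invented, and `weekend_share` is a hypothetical helper rather than something the notebook defines.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (repo, timestamp) pairs, for illustration only.
sample_commits = [
    ("example/side-project", datetime(2019, 5, 4, 10, 0)),   # Saturday
    ("example/side-project", datetime(2019, 5, 5, 16, 30)),  # Sunday
    ("example/side-project", datetime(2019, 5, 6, 21, 0)),   # Monday
    ("example/work", datetime(2019, 5, 2, 9, 0)),            # Thursday
    ("example/work", datetime(2019, 5, 3, 14, 0)),           # Friday
]


def weekend_share(commits):
    """Return the fraction of commits made on Saturday or Sunday, per repo."""
    totals = defaultdict(int)
    weekend = defaultdict(int)
    for repo, when in commits:
        totals[repo] += 1
        if when.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
            weekend[repo] += 1
    return {repo: weekend[repo] / totals[repo] for repo in totals}


shares = weekend_share(sample_commits)
for repo, share in shares.items():
    print("{}: {:.0%} of commits on weekends".format(repo, share))
```

A share is less sensitive to ties between weekdays than the modal day, but it still needs a minimum commit count to be meaningful.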

Number of Commits per day

In [45]:
commits_per_day = df.resample('D', on='datetime').message.count()
plt.plot(commits_per_day.index, commits_per_day.values)
plt.xlabel("Time")
plt.ylabel("Number of commits")
plt.title("Commit activity")
plt.show()

Yearly activity plot

A re-implementation of the official GitHub commit calendar, using the calmap library.

In [54]:
import calmap

calmap.yearplot(commits_per_day, year=2018)
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe620b4d630>

Estimation of time spent coding

Approximate the time spent coding per repo by summing the gaps between consecutive commit timestamps. This tends to underestimate the total, since the time between starting to code and the first commit of a session is not captured.

In [25]:
from datetime import timedelta
df_sorted = df.sort_values(['repo', 'datetime'])
df_sorted2 = df_sorted.shift(-1)
df_sorted['next_datetime'] = df_sorted2.datetime
df_sorted['next_repo'] = df_sorted2.repo
df_sorted['next_part_of_day'] = df_sorted2.part_of_day
# Set this to a value that makes sense with how often you commit
MAX_TIME_BETWEEN_COMMITS = timedelta(hours=8)


def determine_duration(dt1, dt2, repo1, repo2, pod1, pod2):
    if repo1 != repo2 or (pod1 != "morning" and pod2 == "morning") or dt2 - dt1 > MAX_TIME_BETWEEN_COMMITS:
        # this means commit at dt1 was the last one for the time period
        return timedelta(0)
    return dt2 - dt1


df_sorted['duration'] = df_sorted.apply(lambda x: 
                                        determine_duration(x['datetime'], x['next_datetime'],
                                                           x['repo'], x['next_repo'],
                                                           x['part_of_day'], x['next_part_of_day']), axis=1)

df_sorted.groupby('repo')['duration'].sum().sort_values(ascending=False).head(20)
Out[25]:
repo
carolinux/resiroop-shop               13 days 15:39:16
anitagraser/TimeManager                8 days 11:35:40
OpenHumans/oh-googlefit-integration    1 days 11:28:51
carolinux/mosaic                       1 days 08:25:24
carolinux/resiroop-cms-service         1 days 06:31:43
opengisch/qgis_excel_sync              1 days 05:12:59
carolinux/dotfiles                     1 days 03:50:08
carolinux/TimeManager                  1 days 01:29:27
carolinux/QGIS                         0 days 20:58:00
carolinux/resiroop-infra               0 days 20:45:50
OpenHumans/oh-github-source            0 days 20:33:37
carolinux/carolinux.github.com         0 days 16:35:01
carolinux/opencv_experiments           0 days 11:11:24
carolinux/Subs.py                      0 days 09:17:40
carolinux/shpsync                      0 days 08:05:37
carolinux/grid_helper                  0 days 07:58:30
carolinux/cv                           0 days 07:29:54
carolinux/londonwald                   0 days 07:08:01
carolinux/flask_ansible                0 days 06:47:44
carolinux/forestfires_dw               0 days 05:55:49
Name: duration, dtype: timedelta64[ns]
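
The core of the estimate can be illustrated on a handful of timestamps. This is a simplified sketch for a single repo: it keeps only the maximum-gap rule and leaves out the part-of-day rule used in the cell above. The timestamps are invented, and `estimate_coding_time` is a hypothetical helper.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=8)  # same cutoff as MAX_TIME_BETWEEN_COMMITS above

# Hypothetical commit times for one repo, for illustration only.
sample_times = [
    datetime(2019, 5, 2, 9, 0),
    datetime(2019, 5, 2, 9, 45),
    datetime(2019, 5, 2, 11, 0),
    datetime(2019, 5, 3, 10, 0),   # 23 hours later: starts a new session
    datetime(2019, 5, 3, 10, 30),
]


def estimate_coding_time(timestamps, max_gap=MAX_GAP):
    """Sum the gaps between consecutive commits, ignoring gaps longer than max_gap."""
    ordered = sorted(timestamps)
    total = timedelta(0)
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        if gap <= max_gap:
            total += gap
    return total


estimated = estimate_coding_time(sample_times)
print(estimated)
```

Here the 23-hour gap is dropped, so only the gaps inside the two sessions are counted.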

'I-fix-it' commits

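Commits made within a short time of another commit are likely attempts to fix something that broke. Note that the cell below prints the first commit of each such pair, i.e. the one that was quickly followed by another commit in the same repo.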

In [41]:
SHORT_TIME_BETWEEN_COMMITS = timedelta(minutes=5)

desperate_messages = df_sorted[(df_sorted.duration<SHORT_TIME_BETWEEN_COMMITS) & (df_sorted.duration>timedelta(0))].message
print('\n'.join(list(desperate_messages)))
changes
get skipped
add readme
add result files
results with removed images
add deploy commands to heroku
modify delete_data command
initial commit with working oauth2 and celery setup
some fixes to merge
remove test folder
queue to retry if exception
add retry logic
add warning for when rare state is reached where refresh token is broken
tdd part 2: compress json result from googlefit api
remove empty buckets to save space
cosmetic changes to dashboard frontend
add on_heroku env var and secure_ssl_redirect setting
add note about actually having the app installed
ux: add link to exploratory notebook
create default channels via docker
initial commit
added translation files to project file
renamed chinese translation, added instructions for contributing translations, translation tests
added version check for qt for translations
merge branch 'master' of https://github.com/anitagraser/timemanager
added ctrl+space shortcut to focus on time slider
merge branch 'master' of https://github.com/anitagraser/timemanager
fixed failing build
#99 fix attempt for label options
merge branch 'master' of https://github.com/anitagraser/timemanager
added ui for adding interpolation-enabled layers
#113 bug fixing for archaelogical data, better exception reporting
move away add layer dialog functionality from main gui class
fixed small raster error
#130 address
improve animation support
fix for #139
merge branch 'master' of https://github.com/anitagraser/timemanager
in progress pep8
reverted changes that made icons disappear
prepare 2.2
fix reappearance of unregister function bug
added option to clear previous frame files
#158 make nice message.. nicer
make video.sh work generally
save label format in project settings
prepare 2.2.3
fix video export error and release 2.2.4
#165 disable pep8 checking on travis
merge pull request #169 from rduivenvoorde/arch-dialog-cleanup

make the arch-dialog resizable and more readable
revert "possible fix for #188, also implements #167? (#189)"

this reverts commit f8fb07218a9f0e577c6095ce1d6ac2ff98566d5d.
merge pull request #210 from rduivenvoorde/minor_fix

minor fixes
try with 3.2
stop using relative imports
move dialogs to dialogs module
made mergeable
made the index.py have a main function
now ui appears properly
made python initialization less verbose
edited gui.py to be python 2.7 + added some comments for exception handling in teh future //minor edit
edited gui.py via github
merge pull request #2 from zorbash/patch-1

edited gui.py via github
added file parsing exceptions
merge branch 'master' of github.com:carolinux/subs.py
merge branch 'master' of github.com:carolinux/subs.py
added translation files to project file
renamed chinese translation, added instructions for contributing translations, translation tests
added version check for qt for translations
merge branch 'master' of https://github.com/anitagraser/timemanager
added ctrl+space shortcut to focus on time slider
merge branch 'master' of https://github.com/anitagraser/timemanager
create readme.md
merge pull request #1 from carolinux/gis

first good draft for review
run locally via docker
add my data science articles
add stub for drawing workshop
css fiddling
now can have a default and a slow queue
fix: the default queue name is 'celery' and not 'default'
merge branch 'master' of https://github.com/carolinux/-celery_examples
add decorator mini showcase
reshuffling cv :)
added siroop work experience. bit the bullet and now have 2-page cv
first commit
first commit
first commit
added bashrc & vimrc
more fixes to excel writing
using non memory join
fixes filewatcher mistake
removes bug while adding and deleting a feature at the same edit session
fixes #6 : adds checkbox to suppress form
prepare v0.6
use execute many to speed up
make target table name customizable
add process fire script
improve readme
add download link txts for all years
can now generate koch snowflake
move images to dedicated folder
add settings.json and parsing for easier directory navigation
adds bokeh support
konan fix (?)
restored hubot scripts to normal -ie only loading scripts not present under ./scripts
added reminder script
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
reminder work
initial commit
update readme.md
update readme.md
added code
added csv db liek capability
updated tests
now simple red object detection works
fixes hsv bug, allows multiple markers
update readme.md
add function to parse products up csv
fix terraform from previous commit
revert fixing lambda"

this reverts commit 349b5c251f6c1f707d79740b928399e00bd2cd0b.
revert "change name of json file to store brand link info"

this reverts commit 376beff91c7aee6a5371933a6240b44bbaf10a59.
temporary fix for the very few malformed csv lines
re-enable the cwl subcription to the data science destination
add user internal notification message
use discount_percentage as a float in test
do not flood the logs with missing translation key warnings that nobody is looking at
allow checkout without warning (and without voucher application) in case of unreached min amount voucher
call pdp recommendation service from pdp
voucher service mountebank fake now expects authorization headers
track filter open/close as a toggle event in search result pages
update/improve readme after running the shop locally
apply voucher to microdata
integrate show more button from kvasir into the shop
revert "functional test for show more button"

this reverts commit 4abf3edfab80ff7beb0da2af5e3270741fa37940.
fix test
switch prod_adlabel and prod_adspecial
sync kvasir changes for displaying price drop in cart
fix bug for price_with_discount=0 and create functions for all the different prices in cart product instead of inlining
add energy efficiency to search and l3
fixing all the tests
revert functional_test ci changes
fix tests
fix pct issue
fix response.url
more fixes to excel writing
using non memory join
initial commit
code v1
added more statistics.. yep, this algo seems to the proposal makers
create default channels via docker
create default channels via docker
more fixes to excel writing
using non memory join
fixes filewatcher mistake
removes bug while adding and deleting a feature at the same edit session
fixes #6 : adds checkbox to suppress form
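
As a variation on the cell above, the following sketch reports the follow-up commit of each quick pair (the likely fix) rather than the commit that preceded it. It uses only the standard library; the sample history is invented and `quick_follow_ups` is a hypothetical helper.

```python
from datetime import datetime, timedelta

SHORT_GAP = timedelta(minutes=5)  # same threshold as SHORT_TIME_BETWEEN_COMMITS above

# Hypothetical (repo, timestamp, message) tuples, for illustration only.
sample_history = [
    ("example/repo", datetime(2019, 5, 2, 15, 0), "add export feature"),
    ("example/repo", datetime(2019, 5, 2, 15, 3), "fix export"),
    ("example/repo", datetime(2019, 5, 2, 17, 0), "update readme"),
    ("example/other", datetime(2019, 5, 2, 17, 1), "initial commit"),
]


def quick_follow_ups(commits, short_gap=SHORT_GAP):
    """Return messages of commits made within short_gap of the previous commit in the same repo."""
    ordered = sorted(commits, key=lambda c: (c[0], c[1]))
    flagged = []
    for previous, current in zip(ordered, ordered[1:]):
        same_repo = previous[0] == current[0]
        gap = current[1] - previous[1]
        if same_repo and timedelta(0) < gap < short_gap:
            flagged.append(current[2])
    return flagged


follow_ups = quick_follow_ups(sample_history)
print(follow_ups)
```

Sorting by repo and then by time keeps pairs from different repos apart, which plays the same role as the `repo1 != repo2` check in `determine_duration`.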