explore_imputed_genotypes.ipynb
This notebook helps Open Humans members explore their genotype imputation results from the Imputer app.
If you'd like to learn how to explore you're imputed genotypes, please follow along in the cells below.
Plese see openimpute.com/about or the GitHub page for questions.
Imputation results were returned in vcf format, the next cell will download your imputed genotype file.
There are 4 columns which serve as a unique index for genomic position:
CHROM
, POS
, REF
, and ALT
To clean up the DataFrame
, drop some columns that aren't used:
ID
is a unique string, essentially a concatenated form of the unique index.
QUAL
is .
for every position, we did not store this data type.
FILTER
is also .
for every position, we did not store this data type.
The next cell will transform the remaining columns into a more usable format.
Specifically:
INFO
will be parsed and float value after INFO=
will be extracted.
FORMAT
will be dropped, it is used to inform the reader what is contained in the SAMPLE
field.
SAMPLE
will be split and parsed into two columns as indicated by FORMAT
-- GT
and GP
.
GT
is your genotype at that genomic position.
GP
is the probability output by the Imputer pipeline that the GT
is correct.
The INFO metric serves as a "confidence" that the genotype is called accurately. From the IMPUTE2 homepage:
Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).
Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
Let's take an inclusive approach and set a cutoff of 0.3
How many sites were considered to be imputed with high confidence?
Take a look at the possible values for the GT
field:
0/0
indicates you are homozygous reference at this position.
0/1
indicates you are heterozygous at this position.
1/1
indicates you are homozygous alternate at this position.
./.
indicates that the GP threshold of 0.6
was not met at this position, so no call was made.
The ./.
threshold was arbitrarily chosen.
The next cell will split the genotype probabilities into three columns:
GP_homref
for homozygous reference callsGP_het
for heterozygous callsGP_homalt
for homozygous alterante callsTake a closer look at the genotype probabilities, GP
, for the uncalled sites, they are very uncertain, with a maximum genotype probability of much lower than 1.0
.
Hopefully this notebook will help you explore your imputed genotype data, if you have any questions or suggestions please open an issue on the GitHub page.
If you'd like to learn how to explore you're imputed genotypes, please follow along in the cells below.
Plese see openimpute.com/about or the GitHub page for questions.
import os
import io
import bz2
import pandas as pd
import requests
from ohapi import api
import matplotlib.pyplot as plt
%matplotlib inline
user = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
Imputation results were returned in vcf format, the next cell will download your imputed genotype file.
for record in user['data']:
if record['basename'] == 'member.imputed.vcf.bz2':
print('Found your imputed genotype data in Open Humans!')
datafile = requests.get(record['download_url']).content
datafile = bz2.decompress(datafile)
break
# {1-22, 'X'}
chrom = '5'
vcf_array = []
for line in datafile.splitlines():
# this will stop searching the file if the chromsome has been read.
if len(vcf_array) > 0 and not line.startswith(chrom.encode('utf-8')):
print('Finished loading chromosome {}, stopping early'.format(chrom))
break
# add the genetic variants from the user-defined chrom
if line.startswith(chrom.encode('utf-8')):
vcf_array.append(line.decode('utf-8'))
del datafile
df = pd.read_csv(io.StringIO('\n'.join(vcf_array)),
sep='\t',
comment='#',
header=None,
names=[
'CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'SAMPLE'
])
df.set_index(['CHROM', 'POS', 'REF', 'ALT'], inplace=True)
There are 4 columns which serve as a unique index for genomic position:
CHROM
, POS
, REF
, and ALT
To clean up the DataFrame
, drop some columns that aren't used:
ID
is a unique string, essentially a concatenated form of the unique index.
QUAL
is .
for every position, we did not store this data type.
FILTER
is also .
for every position, we did not store this data type.
df.head()
df.drop(['ID', 'QUAL', 'FILTER'], axis=1, inplace=True)
df.head()
The next cell will transform the remaining columns into a more usable format.
Specifically:
INFO
will be parsed and float value after INFO=
will be extracted.
FORMAT
will be dropped, it is used to inform the reader what is contained in the SAMPLE
field.
SAMPLE
will be split and parsed into two columns as indicated by FORMAT
-- GT
and GP
.
GT
is your genotype at that genomic position.
GP
is the probability output by the Imputer pipeline that the GT
is correct.
info = df['INFO'].str.split(';', expand=True)
df['INFO'] = info[1].str.split('=', expand=True)[1].astype(float)
del info
df.drop(['FORMAT'], axis=1, inplace=True)
sample = df['SAMPLE'].str.split(':', expand=True)
df['GT'] = sample[0]
df['GP'] = sample[1]
df.drop(['SAMPLE'], inplace=True, axis=1)
del sample
df.head()
The INFO metric serves as a "confidence" that the genotype is called accurately. From the IMPUTE2 homepage:
Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).
Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
Let's take an inclusive approach and set a cutoff of 0.3
How many sites were considered to be imputed with high confidence?
n_snps = df.shape[0]
n_hiconf_snps = df.loc[df['INFO']>=0.3].shape[0]
print('All snps: {}'.format(n_snps))
print('High confidence snps: {}'.format(n_hiconf_snps))
print('% hi conf snps: {0:.2f}%'.format((n_hiconf_snps / n_snps)*100))
df['INFO'].hist();
Take a look at the possible values for the GT
field:
0/0
indicates you are homozygous reference at this position.
0/1
indicates you are heterozygous at this position.
1/1
indicates you are homozygous alternate at this position.
./.
indicates that the GP threshold of 0.6
was not met at this position, so no call was made.
The ./.
threshold was arbitrarily chosen.
df['GT'].value_counts()
The next cell will split the genotype probabilities into three columns:
GP_homref
for homozygous reference callsGP_het
for heterozygous callsGP_homalt
for homozygous alterante callsgp = df['GP'].str.split(',', expand=True)
df['GP_homref'] = gp[0].astype(float)
df['GP_het'] = gp[1].astype(float)
df['GP_homalt'] = gp[2].astype(float)
del gp
df.drop(['GP'], axis=1, inplace=True)
df.head()
Take a closer look at the genotype probabilities, GP
, for the uncalled sites, they are very uncertain, with a maximum genotype probability of much lower than 1.0
.
df.loc[df['GT'] == './.'][['GP_homref', 'GP_het', 'GP_homalt']].astype(float).max(axis=1).head()
Hopefully this notebook will help you explore your imputed genotype data, if you have any questions or suggestions please open an issue on the GitHub page.