Details for snps tutorial.ipynb

Published by gedankenstuecke (This notebook is an edited copy of arvkevi's notebook)

Description

See how the snps package can be used to merge two of your genotyping files which are on Open Humans

0

Tags & Data Sources

genetic data snps 23andMe Upload AncestryDNA Upload FamilyTreeDNA integration Gencove Imputer Genome/Exome Upload

Other Notebook Versions

Comments

Please log in to comment.

Notebook
Last updated 3 months ago

snps

A tutorial using Open Humans data

This tutorial will walk through how to use the snps module. The last few cells are included to allow Open Humans members to work with their own data.

In [1]:
! pip install snps
Requirement already satisfied: snps in /opt/conda/lib/python3.6/site-packages
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: atomicwrites in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: PyVCF in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from PyVCF->snps)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->snps)
You are using pip version 9.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [2]:
import ipywidgets as widgets

import bz2
import gzip
import os
import requests
import tempfile

from io import StringIO
from ipywidgets import Layout
from ohapi import api
from snps import SNPs

Download Example Data

Let's download some example data from openSNP:

In [3]:
from snps.resources import Resources
r = Resources()
In [4]:
paths = r.download_example_datasets()

Load Raw Data

Load a 23andMe raw data file:

The loaded SNPs are available via a pandas.DataFrame:

In [5]:
s = SNPs('resources/662.23andme.340.txt.gz')
In [6]:
df = s.snps
df.columns.values
Out[6]:
array(['chrom', 'pos', 'genotype'], dtype=object)
In [7]:
df.index.name
Out[7]:
'rsid'
In [8]:
len(df)
Out[8]:
991786

snps also attempts to detect the build / assembly of the data:

In [9]:
s.build
Out[9]:
37
In [10]:
s.build_detected
Out[10]:
True
In [11]:
s.assembly
Out[11]:
'GRCh37'

Remap SNPs

Let's remap the SNPs to change the assembly / build:

In [12]:
# find the position of a named snp
s.snps.loc["rs3094315"].pos
Out[12]:
752566
In [13]:
chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
In [14]:
s.build
Out[14]:
38
In [15]:
s.assembly
Out[15]:
'GRCh38'
In [16]:
s.snps.loc["rs3094315"].pos
Out[16]:
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).

Merge Raw Data Files

The dataset consists of raw data files from two different DNA testing companies. Let's combine these files using a SNPsCollection.

In [17]:
from snps import SNPsCollection
sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
Loading resources/662.ftdna-illumina.341.csv.gz
In [18]:
sc.build
Out[18]:
36
In [19]:
chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
In [20]:
sc.snp_count
Out[20]:
708092

As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)

In [21]:
sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
In [22]:
len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
Out[22]:
169
In [23]:
sc.snp_count
Out[23]:
1006960

Save SNPs

Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M+ SNPs to a CSV file:

In [24]:
saved_snps = sc.save_snps()
Saving output/User662_GRCh37.csv

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

In [25]:
r = Resources(resources_dir='ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/')
In [26]:
saved_snps = sc.save_snps(vcf=True)
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All output files are saved to the output directory.

Load Your Open Humans Data & Merge files

If you have more than one genotyping or VCF file in your Open Humans account you can use the snps package for this too!

Merging your own data

We'll give this a try. Select two different genotyping files below through the dropdown menus:

In [27]:
user = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))

# 128 - 23andMe Upload
# 129 - AncestryDNA Upload
# 120 - FamilyTreeDNA Upload
# 40 - Gencove Upload
# 131 - Genome/Exome Upload
# 92 - Imputer

sources = [
    'direct-sharing-128',
    'direct-sharing-129',
    'direct-sharing-120',
    'direct-sharing-40',
    'direct-sharing-131',
    'direct-sharing-92',
]

available_sources = []

for record in user['data']:
    if record['source'] in sources:
        description = f"{record['basename']}--{record['metadata']['description']}--{record['id']}"
        download_url = record['download_url']
        available_sources.append((description, download_url))

        
selected_file = widgets.Dropdown(
                            options=available_sources,
                            description='Select the first Genotype File:',
                            disabled=False,
    layout=Layout(width='80%', height='50px'),
    style = {'description_width': 'initial'}
)
display(selected_file)

selected_file_two = widgets.Dropdown(
                            options=available_sources,
                            description='Select the second Genotype File:',
                            disabled=False,
    layout=Layout(width='80%', height='50px'),
    style = {'description_width': 'initial'}
)
display(selected_file_two)

def process_selected_file(selected_file_url):
    datafile = requests.get(selected_file_url)
    try:
        try:
            try:
                return bz2.decompress(datafile.content)
                #handle.write(textobj)
            except OSError:
                return gzip.decompress(datafile.content)
                #handle.write(textobj)
        except OSError:
            return datafile.content
    except Exception as e:
        raise e
        return None

Now we'll load those two files and put them into SNP objects to then see how many SNPs are in each.

In [28]:
processed_file = process_selected_file(selected_file.value)
processed_file_two = process_selected_file(selected_file_two.value)
tmp = tempfile.NamedTemporaryFile()
tmp.write(processed_file);
tmp_two = tempfile.NamedTemporaryFile()
tmp_two.write(processed_file_two)
first_snp_file = SNPs(tmp.name)
second_snp_file = SNPs(tmp_two.name)
# close the temporary file (in Open Humans notebook only)

Number of SNPs per file

Let's have a look how many SNPs are in each of our two genotyping files we've selected above:

In [29]:
print("The first file has {} SNPs".format(first_snp_file.snp_count))
print("The second file has {} SNPs".format(second_snp_file.snp_count))
The first file has 638469 SNPs
The second file has 960614 SNPs

Now we can go ahead and try to merge those two files into a single SNP collection and see what happens:

In [31]:
sc = SNPsCollection(tmp.name, name="me")
sc.load_snps([tmp_two.name], discrepant_genotypes_threshold=300)
print('Between those two files there are {} discrepancies'.format(len(sc.discrepant_snps)))
Loading ../../tmp/tmpn835mwmj
Loading ../../tmp/tmp8pwdqvcc
24 SNP positions were discrepant; keeping original positions
9 SNP genotypes were discrepant; marking those as null
Between those two files there are 33 discrepancies

Let us now look at the discrepant SNPs that we found between our files:

In [32]:
print(sc.discrepant_snps)
           chrom chrom_added genotype genotype_added        pos  pos_added
rsid                                                                      
i4000417       1           1       DD             DD  155210453  155210452
rs63750426     2           2       II             II   47639603   47639604
rs63749850     2           2       DD             DD   47657002   47657001
rs63750662     2           2       DD             DD   47693926   47693925
rs63750854     2           2       DD             DD   48026545   48026544
rs63751167     2           2       DD             DD   48027083   48027082
rs63750250     7           7       DD             DD    6026565    6026564
rs63750106     7           7       DD             DD    6027090    6027089
i4000324       7           7       DD             DD  117282548  117282547
i5012770       8           8       II             II   90983442   90983446
i5012629       9           9       II             II   37426654   37426657
i5012665       9           9       II             II  104190767  104190770
rs8176719      9           9       DI             DI  136132909  136132908
i5012556      11          11       II             II   17417436   17417438
rs1799732     11          11       II             II  113346253  113346252
i5012880      11          11       II             II  118895981  118895982
rs62640570    12          12       DD             DD   88487681   88487680
i5012678      13          13       II             II   77575055   77575056
rs63750631    14          14       DD             DD   73640402   73640401
i4000391      15          15       DD             DD   72638921   72638920
rs5743293     16          16       DD             DD   50763782   50763781
i4000378      17          17       DD             DD   41209083   41209079
rs11568822    19          19       DD             DD   45417641   45417640
rs2032623      Y           Y        D              D   21878073   21878072
rs13110106     4           4       GG             GT  156523316  156523316
rs4473631      4           4       CC             AC  174501769  174501769
rs6477382      9           9       GG             AG    9252865    9252865
rs1387488     11          11       AG             GG   26045753   26045753
rs6488516     12          12       TT             CT   12514842   12514842
rs11045818    12          12       AG             GG   21329761   21329761
rs11065711    12          12       CC             CT  111188067  111188067
rs9533302     13          13       CT             CC   43490338   43490338
rs4981527     14          14       CC             CT   24940641   24940641

Documentation

Check out the socumentation that is available here to learn more about the snps package

In [33]:
tmp.close()
tmp_two.close()