Details for snps tutorial.ipynb

Published by arvkevi

Description

A tutorial describing how to use the snps Python package to work with genotype files.

2

Tags & Data Sources

Genomics AncestryDNA Upload 23andMe Upload Imputer Gencove Genome/Exome Upload FamilyTreeDNA integration

Other Notebook Versions

Comments

Please log in to comment.

Notebook
Last updated 4 years, 9 months ago

snps

A tutorial using Open Humans data

This tutorial will walk through how to use the snps module. The last few cells are included to allow Open Humans members to work with their own data.

Requirement already satisfied: snps in /opt/conda/lib/python3.6/site-packages
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: atomicwrites in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: PyVCF in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from PyVCF->snps)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->snps)
You are using pip version 9.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Download Example Data

Let's download some example data from openSNP:

Load Raw Data

Load a 23andMe raw data file:

The loaded SNPs are available via a pandas.DataFrame:

Out[6]:
array(['chrom', 'pos', 'genotype'], dtype=object)
Out[7]:
'rsid'
Out[8]:
991786

snps also attempts to detect the build / assembly of the data:

Out[9]:
37
Out[10]:
True
Out[11]:
'GRCh37'

Remap SNPs

Let's remap the SNPs to change the assembly / build:

Out[12]:
752566
Out[14]:
38
Out[15]:
'GRCh38'
Out[16]:
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).

Merge Raw Data Files

The dataset consists of raw data files from two different DNA testing companies. Let's combine these files using a SNPsCollection.

Loading resources/662.ftdna-illumina.341.csv.gz
Out[18]:
36
Out[20]:
708092

As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)

Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
Out[22]:
169
Out[23]:
1006960

Save SNPs

Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M+ SNPs to a CSV file:

Saving output/User662_GRCh37.csv

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All output files are saved to the output directory.

Load Your Open Humans Data

Documentation

Documentation is available here.

Notebook
Last updated 4 years, 9 months ago

snps

A tutorial using Open Humans data

This tutorial will walk through how to use the snps module. The last few cells are included to allow Open Humans members to work with their own data.

In [1]:
! pip install snps
Requirement already satisfied: snps in /opt/conda/lib/python3.6/site-packages
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: atomicwrites in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: PyVCF in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from PyVCF->snps)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->snps)
You are using pip version 9.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [2]:
import ipywidgets as widgets

import bz2
import gzip
import os
import requests
import tempfile

from io import StringIO
from ipywidgets import Layout
from ohapi import api
from snps import SNPs

Download Example Data

Let's download some example data from openSNP:

In [3]:
from snps.resources import Resources
r = Resources()
In [4]:
paths = r.download_example_datasets()

Load Raw Data

Load a 23andMe raw data file:

The loaded SNPs are available via a pandas.DataFrame:

In [5]:
s = SNPs('resources/662.23andme.340.txt.gz')
In [6]:
df = s.snps
df.columns.values
Out[6]:
array(['chrom', 'pos', 'genotype'], dtype=object)
In [7]:
df.index.name
Out[7]:
'rsid'
In [8]:
len(df)
Out[8]:
991786

snps also attempts to detect the build / assembly of the data:

In [9]:
s.build
Out[9]:
37
In [10]:
s.build_detected
Out[10]:
True
In [11]:
s.assembly
Out[11]:
'GRCh37'

Remap SNPs

Let's remap the SNPs to change the assembly / build:

In [12]:
# find the position of a named snp
s.snps.loc["rs3094315"].pos
Out[12]:
752566
In [13]:
chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
In [14]:
s.build
Out[14]:
38
In [15]:
s.assembly
Out[15]:
'GRCh38'
In [16]:
s.snps.loc["rs3094315"].pos
Out[16]:
817186

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).

Merge Raw Data Files

The dataset consists of raw data files from two different DNA testing companies. Let's combine these files using a SNPsCollection.

In [17]:
from snps import SNPsCollection
sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
Loading resources/662.ftdna-illumina.341.csv.gz
In [18]:
sc.build
Out[18]:
36
In [19]:
chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
In [20]:
sc.snp_count
Out[20]:
708092

As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)

In [21]:
sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
In [22]:
len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups
Out[22]:
169
In [23]:
sc.snp_count
Out[23]:
1006960

Save SNPs

Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M+ SNPs to a CSV file:

In [24]:
saved_snps = sc.save_snps()
Saving output/User662_GRCh37.csv

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

In [25]:
r = Resources(resources_dir='ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/')
In [26]:
saved_snps = sc.save_snps(vcf=True)
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All output files are saved to the output directory.

Load Your Open Humans Data

In [27]:
user = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))

# 128 - 23andMe Upload
# 129 - AncestryDNA Upload
# 120 - FamilyTreeDNA Upload
# 40 - Gencove Upload
# 131 - Genome/Exome Upload
# 92 - Imputer

sources = [
    'direct-sharing-128',
    'direct-sharing-129',
    'direct-sharing-120',
    'direct-sharing-40',
    'direct-sharing-131',
    'direct-sharing-92',
]

available_sources = []

for record in user['data']:
    if record['source'] in sources:
        description = f"{record['basename']}--{record['metadata']['description']}--{record['id']}"
        download_url = record['download_url']
        available_sources.append((description, download_url))

        
selected_file = widgets.Dropdown(
                            options=available_sources,
                            description='Select a Genotype File:',
                            disabled=False,
    layout=Layout(width='80%', height='50px'),
    style = {'description_width': 'initial'}
)
display(selected_file)

def process_selected_file(selected_file_url):
    datafile = requests.get(selected_file_url)
    try:
        try:
            try:
                return bz2.decompress(datafile.content)
                #handle.write(textobj)
            except OSError:
                return gzip.decompress(datafile.content)
                #handle.write(textobj)
        except OSError:
            return datafile.content
    except Exception as e:
        raise e
        return None

processed_file = process_selected_file(selected_file.value)
In [28]:
tmp = tempfile.NamedTemporaryFile()
tmp.write(processed_file);
In [29]:
s = SNPs(tmp.name)
In [30]:
# close the temporary file (in Open Humans notebook only)
tmp.close()

Documentation

Documentation is available here.