Description

A tutorial describing how to use the snps Python package to work with genotype files.

Tags & Data Sources

Genomics, AncestryDNA Upload, 23andMe Upload, Imputer, Gencove, Genome/Exome Upload, FamilyTreeDNA integration

Other Notebook Versions

gedankenstuecke's notebook


Notebook
Last updated 5 years, 1 month ago

snps

A tutorial using Open Humans data

This tutorial walks through the snps module: loading raw genotype files, detecting and changing the build, merging files, and saving the result. The last few cells let Open Humans members work with their own data.

In [1]:

! pip install snps

Requirement already satisfied: snps in /opt/conda/lib/python3.6/site-packages
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: atomicwrites in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: PyVCF in /opt/conda/lib/python3.6/site-packages (from snps)
Requirement already satisfied: pytz>=2011k in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: python-dateutil>=2.5.0 in /opt/conda/lib/python3.6/site-packages (from pandas->snps)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.6/site-packages (from PyVCF->snps)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->snps)
You are using pip version 9.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

In [2]:

import ipywidgets as widgets

import bz2
import gzip
import os
import requests
import tempfile

from io import StringIO
from ipywidgets import Layout
from ohapi import api
from snps import SNPs

Download Example Data

Let's download some example data from openSNP:

In [3]:

from snps.resources import Resources
r = Resources()

In [4]:

paths = r.download_example_datasets()

Load Raw Data

Load a 23andMe raw data file:

In [5]:

s = SNPs('resources/662.23andme.340.txt.gz')

The loaded SNPs are available via a pandas.DataFrame:

In [6]:

df = s.snps
df.columns.values

Out[6]:

array(['chrom', 'pos', 'genotype'], dtype=object)

In [7]:

df.index.name

Out[7]:

'rsid'
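
To make this table layout concrete, here is a minimal standard-library sketch of how a 23andMe-style raw data file (tab-separated, `#` comment lines, columns rsid, chromosome, position, genotype) maps onto rsid-keyed records with `chrom`, `pos`, and `genotype` fields. The sample rows and the helper `parse_raw_data` are made up for illustration; this is not how snps itself parses files.

```python
import csv
import io

# Made-up sample in the 23andMe raw data layout (not real genotype data).
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\t--\n"
)


def parse_raw_data(text):
    """Return {rsid: {'chrom', 'pos', 'genotype'}} from 23andMe-style text."""
    # Skip comment lines, then let csv split the tab-separated fields
    rows = (line for line in io.StringIO(text) if not line.startswith("#"))
    table = {}
    for rsid, chrom, pos, genotype in csv.reader(rows, delimiter="\t"):
        table[rsid] = {
            "chrom": chrom,
            "pos": int(pos),
            # "--" marks a no-call; store it as None
            "genotype": None if genotype == "--" else genotype,
        }
    return table


snps_table = parse_raw_data(SAMPLE)
print(snps_table["rs0000002"])
```

Keying the records by rsid mirrors the DataFrame's `rsid` index, so a single SNP can be looked up by name.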

In [8]:

len(df)

Out[8]:

snps also attempts to detect the build / assembly of the data:

In [9]:

s.build

Out[9]:

In [10]:

s.build_detected

Out[10]:

True

In [11]:

s.assembly

Out[11]:

'GRCh37'
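
One plausible way to detect a build is to compare the positions of a few marker SNPs against their known coordinates in each assembly. The sketch below illustrates that idea only; the lookup table, rsIDs, coordinates, and the function `detect_build` are all invented here, and snps may work differently.

```python
# Hypothetical lookup table: positions of marker SNPs in each build.
# The rsIDs and coordinates are invented for illustration.
MARKER_POSITIONS = {
    "rs0000001": {36: 900, 37: 1000, 38: 1100},
    "rs0000002": {36: 1900, 37: 2000, 38: 2150},
}

ASSEMBLY_NAMES = {36: "NCBI36", 37: "GRCh37", 38: "GRCh38"}


def detect_build(positions):
    """Return the build whose marker position matches the data, else None.

    positions maps rsid -> position observed in the loaded file.
    """
    for rsid, by_build in MARKER_POSITIONS.items():
        observed = positions.get(rsid)
        for build, expected in by_build.items():
            if observed == expected:
                return build
    return None


build = detect_build({"rs0000001": 1000, "rs0000002": 2000})
print(build, ASSEMBLY_NAMES.get(build))
```

Returning `None` when no marker matches corresponds to the case where a build cannot be detected from the data.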

Remap SNPs

Let's remap the SNPs to change the assembly / build:

In [12]:

# look up the position of a SNP by its rsID
s.snps.loc["rs3094315"].pos

Out[12]:

In [13]:

chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)

In [14]:

s.build

Out[14]:

In [15]:

s.assembly

Out[15]:

'GRCh38'

In [16]:

s.snps.loc["rs3094315"].pos

Out[16]:

SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
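
Conceptually, remapping translates each position from the coordinate system of one assembly into that of another, and some positions may have no counterpart. The following is a simplified sketch of that idea using made-up mapping segments; the `SEGMENTS` table and `remap_position` are hypothetical, and real remapping also has to account for strand changes, which are ignored here.

```python
# Hypothetical mapping segments for one chromosome:
# (source_start, source_end, target_start), with inclusive source coordinates.
SEGMENTS = [
    (1, 1500, 101),      # positions 1..1500 shift by +100
    (2000, 3000, 2250),  # positions 2000..3000 shift by +250
]


def remap_position(pos, segments=SEGMENTS):
    """Return the position in the target build, or None if it has no mapping."""
    for source_start, source_end, target_start in segments:
        if source_start <= pos <= source_end:
            return target_start + (pos - source_start)
    return None


# Made-up SNP positions in the source build
positions = {"rs0000001": 1000, "rs0000002": 2000, "rs0000003": 1700}
remapped = {rsid: remap_position(pos) for rsid, pos in positions.items()}
print(remapped)
```

A position that falls between segments, such as 1700 in this example, comes back as `None` rather than being silently carried over.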

Merge Raw Data Files

The example dataset consists of raw data files from two different DNA testing companies (23andMe and FamilyTreeDNA). Let's combine these files using a SNPsCollection.

In [17]:

from snps import SNPsCollection
sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")

Loading resources/662.ftdna-illumina.341.csv.gz

In [18]:

sc.build

Out[18]:

In [19]:

chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)

In [20]:

sc.snp_count

Out[20]:

As data is added, it is compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters such as discrepant_genotypes_threshold.)

In [21]:

sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)

Loading resources/662.23andme.340.txt.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
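
The behaviour reported in the log (keep the original positions, mark discrepant genotypes as null) can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the library's implementation: `merge_snps` and `same_genotype` are hypothetical helpers, genotypes are assumed to match regardless of allele order, and exceeding the threshold is assumed to abort the merge with an exception.

```python
def same_genotype(a, b):
    """Compare genotypes ignoring allele order (assumption: 'AG' == 'GA')."""
    return sorted(a) == sorted(b)


def merge_snps(existing, new, discrepant_genotypes_threshold=100):
    """Merge `new` into a copy of `existing`; return (merged, pos_diffs, gt_diffs).

    Both inputs map rsid -> (chrom, pos, genotype).
    """
    common = existing.keys() & new.keys()
    # SNPs whose chromosome or position differs between the two files
    pos_diffs = sorted(r for r in common if existing[r][:2] != new[r][:2])
    # SNPs called in both files with different genotypes
    gt_diffs = sorted(
        r for r in common
        if None not in (existing[r][2], new[r][2])
        and not same_genotype(existing[r][2], new[r][2])
    )
    if len(gt_diffs) > discrepant_genotypes_threshold:
        raise ValueError("too many discrepant genotypes; not merging")

    merged = dict(existing)  # original positions are kept
    for rsid, record in new.items():
        if rsid not in merged:
            merged[rsid] = record  # SNP only present in the new file
    for rsid in gt_diffs:
        merged[rsid] = merged[rsid][:2] + (None,)  # mark genotype as null
    return merged, pos_diffs, gt_diffs


# Made-up records for illustration
first = {"rs1": ("1", 100, "AG"), "rs2": ("1", 200, "CC"), "rs3": ("2", 300, "TT")}
second = {"rs1": ("1", 100, "GA"), "rs2": ("1", 205, "CT"), "rs4": ("2", 400, "GG")}
merged, pos_diffs, gt_diffs = merge_snps(first, second)
print(pos_diffs, gt_diffs, merged["rs2"])
```

In this toy data, `rs2` differs in both position and genotype, so it keeps its original position and ends up with a null genotype.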

In [22]:

len(sc.discrepant_snps)  # SNPs with discrepant positions and genotypes, dropping dups

Out[22]:

In [23]:

sc.snp_count

Out[23]:

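Save SNPs

So far we have remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset of more than 1 million SNPs to a CSV file: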

In [24]:

saved_snps = sc.save_snps()

Saving output/User662_GRCh37.csv

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

In [25]:

r = Resources(resources_dir='ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/')

In [26]:

saved_snps = sc.save_snps(vcf=True)

Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh37/Homo_sapiens.GRCh37.dna.chromosome.MT.fa.gz
Saving output/User662_GRCh37.vcf

All output files are saved to the output directory.
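
The reference sequences are needed because a VCF record states each genotype relative to the reference base at that position. A rough sketch of that conversion for a single diploid SNP follows; `vcf_record`, the short reference string, and the rsID are made up for illustration, and the library's own VCF output may include more fields.

```python
def vcf_record(rsid, chrom, pos, genotype, reference):
    """Build one VCF data line for a diploid genotype such as 'AG'.

    `reference` is the chromosome's reference sequence; VCF positions are 1-based.
    """
    ref = reference[pos - 1]
    # ALT lists each distinct non-reference allele once, in order of appearance
    alts = []
    for allele in genotype:
        if allele != ref and allele not in alts:
            alts.append(allele)
    alleles = [ref] + alts
    # GT encodes each allele as an index: 0 = REF, 1 and up = ALT alleles
    gt = "/".join(str(alleles.index(allele)) for allele in genotype)
    alt = ",".join(alts) if alts else "."
    # Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
    return "\t".join([chrom, str(pos), rsid, ref, alt, ".", ".", ".", "GT", gt])


# Made-up reference sequence and SNP for illustration
line = vcf_record("rs0000001", "1", 3, "AG", reference="CCAGT")
print(line)
```

Without the reference base there would be no way to decide which allele belongs in REF and which in ALT.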

Load Your Open Humans Data

In [27]:

user = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))

# Open Humans project IDs of the supported genotype data sources:
# 128 - 23andMe Upload
# 129 - AncestryDNA Upload
# 120 - FamilyTreeDNA Upload
# 40 - Gencove Upload
# 131 - Genome/Exome Upload
# 92 - Imputer

sources = [
    'direct-sharing-128',
    'direct-sharing-129',
    'direct-sharing-120',
    'direct-sharing-40',
    'direct-sharing-131',
    'direct-sharing-92',
]

# collect (label, download URL) pairs for the member's files from those sources
available_sources = []

for record in user['data']:
    if record['source'] in sources:
        description = f"{record['basename']}--{record['metadata']['description']}--{record['id']}"
        download_url = record['download_url']
        available_sources.append((description, download_url))

selected_file = widgets.Dropdown(
    options=available_sources,
    description='Select a Genotype File:',
    disabled=False,
    layout=Layout(width='80%', height='50px'),
    style={'description_width': 'initial'},
)
display(selected_file)

def process_selected_file(selected_file_url):
    """Download the file and return its bytes, decompressing bz2 or gzip data."""
    datafile = requests.get(selected_file_url)
    try:
        return bz2.decompress(datafile.content)
    except OSError:
        pass  # not bz2; try gzip next
    try:
        return gzip.decompress(datafile.content)
    except OSError:
        # not compressed; return the raw bytes
        return datafile.content

processed_file = process_selected_file(selected_file.value)
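
An alternative to trying each decompressor in turn is to inspect the leading magic bytes of the download. This is a small sketch of that approach rather than a change to the cell above; `decompress_if_needed` is a hypothetical helper, and the sample row is made up.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
BZ2_MAGIC = b"BZh"        # first three bytes of every bzip2 stream


def decompress_if_needed(content):
    """Return decompressed bytes for gzip or bz2 input; return other input as-is."""
    if content.startswith(GZIP_MAGIC):
        return gzip.decompress(content)
    if content.startswith(BZ2_MAGIC):
        return bz2.decompress(content)
    return content


raw = b"rs0000001\t1\t1000\tAA\n"  # made-up row
for payload in (raw, gzip.compress(raw), bz2.compress(raw)):
    print(decompress_if_needed(payload) == raw)
```

Checking the header first means a genuinely corrupt gzip or bz2 file raises an error instead of being passed on as if it were plain text.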

In [28]:

tmp = tempfile.NamedTemporaryFile()
tmp.write(processed_file)
# flush so the data is on disk before the file is read back by name
tmp.flush()

In [29]:

s = SNPs(tmp.name)

In [30]:

# close the temporary file (in Open Humans notebook only)
tmp.close()

Documentation

Documentation is available here.

Details for `snps tutorial.ipynb`

Published by arvkevi
