snps tutorial.ipynb
See how the snps
package can be used to merge two of your genotyping files which are on Open Humans
Let's download some example data from openSNP:
Load a 23andMe raw data file:
The loaded SNPs are available via a pandas.DataFrame:
snps also attempts to detect the build / assembly of the data:
Let's remap the SNPs to change the assembly / build:
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
The dataset consists of raw data files from two different DNA testing companies. Let's combine these files using a SNPsCollection.
As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)
Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M+ SNPs to a CSV file:
Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
All output files are saved to the output
directory.
Now we'll load those two files and put them into SNP
objects to then see how many SNPs are in each.
Let's have a look how many SNPs are in each of our two genotyping files we've selected above:
Now we can go ahead and try to merge those two files into a single SNP collection
and see what happens:
Let us now look at the discrepant SNPs that we found between our files:
! pip install snps
import ipywidgets as widgets
import bz2
import gzip
import os
import requests
import tempfile
from io import StringIO
from ipywidgets import Layout
from ohapi import api
from snps import SNPs
Let's download some example data from openSNP:
from snps.resources import Resources
r = Resources()
paths = r.download_example_datasets()
Load a 23andMe raw data file:
The loaded SNPs are available via a pandas.DataFrame:
s = SNPs('resources/662.23andme.340.txt.gz')
df = s.snps
df.columns.values
df.index.name
len(df)
snps also attempts to detect the build / assembly of the data:
s.build
s.build_detected
s.assembly
Let's remap the SNPs to change the assembly / build:
# find the position of a named snp
s.snps.loc["rs3094315"].pos
chromosomes_remapped, chromosomes_not_remapped = s.remap_snps(38)
s.build
s.assembly
s.snps.loc["rs3094315"].pos
SNPs can be remapped between Build 36 (NCBI36), Build 37 (GRCh37), and Build 38 (GRCh38).
The dataset consists of raw data files from two different DNA testing companies. Let's combine these files using a SNPsCollection.
from snps import SNPsCollection
sc = SNPsCollection("resources/662.ftdna-illumina.341.csv.gz", name="User662")
sc.build
chromosomes_remapped, chromosomes_not_remapped = sc.remap_snps(37)
sc.snp_count
As the data gets added, it's compared to the existing data, and SNP position and genotype discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.)
sc.load_snps(["resources/662.23andme.340.txt.gz"], discrepant_genotypes_threshold=300)
len(sc.discrepant_snps) # SNPs with discrepant positions and genotypes, dropping dups
sc.snp_count
Ok, so far we've remapped the SNPs to the same build and merged the SNPs from two files, identifying discrepancies along the way. Let's save the merged dataset consisting of over 1M+ SNPs to a CSV file:
saved_snps = sc.save_snps()
Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
r = Resources(resources_dir='ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/')
saved_snps = sc.save_snps(vcf=True)
All output files are saved to the output
directory.
user = api.exchange_oauth2_member(os.environ.get('OH_ACCESS_TOKEN'))
# 128 - 23andMe Upload
# 129 - AncestryDNA Upload
# 120 - FamilyTreeDNA Upload
# 40 - Gencove Upload
# 131 - Genome/Exome Upload
# 92 - Imputer
sources = [
'direct-sharing-128',
'direct-sharing-129',
'direct-sharing-120',
'direct-sharing-40',
'direct-sharing-131',
'direct-sharing-92',
]
available_sources = []
for record in user['data']:
if record['source'] in sources:
description = f"{record['basename']}--{record['metadata']['description']}--{record['id']}"
download_url = record['download_url']
available_sources.append((description, download_url))
selected_file = widgets.Dropdown(
options=available_sources,
description='Select the first Genotype File:',
disabled=False,
layout=Layout(width='80%', height='50px'),
style = {'description_width': 'initial'}
)
display(selected_file)
selected_file_two = widgets.Dropdown(
options=available_sources,
description='Select the second Genotype File:',
disabled=False,
layout=Layout(width='80%', height='50px'),
style = {'description_width': 'initial'}
)
display(selected_file_two)
def process_selected_file(selected_file_url):
datafile = requests.get(selected_file_url)
try:
try:
try:
return bz2.decompress(datafile.content)
#handle.write(textobj)
except OSError:
return gzip.decompress(datafile.content)
#handle.write(textobj)
except OSError:
return datafile.content
except Exception as e:
raise e
return None
Now we'll load those two files and put them into SNP
objects to then see how many SNPs are in each.
processed_file = process_selected_file(selected_file.value)
processed_file_two = process_selected_file(selected_file_two.value)
tmp = tempfile.NamedTemporaryFile()
tmp.write(processed_file);
tmp_two = tempfile.NamedTemporaryFile()
tmp_two.write(processed_file_two)
first_snp_file = SNPs(tmp.name)
second_snp_file = SNPs(tmp_two.name)
# close the temporary file (in Open Humans notebook only)
Let's have a look how many SNPs are in each of our two genotyping files we've selected above:
print("The first file has {} SNPs".format(first_snp_file.snp_count))
print("The second file has {} SNPs".format(second_snp_file.snp_count))
Now we can go ahead and try to merge those two files into a single SNP collection
and see what happens:
sc = SNPsCollection(tmp.name, name="me")
sc.load_snps([tmp_two.name], discrepant_genotypes_threshold=300)
print('Between those two files there are {} discrepancies'.format(len(sc.discrepant_snps)))
Let us now look at the discrepant SNPs that we found between our files:
print(sc.discrepant_snps)
tmp.close()
tmp_two.close()