Use HAPNEST data for gwas demo? #43

jeromekelleher · 2023-11-23T19:37:19Z

The 1 million sample HAPNEST dataset (https://github.com/pystatgen/sgkit/discussions/1144#discussioncomment-7654640) seems ideal for our purposes.

Larger than UKB, and no messing with data access problems. Also lets us showcase our plink format support.

Any thoughts @hammer ?

jeromekelleher · 2023-11-23T19:38:00Z

Also includes phenotypes, btw

jeromekelleher · 2023-11-27T11:36:35Z

The advantages of a fully reproducible analysis pipeline to go along with the paper seem compelling to me. Working with something like UKB inevitably introduces friction. This synthetic dataset has been carefully curated for realism, and I'm not sure what extra we'd be showing by working with actual data.

There's a neatness to demonstrating that we can work with two different synthetic datasets at the 1 million sample scale, through both VCF and plink.

If we make it a requirement that all of the things that go into the paper are fully reproducible (which chimes well with the overall philosophy of openness), and we want to do something at the largest scale, then this seems like a great way to go.

hammer · 2023-11-27T15:53:50Z

I will have a look this week! I've been using GitHub Codespaces so far for my explorations and will need to think about how to scale the experiments. We hit some scalability issues last time we tried to do a GWAS at the UKB scale (https://github.com/pystatgen/sgkit/issues/390), so I may also need to get some help resolving those issues.

A quick look at the S-BSST936 listing shows the .bed files range from 141.37 GB (chr2) to 27.64 GB (chr21). I wonder if anyone has put this data on a cloud object store already? I'll poke around a bit to save myself the download time.

> two different synthetic datasets at the 1 million sample scale, through both VCF and plink.

@jeromekelleher forgive my ignorance but do we have a VCF synthetic data set at this scale as well?

hammer · 2023-11-27T15:58:11Z

Some places to look for this data on cloud storage already:

Registry of Open Data on AWS: don't see it
Google Cloud Datasets: don't see it
Azure Open Datasets: don't see it

jeromekelleher · 2023-11-27T16:04:02Z

> @jeromekelleher forgive my ignorance but do we have a VCF synthetic data set at this scale as well?

Yep - our data/basic compute task scaling figure goes up to a million samples, taken as subsets of the 1.4M in the simulations provided in this paper

(Note: @benjeffery and I are planning to add another line for the SAV file format/C++ toolkit here. Fig is also quite drafty, obvs)

hammer · 2023-11-29T05:50:27Z

Okay figured out their FTP structure, everything is under ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files. Will start moving to a cloud store now.

For my reference, I'm using a command like:

curl ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files/example/<file> | gsutil cp - gs://<bucket>/<file>
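
The same curl-to-gsutil pipe could be scripted for a batch of files. The following is a minimal Python sketch, not the command actually used for the transfer: `build_commands` and `stream_to_bucket` are hypothetical helper names, the extra curl flags (`--fail`, `--silent`, `--show-error`) are an addition for error reporting, and the file and bucket names are left for the caller to supply, as in the command above.

```python
import subprocess


def build_commands(base_url, bucket, filename):
    """Return the (curl, gsutil) argument lists for streaming one file.

    base_url, bucket and filename are caller-supplied; nothing about the
    HAPNEST file naming is assumed here.
    """
    src = f"{base_url.rstrip('/')}/{filename}"
    dst = f"gs://{bucket}/{filename}"
    curl_cmd = ["curl", "--fail", "--silent", "--show-error", src]
    gsutil_cmd = ["gsutil", "cp", "-", dst]  # "-" makes gsutil read from stdin
    return curl_cmd, gsutil_cmd


def stream_to_bucket(base_url, bucket, filename):
    """Pipe curl's stdout into gsutil without writing to local disk."""
    curl_cmd, gsutil_cmd = build_commands(base_url, bucket, filename)
    curl = subprocess.Popen(curl_cmd, stdout=subprocess.PIPE)
    gsutil = subprocess.Popen(gsutil_cmd, stdin=curl.stdout)
    # Close our copy of the pipe so curl sees a broken pipe if gsutil exits early.
    curl.stdout.close()
    gsutil_rc = gsutil.wait()
    curl_rc = curl.wait()
    if curl_rc != 0 or gsutil_rc != 0:
        raise RuntimeError(
            f"transfer of {filename} failed (curl={curl_rc}, gsutil={gsutil_rc})"
        )
```

Calling `stream_to_bucket(base_url, bucket, name)` in a loop over the file names would move one file at a time; checking both exit codes matters because a failed download would otherwise leave a truncated object in the bucket.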

Transfer speeds not so bad, seeing around 27 MiB/s, will take about 17 minutes for chr21 and probably 2 hours or so for chr1. Will kick off a big transfer tomorrow.
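
As a rough check on those estimates, the arithmetic can be sketched in a few lines of Python. This assumes the sizes on the S-BSST936 listing are decimal gigabytes and that the observed 27 MiB/s rate holds for the whole transfer; `transfer_minutes` is a hypothetical helper, and only the two file sizes quoted earlier in this thread are used.

```python
MIB = 1024 ** 2  # bytes per MiB
GB = 10 ** 9     # bytes per decimal GB (assumed unit of the listing)


def transfer_minutes(size_gb, rate_mib_per_s=27.0):
    """Estimated minutes to stream size_gb gigabytes at rate_mib_per_s."""
    seconds = size_gb * GB / (rate_mib_per_s * MIB)
    return seconds / 60


# Sizes quoted from the S-BSST936 listing earlier in the thread.
sizes_gb = {"chr2": 141.37, "chr21": 27.64}
for name, size in sizes_gb.items():
    print(f"{name}: about {transfer_minutes(size):.0f} min")
```

Under these assumptions chr21 comes out at a little over 16 minutes, in line with the figure above, and the largest listed file (chr2) at roughly 83 minutes.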

hammer · 2023-12-03T16:40:30Z

Okay I've gotten our GWAS demo running using one chromosome and one phenotype of the example (600 subjects) data.

Notebook is at https://github.com/hammer/sgkitpub/blob/main/hapnest_gwas.ipynb

Some thoughts:

GWAS demo uses a quantitative trait, while HAPNEST has binary traits.
GWAS demo uses a VCF file with sequencing features, so QC is a lot more interesting.
At least for the first phenotype on the first chromosome, there's no association to find.

I will next try to scale to all chromosomes and all phenotypes on the example data, then go to the big dataset.

hammer · 2023-12-24T15:32:03Z

Just noting for myself: tools from other language ecosystems that might be fun to try out in this section include https://github.com/privefl/bigsnpr (GWAS docs) and https://github.com/OpenMendel/MendelGWAS.jl

hammer mentioned this issue Dec 4, 2023

Add ability to read from cloud storage fastlmm/bed-reader#22

Closed

hammer mentioned this issue Jan 3, 2024

Statistical genetics section #78

Open
