LiftOver for summary statistics #34

Open
wants to merge 17 commits into
base: master

danielduyvo
Contributor

@danielduyvo danielduyvo commented Jul 18, 2023

Performs liftover on munged summary statistics. Adds documentation and tests for the new functions.
New functions:

  • readchain: Reads in a chain file mapping coordinates between two builds
  • liftoversumstats!: Takes a munged summary statistics DataFrame and a chain file DataFrame and performs liftover. Returns a tuple of the unmapped variants and the variants that map to multiple positions.

The functions are included in `mungesumstats.jl`. There are two
user-facing functions: `readchain` to read a chain file for liftover, and
`liftover_sumstats!` to lift over a munged summary statistics DataFrame.
Still to do: handle the edge case where the target in the chain file is
on the negative strand, and add examples to the documentation.
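
For the negative-strand edge case, the UCSC chain format lists coordinates on a `-` strand in terms of the reverse-complemented sequence. A minimal sketch of the conversion back to forward-strand coordinates, assuming 0-based half-open intervals; the helper names `toforward` and `forwardpos` are hypothetical and not part of this PR:

```julia
# Hypothetical helpers (not from this PR).
# Convert a 0-based half-open interval [start, stop) given on the reverse
# strand of a sequence of length seqsize to forward-strand coordinates.
toforward(start::Integer, stop::Integer, seqsize::Integer) =
    (seqsize - stop, seqsize - start)

# Convert a single 0-based position on the reverse strand to the
# corresponding 0-based position on the forward strand.
forwardpos(pos::Integer, seqsize::Integer) = seqsize - 1 - pos
```

Applying either function twice returns the original coordinates, which makes for a cheap sanity check in tests.
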

`liftover_gwas!` is now `liftover_sumstats` to be in line with
`mungesumstats!`. Additionally, `liftover_gwas!` can now be applied to
`AbstractVector{<:AbstractDataFrame}`, in line with `mungesumstats`
behavior. `parsechain` drops the "chr" prefix to be in line with
`mungesumstats!` behavior. Function parameter `echain` renamed to
`chain` (originally named `echain` for "expanded chain", which was
needlessly verbose since we never use the raw output from `parsechain`).
Documentation for the liftover functions added to the end of the summary
statistics tutorial and mentioned in the GENCODE GTF parsing tutorial.
Tests updated to cover liftover over a vector of DataFrames.

All the other functions omit underscores; rename to match style.
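
To illustrate what expanding a chain involves, here is a minimal sketch in plain Julia that turns chain-format lines into ungapped blocks and drops the "chr" prefix. It follows the UCSC chain layout (a `chain` header line followed by `size dt dq` lines), but the `Block` type and the `stripchr` and `expandchain` names are hypothetical, it assumes both strands are `+`, and it is not the implementation in this PR:

```julia
# Hypothetical record for one ungapped block (not the PR's data structure).
struct Block
    chr::String      # source chromosome, "chr" prefix removed
    srcstart::Int    # 0-based, inclusive
    srcend::Int      # 0-based, exclusive (half-open interval)
    dstchr::String   # destination chromosome, "chr" prefix removed
    dststart::Int    # 0-based start of the block in the destination build
end

# Drop a leading "chr" so names match munged summary statistics.
stripchr(name::AbstractString) = replace(name, r"^chr" => "")

function expandchain(lines)
    blocks = Block[]
    chr = ""
    dstchr = ""
    src = 0
    dst = 0
    for line in lines
        fields = split(line)
        isempty(fields) && continue
        if fields[1] == "chain"
            # chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
            chr = stripchr(fields[3])
            dstchr = stripchr(fields[8])
            src = parse(Int, fields[6])
            dst = parse(Int, fields[11])
        else
            # Data line: block length, then (except on the last line) the gaps
            # to the next block in the source and destination sequences.
            len = parse(Int, fields[1])
            push!(blocks, Block(chr, src, src + len, dstchr, dst))
            src += len
            dst += len
            if length(fields) == 3
                src += parse(Int, fields[2])
                dst += parse(Int, fields[3])
            end
        end
    end
    return blocks
end
```

For example, a chain with header `chain 1000 chr1 1000 + 100 220 chr1 2000 + 500 630 1` and data lines `50 10 20` and `60` yields two blocks on chromosome `1`, covering source intervals [100, 150) and [160, 220).
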
@codecov-commenter

codecov-commenter commented Jul 18, 2023

Codecov Report

Patch coverage: 68.35% and project coverage change: +3.81% 🎉

Comparison is base (419d74f) 60.38% compared to head (fa78716) 64.19%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #34      +/-   ##
==========================================
+ Coverage   60.38%   64.19%   +3.81%     
==========================================
  Files          12       12              
  Lines         982     1472     +490     
==========================================
+ Hits          593      945     +352     
- Misses        389      527     +138     

| Files Changed | Coverage Δ |
|---|---|
| src/GeneticsMakie.jl | 100.00% <ø> (ø) |
| src/mungesumstats.jl | 61.35% <68.35%> (+22.73%) ⬆️ |

... and 2 files with indirect coverage changes


Rewrote `findnewcoords` to sort the summary statistics and then iterate
through them in order, which makes liftover more efficient. `readchain`
now also sorts the chain file for the same purpose. This replaces the
slower old code, which used `DataFrames.subset` to find matching regions
between builds. Also fixed a bug in `readchain` where the ending
coordinate was treated as inclusive; the chain format uses half-open
intervals.
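
The sort-then-iterate idea can be sketched as a single sweep over sorted variants and sorted chain blocks. This is an illustration only, not the code in `findnewcoords`: the function name `liftpositions` and the tuple layouts are hypothetical, variant positions are assumed to be 1-based, and blocks are assumed to be non-overlapping in the source build (so it does not report variants that map to multiple positions).

```julia
# Hypothetical sketch (not the PR's implementation).
# blocks:   vector of (chr, srcstart, srcend, dststart), 0-based half-open
#           [srcstart, srcend), non-overlapping within a chromosome
# variants: vector of (chr, pos) with 1-based positions
function liftpositions(variants, blocks)
    blocks = sort(blocks)          # by chromosome, then source start
    order = sortperm(variants)     # visit variants in the same order
    lifted = Vector{Union{Missing, Int}}(missing, length(variants))
    j = 1
    for i in order
        chr, pos = variants[i]
        p0 = pos - 1               # convert to 0-based
        # Skip blocks that lie entirely before this position. The end is
        # exclusive, so a block whose srcend equals p0 does not contain it.
        while j <= length(blocks) && (blocks[j][1], blocks[j][3]) <= (chr, p0)
            j += 1
        end
        j > length(blocks) && break
        bchr, bstart, bend, dstart = blocks[j]
        if bchr == chr && bstart <= p0 < bend
            lifted[i] = dstart + (p0 - bstart) + 1   # back to 1-based
        end
    end
    return lifted
end
```

Because both inputs are sorted the same way, the block index `j` only moves forward, so the sweep costs one pass over the variants plus one pass over the blocks after sorting, instead of one subset operation per variant. Variants that fall in a gap or past the last block stay `missing`.
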
@danielduyvo
Contributor Author

Rewrote the liftover code to replace the calls to `DataFrames.subset` when finding the matching region in the chain file. It now processes 100,000+ variants per second.

@danielduyvo
Contributor Author

Added code for normalizing GWAS to the reference build and renaming the SNPs with rsIDs from a VCF file. Also extended the liftover functionality to handle indels as well as SNPs.

@danielduyvo
Contributor Author

Edits still needed:

  • Let the user choose which column is the reference allele (right now it is hard-coded so that A1 is the reference)
  • Redo the keyword arguments for liftover; there are too many and they are too verbose
