Ploidy #19

Open
3 tasks done
dramanica opened this issue Mar 25, 2024 · 11 comments

@dramanica
Member

dramanica commented Mar 25, 2024

We could support multiple ploidy levels. A column can only contain one vector, so storing per-individual ploidy information in the genotypes column itself is difficult. However, we could simply add a ploidy column to the $fam slot of the bigSNP object stored in the attributes of genotypes. To implement that, we would need to:

  • Add a ploidy attribute to the genotypes column whenever a gen_tibble is created: an integer giving the ploidy of all individuals (if it is the same) or 0 to indicate mixed ploidy. Functions that rely on diploid data should then check that ploidy is 2.
  • Work out how to read mixed-ploidy data (when parsing a VCF) so that ploidy information is stored when the data is imported.
  • Create a couple of example functions to operate on mixed ploidy to illustrate how it is done.
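
A minimal sketch of the first task in base R. The helper names `set_ploidy()` and `stop_if_not_diploid()` are hypothetical, and a plain list stands in for the genotypes column; this only illustrates the idea and is not the package's code.

```r
# Hypothetical helper: record ploidy as an attribute on a genotypes object.
# A ploidy shared by all individuals is stored as that integer;
# mixed ploidy is stored as 0.
set_ploidy <- function(genotypes, indiv_ploidy) {
  indiv_ploidy <- as.integer(indiv_ploidy)
  shared <- unique(indiv_ploidy)
  attr(genotypes, "ploidy") <- if (length(shared) == 1L) shared else 0L
  genotypes
}

# Hypothetical guard for functions that are only valid for diploids.
stop_if_not_diploid <- function(genotypes) {
  ploidy <- attr(genotypes, "ploidy")
  if (is.null(ploidy) || ploidy != 2L) {
    stop("this function only works on diploid data")
  }
  invisible(TRUE)
}

# A plain list stands in for the genotypes column here.
diploid_gt <- set_ploidy(list("a", "b", "c"), c(2, 2, 2))
mixed_gt <- set_ploidy(list("a", "b", "c"), c(2, 4, 4))

attr(diploid_gt, "ploidy")
attr(mixed_gt, "ploidy")
stop_if_not_diploid(diploid_gt)
```

Calling `stop_if_not_diploid(mixed_gt)` raises an error, which is the behaviour a diploid-only function would want before doing any work.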
@dramanica dramanica added the enhancement New feature or request label Mar 25, 2024
@dramanica
Member Author

A package to check out
https://cran.r-project.org/web/packages/StAMPP/index.html

@dramanica
Member Author

A possible mixed ploidy dataset that looks pretty suitable as an example is:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6758580/
We should see how easy it is to put together the dataset, but it could work as a test set for mixed ploidy.

@dramanica
Member Author

An interesting paper detailing tools and theory
https://pubmed.ncbi.nlm.nih.gov/36720820/

@dramanica
Member Author

There is now a ploidy branch. It implements storing the ploidy value, which is currently set to 2 by default in all cases. There are also checks in most basic functions that stop them from operating if ploidy is not 2 (some might be fine with other ploidies, but they need to be tested properly).

@dramanica
Member Author

We now have loci_maf(), loci_alt_freq() and loci_missingness() working with mixed ploidy. The unit tests (not complete) are bunched up in test_show_ploidy, but later it would be better to move them under each function. I envision that, for each function, we would have one testthat section for diploids and one for polyploids.
In terms of implementation, I am of the mind of using an optimised version for diploids, and a more generic function for polyploids (working on both homogeneous and mixed ploidy). Right now, I am coding the latter mostly in R; we can then think about whether some functions would be best moved to C for speed. But I think it makes sense to get the functionality first, and then focus on the bottlenecks for optimisation.
Finally, @Euphrasiologist we should probably have a look together at VCF reading to think about how to bring in polyploids (again, probably with an optimised diploid version like we have now, and then a more general option for polyploids). I think we could look at the first locus, count the alleles for each individual, and then use that to decide how to process the VCF. In terms of parsing, if we need a generic function, it would take the odd-numbered characters of each genotype string and sum them (for 1/0/1/1, summing the 1st, 3rd, 5th and 7th characters).
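
A rough sketch of that idea in base R, assuming single-character alleles (0, 1 or .) so that alleles sit at the odd positions of the genotype string. The data and the helper name `odd_chars()` are made up for illustration.

```r
# Genotype strings for the first locus, one per individual (made-up data).
first_locus <- c("0/1", "1/0/1/1", "0|0", "./././.")

# With single-character alleles, alleles sit at the odd positions
# (1st, 3rd, 5th, ...) and separators at the even ones.
odd_chars <- function(gt) {
  chars <- strsplit(gt, "")[[1]]
  chars[seq(1, length(chars), by = 2)]
}

# Ploidy is the number of allele characters.
ploidy <- vapply(first_locus, function(gt) length(odd_chars(gt)),
                 integer(1), USE.NAMES = FALSE)

# Dosage is the sum of the allele characters, NA if any allele is missing.
dosage <- vapply(first_locus, function(gt) {
  alleles <- odd_chars(gt)
  if (any(alleles == ".")) NA_real_ else sum(as.numeric(alleles))
}, numeric(1), USE.NAMES = FALSE)

# Pick a parsing route from what the first locus shows.
parser <- if (all(ploidy == 2L)) "optimised diploid" else "generic polyploid"
```

Note that the positional trick breaks down for multi-digit allele indices (e.g. 10/2), where splitting on the separators is the safer route.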

@dramanica
Member Author

A quick polyploid parser for a vector of genotypes:

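genotypes <- c("1/0/1/1", "0/0/0/1", "1|1|0|1", "./././.")
# split each genotype into alleles (unphased "/" or phased "|")
alleles <- strsplit(genotypes, "[/|]")
# get ploidy for each individual
lengths(alleles)
# get dosage for each individual (assumes biallelic loci coded 0/1);
# a genotype with any missing allele (".") is returned as NA
poly_dosage <- function(x) {
  if (any(x == ".")) {
    return(NA_real_)
  }
  sum(as.numeric(x))
}
vapply(alleles, poly_dosage, numeric(1))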

@dramanica dramanica mentioned this issue May 13, 2024
@dramanica
Member Author

We now have the full ploidy infrastructure in main. We don't do anything with multiple ploidy data, but in principle we do have the infrastructure for it.

@dramanica
Member Author

A lot of pop gen formulae for mixed ploidy rely on computing the pop frequencies as mean of individual frequencies (i.e. standardising the impact of ploidy so that the individual is the unit of replication, rather than the allele; this distinction is not important when dealing with uniform ploidy). If we adapt a couple of the cpp functions to compute frequencies from individual frequencies, then we should be able to easily adapt a lot of functions to multiple ploidy.
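
To make the distinction concrete, here is a small base R illustration with made-up dosages and ploidies for a single locus. The variable names are only for this example.

```r
# Made-up data for one locus: alternate-allele dosage and ploidy per individual.
dosage <- c(1, 4, 0, 2)
ploidy <- c(2, 4, 2, 4)

# Allele as the unit: pool all allele copies, so tetraploids weigh twice
# as much as diploids.
freq_by_allele <- sum(dosage) / sum(ploidy)

# Individual as the unit: average the per-individual frequencies.
freq_by_indiv <- mean(dosage / ploidy)

# With uniform ploidy the two estimates coincide.
dosage_dip <- c(1, 2, 0, 2)
ploidy_dip <- c(2, 2, 2, 2)
freq_dip_allele <- sum(dosage_dip) / sum(ploidy_dip)
freq_dip_indiv <- mean(dosage_dip / ploidy_dip)
```

In the mixed-ploidy case the two estimates differ (7/12 versus 1/2), whereas for the all-diploid data both give 5/8.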

@Euphrasiologist
Collaborator

I'm unable to compile on my work Mac... but the changes look fairly simple in the cpp code? Do you think you can add these in please? We should absolutely make the full push to multiple ploidy - the plant genetics/genomics community will be grateful!

@dramanica
Member Author

Compiling on Mac requires some setup: https://cran.r-project.org/bin/macosx/tools/
We have the infrastructure for ploidy, but adapting all functions (and testing them) is a fair amount of work, so I am not sure when that will happen unless we get some more manpower. But I am happy to advise anyone who wants to help.

@Euphrasiologist
Collaborator

It's more of a permissions thing on my Mac at ARU... I'll see what I can do! If I can get it working maybe we can meet and chat about what would need to be done :) Cheers!
