Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in ses.pd function #17

Closed
rafmolven opened this issue Mar 19, 2019 · 3 comments
Closed

Bug in ses.pd function #17

rafmolven opened this issue Mar 19, 2019 · 3 comments
Assignees
Labels

Comments

@rafmolven
Copy link

rafmolven commented Mar 19, 2019

Hi everyone,

I write to report a bug in the ses.pd function. As you may know, the arguments of ses.pd are passed through the function pd, which means that one can specify whether to include the distance from the most recent common ancestor of the species and the root of the tree (MRCA – root distance). The default setting (in the pd function) is "include.root = TRUE". However, I have noted that ses.pd computes null PD values without including the MRCA – root distance regardless of the logical value that is specified in the include.root argument. This is, if include.root = TRUE (default = TRUE), the observed PD will include the MRCA – root distance, but the null PD values will not. For example:

require(picante)
data(phylocom)
phylocom$sample -> Sample
phylocom$phylo -> Tree
prune.sample(Sample,Tree)->Tree_F
Sample <- Sample[,Tree_F$tip.label]
set.seed(12345) # fix the seed to make the analyses reproducible
ses.pd(Sample,Tree_F,null.model="taxa.labels",runs=999,include.root=FALSE) ->
No_root_1 # the MRCA – root distance should be excluded from all the computations
set.seed(12345) # back to the same seed
ses.pd(Sample,Tree_F,null.model="taxa.labels",runs=999,include.root=TRUE) ->
root_1 # the MRCA – root distance should be included in all the computations

No_root_1
root_1

A quick look to the results reveals that as expected, the observed PD values between "No_root_1" and "root_1" are different for communities "clump1" and "clump2a" (the only two communities where the root of the tree is not traversed by the minimum spanning path connecting the species in the phylogeny), because in "No_root_1" the include.root argument was set to FALSE. HOWEVER, the descriptors of the null distributions in "No_root_1" and "root_1" are identical, and this is because the null PD values are computed without including the MRCA – root distance in all cases regardless of the logical value that is specified in the include.root argument.

The good news are that this anomaly only significantly affects communities with low species richness (less than four species). For more details, you may take a look to the following link, which includes a preprint with detailed information on the issue (a step by step tutorial) along with a simulation exercise (which informs about the communities that may be affected) and two alternatives to fix the problem (all the R code is fully available in the Supplementary Material of the preprint): https://www.biorxiv.org/content/10.1101/579300v1

Rafael Molina Venegas

@skembel skembel self-assigned this Mar 20, 2019
@skembel skembel added the bug label Mar 20, 2019
@skembel
Copy link
Owner

skembel commented Mar 20, 2019

Thanks for reporting this. This is indeed a bug, I'll push an updated version of the package with this function fixed to CRAN soon, and make an announcement about it.

@skembel
Copy link
Owner

skembel commented Mar 21, 2019

Hello,
this bug has been patched. The updated version of picante with this bug fixed (version 1.8) has been submitted to CRAN. ses.pd now functions correctly for both include.root=TRUE and include.root=FALSE arguments. Thanks for bringing this to my attention.

I remain perplexed about how this bug happened and would welcome any feedback or insights. All functions in picante were tested against various example data sets and I did not catch this error. Looking at the unpatched version of the code that caused the error

ses.pd <-
function (samp, tree, null.model = c("taxa.labels", "richness", "frequency", "sample.pool",
    "phylogeny.pool", "independentswap", "trialswap"), runs = 999, iterations = 1000, 
    ...)
{
    pd.obs <- as.vector(pd(samp, tree, ...)$PD)
    null.model <- match.arg(null.model)
    pd.rand <- switch(null.model,
    	taxa.labels = t(replicate(runs, as.vector(pd(samp, tipShuffle(tree), ...)$PD))),
    	richness = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="richness"), tree, ...)$PD))),
    	frequency = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="frequency"), tree, ...)$PD))),
    	sample.pool = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="richness"), tree, ...)$PD))),
    	phylogeny.pool = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="richness"),
    		tipShuffle(tree), ...)$PD))),
    	independentswap = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="independentswap", iterations), tree, ...)$PD))),
    	trialswap = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="trialswap", iterations), tree, ...)$PD)))
    )
    pd.rand.mean <- apply(X = pd.rand, MARGIN = 2, FUN = mean, na.rm=TRUE)
    pd.rand.sd <- apply(X = pd.rand, MARGIN = 2, FUN = sd, na.rm=TRUE)
    pd.obs.z <- (pd.obs - pd.rand.mean)/pd.rand.sd
    pd.obs.rank <- apply(X = rbind(pd.obs, pd.rand), MARGIN = 2,
        FUN = rank)[1, ]
    pd.obs.rank <- ifelse(is.na(pd.rand.mean),NA,pd.obs.rank)
    data.frame(ntaxa=specnumber(samp),pd.obs, pd.rand.mean, pd.rand.sd, pd.obs.rank,
        pd.obs.z, pd.obs.p=pd.obs.rank/(runs+1),runs=runs, row.names = row.names(samp))
}

If ses.pd was called with the argument include.root=FALSE, the argument was passed to the pd function in the ellipsis ... object.
This worked correctly for the calculation of observed pd
pd.obs <- as.vector(pd(samp, tree, ...)$PD)
but then it was apparently not passed to the pd function during the computation of the randomized pd values, e.g.
taxa.labels = t(replicate(runs, as.vector(pd(samp, tipShuffle(tree), ...)$PD))),
resulted in a calculation with include.root=TRUE regardless of the argument supplied to ses.pd

The patched version of ses.pd avoids this issue by hard-coding the include.root argument and has been tested to work correctly.

@skembel skembel pinned this issue Mar 21, 2019
@skembel skembel closed this as completed Mar 21, 2019
@rafmolven
Copy link
Author

Just a minor comment on Steve clarifications. The issue concerns the null PD values rather than the observed PD. As such, if include.root = TRUE within ses.pd, then the observed PD values will include the distance from the most recent common ancestor of the species in the communities and the root of the tree (as expected), BUT the null PD values will not (resulting in biases SES.PD values, particularly in low richness communities). However, when include.root = FALSE, then such distance is not included neither in the observed PD or the null PD values. Therefore, analyses conducted with include.root = FALSE are not affected, but note that the default setting of the function is include.root = TRUE.

@kerby1998 kerby1998 unpinned this issue Aug 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants