Bug in ses.pd function #17

rafmolven · 2019-03-19T08:14:20Z

Hi everyone,

I write to report a bug in the ses.pd function. As you may know, the arguments of ses.pd are passed through the function pd, which means that one can specify whether to include the distance from the most recent common ancestor of the species and the root of the tree (MRCA – root distance). The default setting (in the pd function) is "include.root = TRUE". However, I have noted that ses.pd computes null PD values without including the MRCA – root distance regardless of the logical value that is specified in the include.root argument. This is, if include.root = TRUE (default = TRUE), the observed PD will include the MRCA – root distance, but the null PD values will not. For example:

require(picante)
data(phylocom)
phylocom$sample -> Sample
phylocom$phylo -> Tree
prune.sample(Sample,Tree)->Tree_F
Sample <- Sample[,Tree_F$tip.label]
set.seed(12345) # fix the seed to make the analyses reproducible
ses.pd(Sample,Tree_F,null.model="taxa.labels",runs=999,include.root=FALSE) ->
No_root_1 # the MRCA – root distance should be excluded from all the computations
set.seed(12345) # back to the same seed
ses.pd(Sample,Tree_F,null.model="taxa.labels",runs=999,include.root=TRUE) ->
root_1 # the MRCA – root distance should be included in all the computations

No_root_1
root_1

A quick look to the results reveals that as expected, the observed PD values between "No_root_1" and "root_1" are different for communities "clump1" and "clump2a" (the only two communities where the root of the tree is not traversed by the minimum spanning path connecting the species in the phylogeny), because in "No_root_1" the include.root argument was set to FALSE. HOWEVER, the descriptors of the null distributions in "No_root_1" and "root_1" are identical, and this is because the null PD values are computed without including the MRCA – root distance in all cases regardless of the logical value that is specified in the include.root argument.

The good news are that this anomaly only significantly affects communities with low species richness (less than four species). For more details, you may take a look to the following link, which includes a preprint with detailed information on the issue (a step by step tutorial) along with a simulation exercise (which informs about the communities that may be affected) and two alternatives to fix the problem (all the R code is fully available in the Supplementary Material of the preprint): https://www.biorxiv.org/content/10.1101/579300v1

Rafael Molina Venegas

skembel · 2019-03-20T15:20:14Z

Thanks for reporting this. This is indeed a bug, I'll push an updated version of the package with this function fixed to CRAN soon, and make an announcement about it.

skembel · 2019-03-21T17:17:16Z

Hello,
this bug has been patched. The updated version of picante with this bug fixed (version 1.8) has been submitted to CRAN. ses.pd now functions correctly for both include.root=TRUE and include.root=FALSE arguments. Thanks for bringing this to my attention.

I remain perplexed about how this bug happened and would welcome any feedback or insights. All functions in picante were tested against various example data sets and I did not catch this error. Looking at the unpatched version of the code that caused the error

ses.pd <-
function (samp, tree, null.model = c("taxa.labels", "richness", "frequency", "sample.pool",
    "phylogeny.pool", "independentswap", "trialswap"), runs = 999, iterations = 1000, 
    ...)
{
    pd.obs <- as.vector(pd(samp, tree, ...)$PD)
    null.model <- match.arg(null.model)
    pd.rand <- switch(null.model,
    	taxa.labels = t(replicate(runs, as.vector(pd(samp, tipShuffle(tree), ...)$PD))),
    	richness = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="richness"), tree, ...)$PD))),
    	frequency = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="frequency"), tree, ...)$PD))),
    	sample.pool = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="richness"), tree, ...)$PD))),
    	phylogeny.pool = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="richness"),
    		tipShuffle(tree), ...)$PD))),
    	independentswap = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="independentswap", iterations), tree, ...)$PD))),
    	trialswap = t(replicate(runs, as.vector(pd(randomizeMatrix(samp,null.model="trialswap", iterations), tree, ...)$PD)))
    )
    pd.rand.mean <- apply(X = pd.rand, MARGIN = 2, FUN = mean, na.rm=TRUE)
    pd.rand.sd <- apply(X = pd.rand, MARGIN = 2, FUN = sd, na.rm=TRUE)
    pd.obs.z <- (pd.obs - pd.rand.mean)/pd.rand.sd
    pd.obs.rank <- apply(X = rbind(pd.obs, pd.rand), MARGIN = 2,
        FUN = rank)[1, ]
    pd.obs.rank <- ifelse(is.na(pd.rand.mean),NA,pd.obs.rank)
    data.frame(ntaxa=specnumber(samp),pd.obs, pd.rand.mean, pd.rand.sd, pd.obs.rank,
        pd.obs.z, pd.obs.p=pd.obs.rank/(runs+1),runs=runs, row.names = row.names(samp))
}

If ses.pd was called with the argument include.root=FALSE, the argument was passed to the pd function in the ellipsis ... object.
This worked correctly for the calculation of observed pd
pd.obs <- as.vector(pd(samp, tree, ...)$PD)
but then it was apparently not passed to the pd function during the computation of the randomized pd values, e.g.
taxa.labels = t(replicate(runs, as.vector(pd(samp, tipShuffle(tree), ...)$PD))),
resulted in a calculation with include.root=TRUE regardless of the argument supplied to ses.pd

The patched version of ses.pd avoids this issue by hard-coding the include.root argument and has been tested to work correctly.

rafmolven · 2019-03-22T08:50:56Z

Just a minor comment on Steve clarifications. The issue concerns the null PD values rather than the observed PD. As such, if include.root = TRUE within ses.pd, then the observed PD values will include the distance from the most recent common ancestor of the species in the communities and the root of the tree (as expected), BUT the null PD values will not (resulting in biases SES.PD values, particularly in low richness communities). However, when include.root = FALSE, then such distance is not included neither in the observed PD or the null PD values. Therefore, analyses conducted with include.root = FALSE are not affected, but note that the default setting of the function is include.root = TRUE.

skembel self-assigned this Mar 20, 2019

skembel added the bug label Mar 20, 2019

skembel pinned this issue Mar 21, 2019

skembel closed this as completed Mar 21, 2019

kerby1998 unpinned this issue Aug 29, 2023

skembel mentioned this issue Oct 4, 2024

PD.ses for low sample sizes #38

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in ses.pd function #17

Bug in ses.pd function #17

rafmolven commented Mar 19, 2019 •

edited

Loading

skembel commented Mar 20, 2019

skembel commented Mar 21, 2019

rafmolven commented Mar 22, 2019

Bug in ses.pd function #17

Bug in ses.pd function #17

Comments

rafmolven commented Mar 19, 2019 • edited Loading

skembel commented Mar 20, 2019

skembel commented Mar 21, 2019

rafmolven commented Mar 22, 2019

rafmolven commented Mar 19, 2019 •

edited

Loading