
Question about scaling propr to very large data sets (Part 2) #24

Open
sapuizait opened this issue Jun 11, 2021 · 11 comments
Labels
question, helpful (This question has been marked as potentially helpful to others.)

Comments

@sapuizait

Hi all

First off, congrats on your wonderful software; I just read about it and it looks very promising.
I have a question/request.

I have two datasets that come from the same samples (the same fastq files) but were generated using different databases, and I want to compare the elements (features) of dataset A against the elements of dataset B only. This is simply because the project is otherwise not feasible.
Dataset A has 200k features and dataset B has 40k features, so I want to check for correlations of:
A[1] ~ B[1]
A[1] ~ B[2]
A[1] ~ B[3]
....
A[2] ~ B[1]
A[2] ~ B[2]
....
etc etc

Now normally (using the good ol' Spearman, apologies for speaking the name that shall not be spoken out loud) I would use something like this:
rho <- matrix(NA, nrow = ncol(x), ncol = ncol(y))
for (i in 1:ncol(x)) {
  for (j in 1:ncol(y)) {
    rho[i, j] <- cor.test(x[, i], y[, j], method = "spearman")$estimate
  }
}

(x and y are the two datasets)
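
I believe the same cross-matrix can also be computed in one call with base R's cor(), for what it's worth:

rho <- cor(x, y, method = "spearman")  # ncol(x) by ncol(y) matrix of Spearman correlations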

but when I tried
propr(x[1]~ y[2], ivar = NA)
or

propr(x[1], y[2], ivar = NA)

it does not seem to work... so my question is: is this possible at all?
PS: my data are already CLR-transformed, which is why I used 'ivar = NA'.

thanks in advance

@tpq
Owner

tpq commented Jun 12, 2021

Thanks for your interest in propr! It sounds like the data can be thought of as "multi-omics data" in that you have two views of the same subjects. I'll assume A and B are both compositional. Could you please have a look at #9 and let me know if you have any further questions? Best, tpq

@sapuizait
Author

Hey tpq
Thanks for your reply. Yes, the data are compositional, and zeros have been imputed using the zCompositions package.
I actually did look at the multi-omics thread, and I tried the example you suggested with 'data(iris)'.
However, as far as I can see, that solution works by merging the two datasets and comparing all features, meaning: if dataset A has 3 features and B has 2 features, then the comparisons will be A1 vs A2, A1 vs A3, A1 vs B1, A1 vs B2, A2 vs A3, A2 vs B1, A2 vs B2, A3 vs B1, A3 vs B2,
which is every possible comparison.
What I am looking for is a solution that will allow me to specify that I only want to compare features of dataset A with features of dataset B, but not the features of each dataset with each other.
Therefore the desired output would only include the comparisons:
A1 vs B1 and A1 vs B2,
A2 vs B1 and A2 vs B2 and finally
A3 vs B1 and A3 vs B2

Is that even possible or should I just try the #9 solution and then pick the pairs that I care about?

@tpq
Owner

tpq commented Jun 12, 2021

Ah, I see. Unfortunately, I have not made any functions for this case. The #9 solution would be the way to go. You may find the functions getMatrix() and getRatios() helpful for parsing the results. Though, I'm afraid you might run into performance issues with 240k features (unless you have a very big computer!).

You might want to try something like...

devtools::install_github("tpq/propr")
library(propr)
# CLR: log-transform each row and centre it on its own geometric mean
clr <- function(x) sweep(log(x), 1, rowMeans(log(x)), "-")

# Toy data: 100 samples, 3 features in A and 30 features in B
A <- clr(matrix(rpois(100*3, 100), 100, 3))
B <- clr(matrix(rpois(100*30, 100), 100, 30))

# One chunk per column of A: bind A[,col_j] to all of B and run propr on the chunk
for(col_j in 1:ncol(A)){
  REL_j <- cbind(A[,col_j], B)
  pr <- propr(REL_j, ivar = NA)   # data already CLR-transformed, so ivar = NA
  print(getMatrix(pr))            # square matrix of rho for this chunk
}

Each j-th element in the for loop would give you the proportionality between A[,j] and B. It'll also break down the 240k^2 results into 200k separate ~40k^2 results, which are easier to parallelize.
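
To push this further, here is a minimal sketch of how the chunks could be run in parallel and trimmed down to just the A-vs-B pairs (assuming the toy A and B above; mclapply comes from the base parallel package, and mc.cores is whatever your machine allows):

library(parallel)

# Each chunk: bind A[,j] to B, run propr, and keep only the first row of the
# rho matrix, i.e. the proportionality of A[,j] with each column of B
cross_rho <- mclapply(1:ncol(A), function(col_j){
  REL_j <- cbind(A[, col_j], B)
  pr <- propr(REL_j, ivar = NA)
  getMatrix(pr)[1, -1]
}, mc.cores = 2)

# Stack into an ncol(A) x ncol(B) matrix of A-vs-B rho values
cross_rho <- do.call(rbind, cross_rho)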

@sapuizait
Author

Great! Thanks a lot! I will give it a try and see how it goes. I'm running everything on a cluster, so hopefully it will be feasible!

@sapuizait
Author

Hey, works nicely, thanks a lot!
One quick question and then I'll close the topic:
I cannot use the FDR, not only because of the large dataset but also because I am primarily interested in the negative 'rho' values (and as far as I understand, the FDR is only computed for positive values).
I used to calculate p-values using a t-approximation from the correlation matrix, but I haven't tried to apply something like that here.
What is your opinion on this?
I am of course inclined towards dropping p-values and FDR completely and just presenting the strongest correlations.
Again, thanks for your help

@tpq
Owner

tpq commented Jun 15, 2021

Our reason for calculating FDR on only positive values is that the negative values can be difficult to interpret (see #4). I don't feel comfortable endorsing exact p-value calculation from rho using t-approximation, simply because I don't know enough to know whether the approach is valid. Permutation is computationally expensive, but also easy to implement. Have a look at propr:::updateCutoffs.propr source code for an example.

  for(cut in 1:nrow(FDR)){ # for each user-specified cutoff...

    # Count positives as rho > cutoff, cor > cutoff, phi < cutoff, phs < cutoff
    if(object@metric == "rho" | object@metric == "cor"){
      FDR[cut, "truecounts"] <- sum(object@results$propr > FDR[cut, "cutoff"])
    }else{ # phi & phs
      FDR[cut, "truecounts"] <- sum(object@results$propr < FDR[cut, "cutoff"])
    }

    FDR[cut, "FDR"] <- FDR[cut, "randcounts"] / FDR[cut, "truecounts"]
  }

You could, for example, compute a small null distribution of negative rho for each A[,j]. This would break up the FDR computations into parallelizable chunks.
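
For the negative direction specifically, a rough sketch of that idea (not the package's built-in machinery; A, B, and getMatrix are as in the toy example above, and the number of permutations is arbitrary):

# Null distribution for A[,j]: permute the sample order of A[,j] so that it is
# decoupled from B, then recompute its proportionality with every column of B
null_rho <- function(j, n_perm = 20){
  unlist(lapply(1:n_perm, function(k){
    REL_perm <- cbind(sample(A[, j]), B)
    getMatrix(propr(REL_perm, ivar = NA))[1, -1]
  }))
}

# Observed rho between A[,1] and B, and its permutation null
observed <- getMatrix(propr(cbind(A[, 1], B), ivar = NA))[1, -1]
permuted <- null_rho(1)

# Empirical FDR at an (arbitrary) negative cutoff:
# expected false discoveries divided by observed discoveries
cutoff <- -0.5
FDR <- mean(permuted < cutoff) / mean(observed < cutoff)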

@tpq tpq changed the title question about specifying pairs to examine Question about scaling propr to very large data sets Jun 15, 2021
@tpq tpq added the question and helpful labels Jun 15, 2021
@tpq tpq changed the title Question about scaling propr to very large data sets Question about scaling propr to very large data sets (Thread 2) Jun 15, 2021
@tpq tpq changed the title Question about scaling propr to very large data sets (Thread 2) Question about scaling propr to very large data sets (Part 2) Jun 15, 2021
@sapuizait
Author

I see. That is a bit unfortunate, as I am primarily interested in negative correlations. Is there any other metric, or any other way I could look for negative correlations besides the negative rho values? Or is there any way to exclude false positives?

@tpq
Owner

tpq commented Jun 15, 2021

I think the key issue is that the negative rho (like negative correlations) would be highly sensitive to the choice of the reference. They may mean something if you believe the CLR is a suitable normalization method. Without normalization, I do not know of any way to obtain true negative pairwise associations. In the case of multi-omics data analysis, however, you might be satisfied with knowing what associations exist with respect to the CLR centers. In this case, negative rho (or indeed negative Pearson) may hold some meaning to you. You might find our discussion of this topic helpful: https://www.nature.com/articles/s41592-020-01006-1

@tpq
Owner

tpq commented Jun 15, 2021

s41592-020-01006-1.pdf

@sapuizait
Author

OK, thanks! I'll look at it carefully tomorrow and see if I can figure something out!

@sapuizait
Author

Dear Thomas
I have looked at the paper you sent me and things are a bit clearer now. Thanks a lot for helping with this. I realized that a mistake I had made was to use the same normalization reference for both datasets, but I went back and fixed that. That way the datasets are completely independent, and comparing features from A to B should not cause any problems. As far as I can see, your Scientific Reports paper describes that weird false-positive negative-correlations effect when comparing features that come from the same dataset and have been corrected against the same geometric mean. I also now completely understand why the CLR has to be applied to A and B independently. Again, thanks!
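
For the record, the fix looks roughly like this, where countsA and countsB stand for my zero-imputed count tables (samples in rows; the clr helper is the one from your example above):

clr <- function(x) sweep(log(x), 1, rowMeans(log(x)), "-")

# Transform each dataset against its OWN geometric mean, independently
A <- clr(countsA)
B <- clr(countsB)

# ...then continue with the per-column propr loop, using ivar = NA as before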
