Error in rgcca_stability with null sd variables in the resampling and rows of missing values #84

Open
EGoujon opened this issue Aug 9, 2024 · 4 comments

Comments

@EGoujon
Collaborator

EGoujon commented Aug 9, 2024

Hi Team RGCCA,
I encountered the following error when using rgcca_stability:

Bootstrap samples sanity check...OK 
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=09s  
Error in x[, y, drop = FALSE] : subscript out of bounds

Here is an example to reproduce the error:

set.seed(6)
lambdas <- runif(1000, 0.001, 0.05)
# Generate data with many zeros (similar to RNA-seq count data)
data <- apply(matrix(lambdas), 1, FUN = function(x) {rpois(n = 100, lambda = x)})
data <- rbind(data, NA, NA, NA, NA, NA) # add five full rows of NA

rgcca_res <- rgcca(blocks = list(data,
                                 rnorm(n = 105)),
                   response = 2,
                   method = 'sgcca',
                   sparsity = c(0.3, 1))

stability_res = rgcca_stability(rgcca_res, n_boot = 10)

While reproducing the error, I found that it only occurs when some rows consist entirely of missing values (which can happen when blocks are not observed on fully overlapping sets of individuals). In this scenario, the error message is unclear, which makes the origin of the error hard for the user to identify. By contrast, when several variables have a null sd in the bootstrap samples but there are no full rows of missing data, the case is handled correctly and a clear, informative message is displayed. The underlying issue could be in rgcca_bootstrap_k, or it could perhaps be caught earlier in generate_resampling.
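For readers following along, here is a minimal base R sketch of why bootstrap samples of this kind of data end up with null-sd variables. It does not call any RGCCA function: `null_sd_columns` is a hypothetical helper introduced only for illustration, and the data simply mimic the reproduction example.

```r
set.seed(6)
lambdas <- runif(1000, 0.001, 0.05)

# 100 x 1000 matrix of sparse counts, then five full rows of NA
x <- sapply(lambdas, function(l) rpois(n = 100, lambda = l))
x <- rbind(x, matrix(NA, nrow = 5, ncol = ncol(x)))

# Hypothetical helper (not part of RGCCA): indices of the columns that
# are constant, or not estimable, within the resampled rows
null_sd_columns <- function(m, rows) {
  sds <- apply(m[rows, , drop = FALSE], 2, sd, na.rm = TRUE)
  which(is.na(sds) | sds == 0)
}

# Three bootstrap draws of row indices, sampled with replacement
boot_rows <- replicate(3, sample(nrow(x), replace = TRUE), simplify = FALSE)
dropped <- lapply(boot_rows, null_sd_columns, m = x)

# Number of null-sd columns per bootstrap sample
lengths(dropped)
```

Each draw can flag a different set of columns, which is how index bookkeeping across bootstrap samples can go out of sync.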

Do you think this type of error could be caught to avoid misunderstandings?
Thank you :)
Elen

@GFabien
Collaborator

GFabien commented Nov 10, 2024

Hi Elen!

Thank you for reporting this issue!

Could you try the branch https://github.com/rgcca-factory/RGCCA/tree/fix_extra_index_in_keepVar and tell me if it's enough? I don't think we can really be robust to this problem, but we can definitely help the user not to get lost when this happens.

Best,
Fabien

@GFabien
Collaborator

GFabien commented Nov 11, 2024

Hi @Tenenhaus, I thought about it and it relates to some other problems I am facing with TGCCA.

The core of the problem is that we remove variables with null variance since they will not contribute to the objective function, and we might get into trouble if we try to scale such variables. I don't think we really need to remove those variables: they will get zeros in the associated weight vectors anyway. To handle the scaling, we can replace a null standard deviation with a small epsilon to avoid numerical problems; since the variables are centered, the result is 0 / epsilon = 0 in any case.
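The epsilon idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used in RGCCA: `scale_with_epsilon` and the threshold `eps = 1e-8` are hypothetical choices.

```r
# Hypothetical helper (not the RGCCA implementation): center each column,
# then divide by its standard deviation, substituting eps when the
# standard deviation is null or cannot be estimated
scale_with_epsilon <- function(x, eps = 1e-8) {
  centered <- sweep(x, 2, colMeans(x, na.rm = TRUE), "-")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  sds[is.na(sds) | sds < eps] <- eps
  sweep(centered, 2, sds, "/")
}

x <- cbind(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5))
scaled <- scale_with_epsilon(x)

# The constant column is kept and contains only zeros: (5 - 5) / eps
scaled[, "b"]
```

The output keeps the same columns as the input, so every bootstrap sample would share the same set of variables.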

It would also solve other problems, like defining what a constant variable is for TGCCA or multigroup RGCCA, having bootstrap samples with different variables, a sparsity constant that depends on the number of non-constant variables instead of the total number of variables, and outputs having a different number of variables than the inputs. What do you think about it?
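To illustrate the sparsity point with numbers, here is a small sketch. It assumes the L1 bound on a block weight vector takes the form sparsity * sqrt(p), with p the number of variables in the block; `l1_bound` is a hypothetical helper and the count of 400 non-constant variables is made up for the example.

```r
# Assumed form of the L1 constraint: sparsity * sqrt(number of variables)
l1_bound <- function(sparsity, p) sparsity * sqrt(p)

p_total <- 1000        # variables supplied by the user
p_non_constant <- 400  # hypothetical count left after dropping constant ones

bound_all <- l1_bound(0.3, p_total)          # bound if all variables are kept
bound_kept <- l1_bound(0.3, p_non_constant)  # bound if constant ones are removed

c(all = bound_all, kept = bound_kept)
```

Under this assumption, the same sparsity value yields a different effective constraint depending on how many variables survive the filtering, which differs from one bootstrap sample to the next.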

Best,
Fabien

@EGoujon
Collaborator Author

EGoujon commented Nov 25, 2024

Hi Fabien!
I tried the new branch and saw the error message you added. It's clear and informative, thank you!
Best,
Elen

@GFabien
Collaborator

GFabien commented Dec 1, 2024

Hi Elen!
Could you try out this branch: https://github.com/rgcca-factory/RGCCA/tree/tgcca and see if the results seem fine to you?
We decided to stop removing variables with a null standard deviation since they will have zero weights. As a result, bootstrap samples can no longer end up with fewer variables than the original data.
Best,
Fabien
