Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limitation on the number of cells? #12

Closed
EannaFennell opened this issue Jan 1, 2021 · 13 comments
Closed

Limitation on the number of cells? #12

EannaFennell opened this issue Jan 1, 2021 · 13 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@EannaFennell
Copy link

EannaFennell commented Jan 1, 2021

Hi Tom,

I've used leiden on a number of clustering runs with great results! However, I'm trying to cluster ~110,000 cells from a cyTOF experiment with 36 channels and I'm receiving an error on the number of vertices;

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: no such vertex: '1e+05' 
7.
stop(structure(list(message = "ValueError: no such vertex: '1e+05'", 
    call = py_call_impl(callable, dots$args, dots$keywords), 
    cppstack = NULL), class = c("Rcpp::exception", "C++Error", 
"error", "condition"))) 
6.
py_graph$add_edges(r_to_py(edgelist)) 
5.
make_py_graph.igraph(object, weights = weights) 
4.
make_py_graph(object, weights = weights) 
3.
leiden.igraph(object, partition_type = partition_type, weights = weights, 
    node_sizes = node_sizes, resolution_parameter = resolution_parameter, 
    seed = seed, n_iterations = n_iterations, degree_as_node_size = degree_as_node_size, 
    laplacian = laplacian) 
2.
leiden.Matrix(adjacency_matrix, resolution_parameter = clusterResolution) 
1.
leiden::leiden(adjacency_matrix, resolution_parameter = clusterResolution) 

I am using a standard workflow that has been successful with smaller clustering runs;

snn <- as.data.frame(t(RANN::nn2(dataToCluster, k = numNeighbours)$nn.idx))
edge_list <- tidyr::gather(snn)
  edge_list <- apply(edge_list, 2, as.numeric)
  graph <- igraph::graph_from_edgelist(edge_list, directed=FALSE)
  adjacency_matrix <- igraph::as_adjacency_matrix(graph)
partition <- leiden::leiden(adjacency_matrix, resolution_parameter = clusterResolution)

Is there currently a limitation on the number of cells?

Here is the session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_Ireland.1252  LC_CTYPE=English_Ireland.1252    LC_MONETARY=English_Ireland.1252
[4] LC_NUMERIC=C                     LC_TIME=English_Ireland.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] leiden_0.3.6                MISSILe_0.1.0               colorspace_1.4-1           
 [4] Rtsne_0.15                  uwot_0.1.8                  Matrix_1.2-18              
 [7] ggplot2_3.3.2               phonTools_0.2-2.1           purrr_0.3.4                
[10] ConsensusClusterPlus_1.52.0 FlowSOM_1.20.0              igraph_1.2.6               
[13] CATALYST_1.12.2             SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2
[16] DelayedArray_0.14.1         matrixStats_0.57.0          Biobase_2.48.0             
[19] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2             
[22] S4Vectors_0.26.1            BiocGenerics_0.34.0         dplyr_1.0.2                
[25] MASS_7.3-53                 colorRamps_2.3              gatepoints_0.1.3           
[28] flowCore_2.1.2

loaded via a namespace (and not attached):
  [1] reticulate_1.18           tidyselect_1.1.0          lme4_1.1-23              
  [4] grid_4.0.2                BiocParallel_1.22.0       devtools_2.3.2           
  [7] munsell_0.5.0             codetools_0.2-16          statmod_1.4.35           
 [10] withr_2.3.0               batchelor_1.4.0           knitr_1.30               
 [13] rstudioapi_0.11           GenomeInfoDbData_1.2.3    polyclip_1.10-0          
 [16] pheatmap_1.0.12           flowWorkspace_4.0.6       rprojroot_1.3-2          
 [19] vctrs_0.3.4               generics_0.0.2            TH.data_1.0-10           
 [22] xfun_0.18                 R6_2.4.1                  ggbeeswarm_0.6.0         
 [25] clue_0.3-57               rsvd_1.0.3                locfit_1.5-9.4           
 [28] bitops_1.0-6              assertthat_0.2.1          scales_1.1.1             
 [31] multcomp_1.4-14           beeswarm_0.2.3            gtable_0.3.0             
 [34] processx_3.4.4            RProtoBufLib_2.1.0        sandwich_3.0-0           
 [37] rlang_0.4.8               GlobalOptions_0.1.2       splines_4.0.2            
 [40] hexbin_1.28.1             yaml_2.2.1                reshape2_1.4.4           
 [43] abind_1.4-5               backports_1.1.10          RBGL_1.64.0              
 [46] usethis_1.6.3             tools_4.0.2               ellipsis_0.3.1           
 [49] RColorBrewer_1.1-2        sessioninfo_1.1.1         ggridges_0.5.2           
 [52] Rcpp_1.0.5                plyr_1.8.6                base64enc_0.1-3          
 [55] zlibbioc_1.34.0           RCurl_1.98-1.2            prettyunits_1.1.1
[58] ps_1.4.0                  GetoptLong_1.0.4          viridis_0.5.1            
 [61] cowplot_1.1.0             zoo_1.8-8                 haven_2.3.1              
 [64] ggrepel_0.8.2             cluster_2.1.0             fs_1.5.0                 
 [67] tinytex_0.26              magrittr_1.5              ncdfFlow_2.34.0          
 [70] data.table_1.13.0         magick_2.5.2              openxlsx_4.2.2           
 [73] circlize_0.4.10           RANN_2.6.1                mvtnorm_1.1-1            
 [76] pkgload_1.1.0             hms_0.5.3                 evaluate_0.14            
 [79] XML_3.99-0.5              rio_0.5.16                jpeg_0.1-8.1             
 [82] readxl_1.3.1              gridExtra_2.3             shape_1.4.5              
 [85] ggcyto_1.16.0             testthat_2.3.2            compiler_4.0.2           
 [88] scater_1.16.2             tibble_3.0.4              KernSmooth_2.23-17       
 [91] crayon_1.3.4              minqa_1.2.4               htmltools_0.5.0          
 [94] tidyr_1.1.2               RcppParallel_5.0.2        ComplexHeatmap_2.4.3     
 [97] rappdirs_0.3.1            boot_1.3-25               car_3.0-10               
[100] diffcyt_1.8.8             cli_2.1.0                 forcats_0.5.0            
[103] pkgconfig_2.0.3           foreign_0.8-80            xml2_1.3.2               
[106] vipor_0.4.5               XVector_0.28.0            drc_3.0-1                
[109] stringr_1.4.0             callr_3.5.1               digest_0.6.26            
[112] tsne_0.1-3                graph_1.66.0              rmarkdown_2.4            
[115] cellranger_1.1.0          edgeR_3.30.3              DelayedMatrixStats_1.10.1
[118] curl_4.3                  gtools_3.8.2              rjson_0.2.20             
[121] nloptr_1.2.2.2            lifecycle_0.2.0           nlme_3.1-148
[124] jsonlite_1.7.1            carData_3.0-4             BiocNeighbors_1.6.0      
[127] desc_1.2.0                viridisLite_0.3.0         limma_3.44.3             
[130] fansi_0.4.1               pillar_1.4.6              lattice_0.20-41          
[133] plotrix_3.7-8             pkgbuild_1.1.0            survival_3.1-12          
[136] remotes_2.2.0             glue_1.4.2                zip_2.1.1                
[139] png_0.1-7                 Rgraphviz_2.32.0          stringi_1.5.3            
[142] nnls_1.4                  BiocSingular_1.4.0        CytoML_2.0.5             
[145] latticeExtra_0.6-29       memoise_1.1.0             cytolib_2.1.18           
[148] irlba_2.3.3
@EannaFennell
Copy link
Author

I just ran this same dataset with one sample removed (equating to ~92,000 cells) and it clustered successfully! Anything over 100,000 seems to be a limitation.

@TomKellyGenetics
Copy link
Owner

Thanks for your question. Paging @vtraag in case it's an issue with the python implementation. If not, I will look into the R/Python interface.

Just to let you know, upcoming changes (#1) use a different R/C implementation of leiden that doesn't rely on Python. This may address the issue I'm not sure.

@TomKellyGenetics TomKellyGenetics added the question Further information is requested label Jan 18, 2021
@natemiller
Copy link

I just ran into this issue as well. Using the igraph implementation avoids have the same issue (and is faster), but the objective_function limitations (only CPM or modularity) are restrictive.
Were you able to determine if the following error represents an issue with the python implementation or the R/Python interface?

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: no such vertex: '1e+05' 

@TomKellyGenetics
Copy link
Owner

Thanks for reporting @natemiller, it's still on my to-do list but last year I had a baby and changed jobs so it has been less of a priority recently. I hope you understand this is a difficult issue to reproduce with large data and it is unclear if it is a limitation with the R package, the python leidenalg module, or the reticulate package used to call it.

If anyone has any ideas how to address the issue, suggestions or PRs are welcome! :-)

Please note the igraph 1.2.7 release on CRAN also has an R implementation of leiden which has limited functionality but may not have this issue. I have planned to migrate this package to calling that where possible since the performance is expected to improve.

@vtraag
Copy link

vtraag commented Jan 21, 2022

There are no limits to the python leidenalg package that I know of, besides what can be represented as integers (igraph itself is in the process of transitioning towards 64-bit integers), and I doubt that these limits play a role here. By the looks of it, it may involve something with vertices being represented as doubles, I'm not sure. That would suggest it has something to do with the Python / R interface.

If you can simply export the network to some external file (in whatever format igraph can read it), then I'd be happy quickly test it in the leidenalg package itself. Feel free to share publicly or privately, whatever you are comfortable with.

@natemiller
Copy link

Thank you for the reply @vtraag .
I actually just passed the graph from R to python. As you suggested I was able to run leidenalg within python without issues on the same graph that failed previously, using RBConfigurationVertexPartition . It does appear to be something at the Python/R interface.

@TomKellyGenetics TomKellyGenetics added the help wanted Extra attention is needed label Jan 23, 2022
@TomKellyGenetics
Copy link
Owner

TomKellyGenetics commented Jan 23, 2022

Thanks for looking into the issue @vtraag. This is definitely one of the higher priority issues for the R package. @natemiller I am glad to hear you were able to compute the results in Python.

Looking at the logs reported by @eannaf above, your suggestion that it is an issue with passing edges to Python via reticulate appears to be correct. Note that reticulate does not support passing "igraph" class objects between R and python so it necessary to use an adjacency matrix (satijalab/seurat#1645) or a list of edges: https://github.com/TomKellyGenetics/leiden/blob/master/R/py_objects.R#L68

The problem appears to occurs when reticulate calls Rcpp. @natemiller I've noticed that my friend and Rcpp expert @teuder is your colleague. If you wish to get this working in an R based workflow perhaps Tsuda-san may be able to assist you.

@natemiller
Copy link

@TomKellyGenetics Wow... small world! I'll check in with Tsuda-san.

As you mentioned previously the igraph implementation is useful depending on the partition you are using. And I was actually able to manually pass my R igraph to Python, apply leidenalg there, and then pass it back (with some effort). So there are alternatives.
Thank you for your efforts.

@jsim91
Copy link

jsim91 commented Apr 23, 2022

I use leiden to cluster cytof datasets that are over 2M rows. I had this same error early on but it was a simple fix. I run options(scipen = 1000000000) before clustering with leiden. This raises the scientific notation threshold.

Without reading any source code, my hypothesis was (and is) that the vertices are labeled as numeric/integer values. When R reaches the vertex 100000, that vertex is labeled as 1e+05. If you enter 100000 in the R console, you get 1e+05 back by default. It seems python doesn't have this scientific notation quirk. When this vertex is read elsewhere, that function/interpreter/compiler receives what it thinks is a non-numeric/non-integer value (hence the ValueError).

So just run options(scipen = 1000000000) before you cluster and you should be good to go!

@eannaf Perhaps you already know this but RANN::nn2 can be parallelized with the future.apply package to vastly increase the speed of the neighbor-finding step. If your datasets continue to grow (and if you have enough RAM), it will save you a lot of time.

Here's some benchmarking I've done in the past on a cytof dataset of about 1M cells that compares single thread to multi-thread (22 threads) performance while slowly increasing the number of columns (dimensions) passed to the search:

image

@TomKellyGenetics
Copy link
Owner

@jsim91 Thank you for sharing this! I'm still considering changes to the source code to avoid this but it is great to know there is a straight-forward (if no necessarily intuitive to other users) to implement workaround that does not require it.

You're right that R handles printing 100000 differently to lower numbers and it does not seem to be a limitation with the Python implementation so it could be the interface from R struggling to interpret this. I'm hesitant to force changes to the users options but it seems it is also possible within R to disable this in one instance.

99999
#> [1] 99999
100000
#> [1] 1e+05
1000000
#> [1] 1e+06
as.numeric(100000)
#> [1] 1e+05
format(100000, scientific = FALSE)
#> [1] "100000"
options(scipen = 1000000000)
100000
#> [1] 100000
1000000
#> [1] 1000000

Created on 2022-04-23 by the reprex package (v2.0.1)

@jsim91
Copy link

jsim91 commented Apr 24, 2022

Could the fix be as simple as using on.exit as described in the accepted answer here: https://stackoverflow.com/questions/35584035/changing-options-in-a-function-environment-without-changing-options-in-globa

@TomKellyGenetics
Copy link
Owner

TomKellyGenetics commented Apr 25, 2022

Thanks for your advice! Leiden v0.3.10 which should resolve this has been submitted to CRAN.

@TomKellyGenetics
Copy link
Owner

Leiden 0.3.10 is accepted by CRAN and is now available to install. Sorry for the delays due to issues discussed in #20.

Thanks for your help @jsim91, I really appreciate your input.

TomKellyGenetics added a commit that referenced this issue May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants