Limitation on the number of cells? #12

EannaFennell · 2021-01-01T12:44:32Z

Hi Tom,

I've used leiden on a number of clustering runs with great results! However, I'm trying to cluster ~110,000 cells from a cyTOF experiment with 36 channels and I'm receiving an error on the number of vertices;

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: no such vertex: '1e+05' 
7.
stop(structure(list(message = "ValueError: no such vertex: '1e+05'", 
    call = py_call_impl(callable, dots$args, dots$keywords), 
    cppstack = NULL), class = c("Rcpp::exception", "C++Error", 
"error", "condition"))) 
6.
py_graph$add_edges(r_to_py(edgelist)) 
5.
make_py_graph.igraph(object, weights = weights) 
4.
make_py_graph(object, weights = weights) 
3.
leiden.igraph(object, partition_type = partition_type, weights = weights, 
    node_sizes = node_sizes, resolution_parameter = resolution_parameter, 
    seed = seed, n_iterations = n_iterations, degree_as_node_size = degree_as_node_size, 
    laplacian = laplacian) 
2.
leiden.Matrix(adjacency_matrix, resolution_parameter = clusterResolution) 
1.
leiden::leiden(adjacency_matrix, resolution_parameter = clusterResolution)

I am using a standard workflow that has been successful with smaller clustering runs;

snn <- as.data.frame(t(RANN::nn2(dataToCluster, k = numNeighbours)$nn.idx))
edge_list <- tidyr::gather(snn)
  edge_list <- apply(edge_list, 2, as.numeric)
  graph <- igraph::graph_from_edgelist(edge_list, directed=FALSE)
  adjacency_matrix <- igraph::as_adjacency_matrix(graph)
partition <- leiden::leiden(adjacency_matrix, resolution_parameter = clusterResolution)

Is there currently a limitation on the number of cells?

Here is the session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_Ireland.1252  LC_CTYPE=English_Ireland.1252    LC_MONETARY=English_Ireland.1252
[4] LC_NUMERIC=C                     LC_TIME=English_Ireland.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] leiden_0.3.6                MISSILe_0.1.0               colorspace_1.4-1           
 [4] Rtsne_0.15                  uwot_0.1.8                  Matrix_1.2-18              
 [7] ggplot2_3.3.2               phonTools_0.2-2.1           purrr_0.3.4                
[10] ConsensusClusterPlus_1.52.0 FlowSOM_1.20.0              igraph_1.2.6               
[13] CATALYST_1.12.2             SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.2
[16] DelayedArray_0.14.1         matrixStats_0.57.0          Biobase_2.48.0             
[19] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2             
[22] S4Vectors_0.26.1            BiocGenerics_0.34.0         dplyr_1.0.2                
[25] MASS_7.3-53                 colorRamps_2.3              gatepoints_0.1.3           
[28] flowCore_2.1.2

loaded via a namespace (and not attached):
  [1] reticulate_1.18           tidyselect_1.1.0          lme4_1.1-23              
  [4] grid_4.0.2                BiocParallel_1.22.0       devtools_2.3.2           
  [7] munsell_0.5.0             codetools_0.2-16          statmod_1.4.35           
 [10] withr_2.3.0               batchelor_1.4.0           knitr_1.30               
 [13] rstudioapi_0.11           GenomeInfoDbData_1.2.3    polyclip_1.10-0          
 [16] pheatmap_1.0.12           flowWorkspace_4.0.6       rprojroot_1.3-2          
 [19] vctrs_0.3.4               generics_0.0.2            TH.data_1.0-10           
 [22] xfun_0.18                 R6_2.4.1                  ggbeeswarm_0.6.0         
 [25] clue_0.3-57               rsvd_1.0.3                locfit_1.5-9.4           
 [28] bitops_1.0-6              assertthat_0.2.1          scales_1.1.1             
 [31] multcomp_1.4-14           beeswarm_0.2.3            gtable_0.3.0             
 [34] processx_3.4.4            RProtoBufLib_2.1.0        sandwich_3.0-0           
 [37] rlang_0.4.8               GlobalOptions_0.1.2       splines_4.0.2            
 [40] hexbin_1.28.1             yaml_2.2.1                reshape2_1.4.4           
 [43] abind_1.4-5               backports_1.1.10          RBGL_1.64.0              
 [46] usethis_1.6.3             tools_4.0.2               ellipsis_0.3.1           
 [49] RColorBrewer_1.1-2        sessioninfo_1.1.1         ggridges_0.5.2           
 [52] Rcpp_1.0.5                plyr_1.8.6                base64enc_0.1-3          
 [55] zlibbioc_1.34.0           RCurl_1.98-1.2            prettyunits_1.1.1
[58] ps_1.4.0                  GetoptLong_1.0.4          viridis_0.5.1            
 [61] cowplot_1.1.0             zoo_1.8-8                 haven_2.3.1              
 [64] ggrepel_0.8.2             cluster_2.1.0             fs_1.5.0                 
 [67] tinytex_0.26              magrittr_1.5              ncdfFlow_2.34.0          
 [70] data.table_1.13.0         magick_2.5.2              openxlsx_4.2.2           
 [73] circlize_0.4.10           RANN_2.6.1                mvtnorm_1.1-1            
 [76] pkgload_1.1.0             hms_0.5.3                 evaluate_0.14            
 [79] XML_3.99-0.5              rio_0.5.16                jpeg_0.1-8.1             
 [82] readxl_1.3.1              gridExtra_2.3             shape_1.4.5              
 [85] ggcyto_1.16.0             testthat_2.3.2            compiler_4.0.2           
 [88] scater_1.16.2             tibble_3.0.4              KernSmooth_2.23-17       
 [91] crayon_1.3.4              minqa_1.2.4               htmltools_0.5.0          
 [94] tidyr_1.1.2               RcppParallel_5.0.2        ComplexHeatmap_2.4.3     
 [97] rappdirs_0.3.1            boot_1.3-25               car_3.0-10               
[100] diffcyt_1.8.8             cli_2.1.0                 forcats_0.5.0            
[103] pkgconfig_2.0.3           foreign_0.8-80            xml2_1.3.2               
[106] vipor_0.4.5               XVector_0.28.0            drc_3.0-1                
[109] stringr_1.4.0             callr_3.5.1               digest_0.6.26            
[112] tsne_0.1-3                graph_1.66.0              rmarkdown_2.4            
[115] cellranger_1.1.0          edgeR_3.30.3              DelayedMatrixStats_1.10.1
[118] curl_4.3                  gtools_3.8.2              rjson_0.2.20             
[121] nloptr_1.2.2.2            lifecycle_0.2.0           nlme_3.1-148
[124] jsonlite_1.7.1            carData_3.0-4             BiocNeighbors_1.6.0      
[127] desc_1.2.0                viridisLite_0.3.0         limma_3.44.3             
[130] fansi_0.4.1               pillar_1.4.6              lattice_0.20-41          
[133] plotrix_3.7-8             pkgbuild_1.1.0            survival_3.1-12          
[136] remotes_2.2.0             glue_1.4.2                zip_2.1.1                
[139] png_0.1-7                 Rgraphviz_2.32.0          stringi_1.5.3            
[142] nnls_1.4                  BiocSingular_1.4.0        CytoML_2.0.5             
[145] latticeExtra_0.6-29       memoise_1.1.0             cytolib_2.1.18           
[148] irlba_2.3.3

The text was updated successfully, but these errors were encountered:

EannaFennell · 2021-01-01T14:34:17Z

I just ran this same dataset with one sample removed (equating to ~92,000 cells) and it clustered successfully! Anything over 100,000 seems to be a limitation.

TomKellyGenetics · 2021-01-18T04:54:47Z

Thanks for your question. Paging @vtraag in case it's an issue with the python implementation. If not, I will look into the R/Python interface.

Just to let you know, upcoming changes (#1) use a different R/C implementation of leiden that doesn't rely on Python. This may address the issue I'm not sure.

natemiller · 2022-01-20T23:45:03Z

I just ran into this issue as well. Using the igraph implementation avoids have the same issue (and is faster), but the objective_function limitations (only CPM or modularity) are restrictive.
Were you able to determine if the following error represents an issue with the python implementation or the R/Python interface?

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  ValueError: no such vertex: '1e+05'

TomKellyGenetics · 2022-01-21T14:39:18Z

Thanks for reporting @natemiller, it's still on my to-do list but last year I had a baby and changed jobs so it has been less of a priority recently. I hope you understand this is a difficult issue to reproduce with large data and it is unclear if it is a limitation with the R package, the python leidenalg module, or the reticulate package used to call it.

If anyone has any ideas how to address the issue, suggestions or PRs are welcome! :-)

Please note the igraph 1.2.7 release on CRAN also has an R implementation of leiden which has limited functionality but may not have this issue. I have planned to migrate this package to calling that where possible since the performance is expected to improve.

vtraag · 2022-01-21T16:04:51Z

There are no limits to the python leidenalg package that I know of, besides what can be represented as integers (igraph itself is in the process of transitioning towards 64-bit integers), and I doubt that these limits play a role here. By the looks of it, it may involve something with vertices being represented as doubles, I'm not sure. That would suggest it has something to do with the Python / R interface.

If you can simply export the network to some external file (in whatever format igraph can read it), then I'd be happy quickly test it in the leidenalg package itself. Feel free to share publicly or privately, whatever you are comfortable with.

natemiller · 2022-01-21T19:26:31Z

Thank you for the reply @vtraag .
I actually just passed the graph from R to python. As you suggested I was able to run leidenalg within python without issues on the same graph that failed previously, using RBConfigurationVertexPartition . It does appear to be something at the Python/R interface.

TomKellyGenetics · 2022-01-23T23:49:28Z

Thanks for looking into the issue @vtraag. This is definitely one of the higher priority issues for the R package. @natemiller I am glad to hear you were able to compute the results in Python.

Looking at the logs reported by @eannaf above, your suggestion that it is an issue with passing edges to Python via reticulate appears to be correct. Note that reticulate does not support passing "igraph" class objects between R and python so it necessary to use an adjacency matrix (satijalab/seurat#1645) or a list of edges: https://github.com/TomKellyGenetics/leiden/blob/master/R/py_objects.R#L68

The problem appears to occurs when reticulate calls Rcpp. @natemiller I've noticed that my friend and Rcpp expert @teuder is your colleague. If you wish to get this working in an R based workflow perhaps Tsuda-san may be able to assist you.

natemiller · 2022-01-24T19:27:45Z

@TomKellyGenetics Wow... small world! I'll check in with Tsuda-san.

As you mentioned previously the igraph implementation is useful depending on the partition you are using. And I was actually able to manually pass my R igraph to Python, apply leidenalg there, and then pass it back (with some effort). So there are alternatives.
Thank you for your efforts.

jsim91 · 2022-04-23T00:18:34Z

I use leiden to cluster cytof datasets that are over 2M rows. I had this same error early on but it was a simple fix. I run options(scipen = 1000000000) before clustering with leiden. This raises the scientific notation threshold.

Without reading any source code, my hypothesis was (and is) that the vertices are labeled as numeric/integer values. When R reaches the vertex 100000, that vertex is labeled as 1e+05. If you enter 100000 in the R console, you get 1e+05 back by default. It seems python doesn't have this scientific notation quirk. When this vertex is read elsewhere, that function/interpreter/compiler receives what it thinks is a non-numeric/non-integer value (hence the ValueError).

So just run options(scipen = 1000000000) before you cluster and you should be good to go!

@eannaf Perhaps you already know this but RANN::nn2 can be parallelized with the future.apply package to vastly increase the speed of the neighbor-finding step. If your datasets continue to grow (and if you have enough RAM), it will save you a lot of time.

Here's some benchmarking I've done in the past on a cytof dataset of about 1M cells that compares single thread to multi-thread (22 threads) performance while slowly increasing the number of columns (dimensions) passed to the search:

TomKellyGenetics · 2022-04-23T11:31:21Z

@jsim91 Thank you for sharing this! I'm still considering changes to the source code to avoid this but it is great to know there is a straight-forward (if no necessarily intuitive to other users) to implement workaround that does not require it.

You're right that R handles printing 100000 differently to lower numbers and it does not seem to be a limitation with the Python implementation so it could be the interface from R struggling to interpret this. I'm hesitant to force changes to the users options but it seems it is also possible within R to disable this in one instance.

99999
#> [1] 99999
100000
#> [1] 1e+05
1000000
#> [1] 1e+06
as.numeric(100000)
#> [1] 1e+05
format(100000, scientific = FALSE)
#> [1] "100000"
options(scipen = 1000000000)
100000
#> [1] 100000
1000000
#> [1] 1000000

^{Created on 2022-04-23 by the reprex package (v2.0.1)}

jsim91 · 2022-04-24T03:14:04Z

Could the fix be as simple as using on.exit as described in the accepted answer here: https://stackoverflow.com/questions/35584035/changing-options-in-a-function-environment-without-changing-options-in-globa

TomKellyGenetics · 2022-04-25T12:00:42Z

Thanks for your advice! Leiden v0.3.10 which should resolve this has been submitted to CRAN.

TomKellyGenetics · 2022-04-28T05:20:06Z

Leiden 0.3.10 is accepted by CRAN and is now available to install. Sorry for the delays due to issues discussed in #20.

Thanks for your help @jsim91, I really appreciate your input.

… larger input graphs) closes #12

TomKellyGenetics added the question Further information is requested label Jan 18, 2021

TomKellyGenetics added the help wanted Extra attention is needed label Jan 23, 2022

Jincheng2009 mentioned this issue Feb 8, 2022

Error in installing ‘leiden’ package #19

Open

TomKellyGenetics closed this as completed in a4f00b4 Apr 25, 2022

TomKellyGenetics added a commit that referenced this issue May 9, 2022

disable scientific notation when printing numerals above 10^5 (allows…

72b7d5b

… larger input graphs) closes #12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limitation on the number of cells? #12

Limitation on the number of cells? #12

EannaFennell commented Jan 1, 2021 •

edited

Loading

EannaFennell commented Jan 1, 2021

TomKellyGenetics commented Jan 18, 2021

natemiller commented Jan 20, 2022

TomKellyGenetics commented Jan 21, 2022

vtraag commented Jan 21, 2022

natemiller commented Jan 21, 2022

TomKellyGenetics commented Jan 23, 2022 •

edited

Loading

natemiller commented Jan 24, 2022

jsim91 commented Apr 23, 2022 •

edited

Loading

TomKellyGenetics commented Apr 23, 2022

jsim91 commented Apr 24, 2022

TomKellyGenetics commented Apr 25, 2022 •

edited

Loading

TomKellyGenetics commented Apr 28, 2022

Limitation on the number of cells? #12

Limitation on the number of cells? #12

Comments

EannaFennell commented Jan 1, 2021 • edited Loading

EannaFennell commented Jan 1, 2021

TomKellyGenetics commented Jan 18, 2021

natemiller commented Jan 20, 2022

TomKellyGenetics commented Jan 21, 2022

vtraag commented Jan 21, 2022

natemiller commented Jan 21, 2022

TomKellyGenetics commented Jan 23, 2022 • edited Loading

natemiller commented Jan 24, 2022

jsim91 commented Apr 23, 2022 • edited Loading

TomKellyGenetics commented Apr 23, 2022

jsim91 commented Apr 24, 2022

TomKellyGenetics commented Apr 25, 2022 • edited Loading

TomKellyGenetics commented Apr 28, 2022

EannaFennell commented Jan 1, 2021 •

edited

Loading

TomKellyGenetics commented Jan 23, 2022 •

edited

Loading

jsim91 commented Apr 23, 2022 •

edited

Loading

TomKellyGenetics commented Apr 25, 2022 •

edited

Loading