Limitation on the number of cells? #12
Comments
I just ran this same dataset with one sample removed (equating to ~92,000 cells) and it clustered successfully! Anything over 100,000 cells seems to trigger the failure.
Thanks for your question. Paging @vtraag in case it's an issue with the python implementation. If not, I will look into the R/Python interface. Just to let you know, upcoming changes (#1) use a different R/C implementation of leiden that doesn't rely on Python. This may address the issue, but I'm not sure.
I just ran into this issue as well. Using the igraph implementation avoids the same issue (and is faster), but the objective_function limitations (only CPM or modularity) are restrictive.
Thanks for reporting @natemiller, it's still on my to-do list, but last year I had a baby and changed jobs so it has been less of a priority recently. I hope you understand this is a difficult issue to reproduce with large data, and it is unclear whether it is a limitation of the R package, the python leidenalg module, or the reticulate package used to call it. If anyone has any ideas on how to address the issue, suggestions or PRs are welcome! :-) Please note the igraph 1.2.7 release on CRAN also has an R implementation of leiden, which has limited functionality but may not have this issue. I plan to migrate this package to calling that where possible, since performance is expected to improve.
There are no limits in the python leidenalg package itself. If you can simply export the network to some external file (in whatever format is convenient), I can check whether the problem also occurs when running leidenalg on it directly.
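One simple interchange format for such an export is a plain whitespace-separated edge list (the NCOL format), which python-igraph can read directly. A minimal sketch of the round trip — the igraph/leidenalg calls at the end are left as comments and assume those packages are installed, so only the file handling runs here:

```python
# Write an edge list to a plain text file, one "src dst" pair per line
# (the NCOL format understood by python-igraph's Graph.Read_Ncol).
import os
import tempfile

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

path = os.path.join(tempfile.mkdtemp(), "graph.ncol")
with open(path, "w") as f:
    for src, dst in edges:
        f.write(f"{src} {dst}\n")

# Read it back to confirm the round trip is lossless.
with open(path) as f:
    parsed = [tuple(int(tok) for tok in line.split()) for line in f]
assert parsed == edges

# With python-igraph and leidenalg installed, the exported file could then
# be clustered directly in Python (not executed in this sketch):
#   import igraph as ig, leidenalg as la
#   g = ig.Graph.Read_Ncol(path, directed=False)
#   part = la.find_partition(g, la.RBConfigurationVertexPartition)
```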
Thank you for the reply @vtraag.
Thanks for looking into the issue @vtraag. This is definitely one of the higher-priority issues for the R package. @natemiller I am glad to hear you were able to compute the results in Python. Looking at the logs reported by @eannaf above, your suggestion that it is an issue with passing edges to Python via reticulate appears to be correct. Note that reticulate does not support passing "igraph" class objects between R and Python, so it is necessary to use an adjacency matrix (satijalab/seurat#1645) or a list of edges: https://github.com/TomKellyGenetics/leiden/blob/master/R/py_objects.R#L68 The problem appears to occur when reticulate calls Rcpp. @natemiller I've noticed that my friend and Rcpp expert @teuder is your colleague. If you wish to get this working in an R-based workflow, perhaps Tsuda-san may be able to assist you.
@TomKellyGenetics Wow... small world! I'll check in with Tsuda-san. As you mentioned previously, the igraph implementation is useful depending on the partition you are using. And I was actually able to manually pass my R igraph to Python, apply leidenalg there, and then pass it back (with some effort). So there are alternatives.
I use leiden to cluster CyTOF datasets that are over 2M rows. I had this same error early on but it was a simple fix: I run `options(scipen = 999)` before clustering.

Without reading any source code, my hypothesis was (and is) that the vertices are labeled as numeric/integer values. When R reaches vertex 100000, that vertex is labeled as 1e+05; if you enter 100000 in the R console, you get 1e+05 back by default. It seems python doesn't have this scientific notation quirk. When this vertex is read elsewhere, that function/interpreter/compiler receives what it thinks is a non-numeric/non-integer value (hence the ValueError). So just run `options(scipen = 999)` before clustering to suppress scientific notation.

@eannaf Perhaps you already know this but

Here's some benchmarking I've done in the past on a CyTOF dataset of about 1M cells that compares single-thread to multi-thread (22 threads) performance while slowly increasing the number of columns (dimensions) passed to the search:
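The hypothesis above is easy to check from the Python side: `int()` rejects scientific notation outright, which would produce exactly the ValueError described, while a parser that routes through `float()` first would tolerate it. A small pure-Python illustration, independent of leidenalg:

```python
# R prints 100000 as "1e+05" by default; Python's int() rejects that form.
try:
    int("1e+05")
    raised = False
except ValueError:
    raised = True  # the same ValueError users would hit at vertex 100000

assert raised

# Python has no such quirk when parsing plain digit strings...
assert int("100000") == 100000

# ...and a defensive reader could coerce through float() instead:
assert int(float("1e+05")) == 100000
```

This also explains why disabling scientific notation on the R side (so vertex labels are always plain digits) works as a workaround.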
@jsim91 Thank you for sharing this! I'm still considering changes to the source code to avoid this, but it is great to know there is a straightforward (if not necessarily intuitive to other users) workaround that does not require it. You're right that R prints 100000 differently from smaller numbers, and it does not seem to be a limitation of the Python implementation, so it could be the interface from R struggling to interpret this. I'm hesitant to force changes to the user's options, but it seems it is also possible within R to disable this in a single instance.
Created on 2022-04-23 by the reprex package (v2.0.1)
Could the fix be as simple as using …?
Thanks for your advice! Leiden v0.3.10, which should resolve this, has been submitted to CRAN.
… larger input graphs) closes #12
Hi Tom,
I've used leiden on a number of clustering runs with great results! However, I'm trying to cluster ~110,000 cells from a CyTOF experiment with 36 channels, and I'm receiving an error on the number of vertices:
I am using a standard workflow that has been successful with smaller clustering runs:
Is there currently a limitation on the number of cells?
Here is the session info: