Question about interpreting Diffusion Maps on toy dataset? #25
Comments
Hi, sorry, I forgot to respond to this. The nontrivial eigenvalues are negative because we use the convention of looking at the eigenvalues of the transition rate matrix. This matrix has the advantage that it converges to a fixed limit as we decrease epsilon. The dominant eigenfunction, which should have an eigenvalue of 0, should then simply be a vector of all ones: this corresponds to the fact that the rows of a transition rate matrix sum to zero. As such, it doesn't tell you anything meaningful about the data. Let me know if that helps :-)
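To make the point about the transition rate matrix concrete, here is a minimal NumPy sketch. It is not PyDiffMap's implementation: the Gaussian kernel form exp(-d^2 / (4 * epsilon)), the construction L = (P - I) / epsilon, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))          # toy data: one connected blob

epsilon = 1.0                          # bandwidth (assumed kernel convention below)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (4.0 * epsilon))

d = K.sum(axis=1)
P = K / d[:, None]                     # row-stochastic transition matrix
L = (P - np.eye(len(X))) / epsilon     # transition rate matrix (assumed form)
row_sums = L.sum(axis=1)               # each row sums to (numerically) zero

# P is similar to the symmetric matrix S = D^{-1/2} K D^{-1/2},
# so we can use a symmetric eigensolver and map the eigenvectors back.
S = K / np.sqrt(np.outer(d, d))
mu, V = np.linalg.eigh(S)              # ascending order
mu, V = mu[::-1], V[:, ::-1]           # largest first
evals = (mu - 1.0) / epsilon           # eigenvalues of L: 0, then negative
evecs = V / np.sqrt(d)[:, None]        # right eigenvectors of P and L

print("largest eigenvalue of L:", evals[0])
print("spread of dominant eigenvector:", np.ptp(evecs[:, 0]))
print("next eigenvalues:", evals[1:4])
```

In this sketch the dominant eigenvalue is zero up to rounding, its eigenvector is constant across all points, and every other eigenvalue is negative, which matches the explanation above.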
This does help a bit, thank you! Does PyDiffMap drop this uninformative dimension in the backend, or should I do this post hoc?
By default, we drop the dominant eigenfunction here. If, after this drop, you still have another eigenvalue that is also 0, that suggests that your transition rate matrix is effectively disconnected: you have two large clusters in your data. You can verify this by checking whether the resulting coordinate is a constant function. The plot you provided suggests that you do indeed have highly separated clusters. If you don't want disjoint clusters, you can consider increasing the value of the bandwidth.
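The disconnected-cluster diagnostic can be sketched in plain NumPy as follows. This is an illustration under assumptions, not PyDiffMap code: the helper `leading_spectrum`, the kernel form exp(-d^2 / (4 * epsilon)), the tolerance, and the toy data are all hypothetical choices.

```python
import numpy as np

def leading_spectrum(X, epsilon, n_evecs=3):
    """Leading eigenpairs of an assumed rate matrix L = (P - I) / epsilon."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (4.0 * epsilon))
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))            # symmetric, same spectrum as P
    mu, V = np.linalg.eigh(S)                  # ascending order
    mu, V = mu[::-1][:n_evecs], V[:, ::-1][:, :n_evecs]
    return (mu - 1.0) / epsilon, V / np.sqrt(d)[:, None]

rng = np.random.default_rng(0)
left = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(40, 2))
right = rng.normal(loc=(10.0, 0.0), scale=0.5, size=(40, 2))
X = np.vstack([left, right])                   # two well-separated clusters

tol = 1e-6                                     # what counts as "eigenvalue is 0"
small_evals, small_evecs = leading_spectrum(X, epsilon=0.5)
large_evals, large_evecs = leading_spectrum(X, epsilon=25.0)

n_zero_small = int(np.sum(np.abs(small_evals) < tol))
n_zero_large = int(np.sum(np.abs(large_evals) < tol))

# With a small bandwidth, the leading eigenvectors are constant within each cluster.
spread_within = max(np.ptp(small_evecs[:40, :2], axis=0).max(),
                    np.ptp(small_evecs[40:, :2], axis=0).max())

print("near-zero eigenvalues (small bandwidth):", n_zero_small)
print("near-zero eigenvalues (large bandwidth):", n_zero_large)
print("within-cluster spread of leading eigenvectors:", spread_within)
```

With the small bandwidth, two eigenvalues are numerically zero and the leading eigenvectors are constant within each cluster, so they only encode cluster membership. With the larger bandwidth, the clusters are connected again and only the trivial zero eigenvalue remains.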
Excellent! This is very helpful. For my actual dataset, I'm dealing with boolean features, so I'm using Jaccard distance and will need to experiment with some settings. I plan on trying this on a very large genomics dataset but was having some performance issues. I was looking into what other packages are doing to address performance and noticed that diffusion maps are being used in scRNA-seq and microbial ecology. I would love to use this package in my research (with proper citation, of course). I was originally going to use https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.diffmap.html but I really like the sklearn API you developed. In particular the ability to
If you don't have time to incorporate new features, I'm definitely willing to give this implementation a shot! If you have any insight on which code I'll need to adapt, that would be greatly appreciated.
This might be a poor exercise, but I'm trying to understand the methods of a paper and whether it makes sense to adapt my linear, PCA-based workflow to non-linear manifold methods; I thought trying out diffusion maps would be worth a shot.
I'm trying to understand how to interpret the results from a diffusion map. The iris dataset is definitely not the best toy dataset, but I thought I would still be able to see some relationships.
I have a few questions:
Apologies if these questions are naive; I'm coming from microbial ecology and trying to understand the methods of a paper that I did not write.
Here's my code:
In this example, I'm seeing that they are excluding the first embedding:
https://www.linkedin.com/pulse/diffusion-maps-unveiling-geometry-high-dimensional-data-yeshwanth-n-qrsfc/