Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEF module cleanup #165

Merged
merged 31 commits into from
Dec 8, 2015
Merged

MEF module cleanup #165

merged 31 commits into from
Dec 8, 2015

Conversation

castillohair
Copy link
Collaborator

No description provided.

library, with full covariance matrices for each cluster and a fixed,
uniform set of weights. This means that `clustering_gmm` implicitly
assumes that all bead subpopulations have roughly the same number of
events. For more information, consult ``scikit-learn``'s documentation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing away samples 50% away from cluster centers seems pretty arbitrary and unjustified. I assume this is all part of the scikit-learn GMM bug? In which case this is kinda a hack to get around that? I've already raised my concerns regarding that bug in #159, and it may not be worth fixing at this point in time. In general, though, it would be nice to get rid of this strange initialization procedure if the scikit-learn GMM bug was ever fixed. In my previous experience (MATLAB implementation), a vanilla GMM fit would work fine with a reasonable K-means initialization.

I would also prefer not to hardcode the weights equally. If the GMM algorithm works as advertised, that should be a pretty straightforward parameter to fit, and leaving it unbounded allows for edge cases where you have non-uniform subpopulation sizes (again, the devil's advocate argument that immediately comes to mind is if you mixed two different bead samples and were unable to guarantee equal ratio between the two; just feels like an unnecessary assumption). I feel like this also relates to the GMM bug, though, in which case it may not be worth fixing now and would warrant a deeper investigation into the inner workings of scikit-learn's GMM implementation.

If this works good enough now, then I probably wouldn't change these things.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both using 50% of the samples for each cluster and forcing the weights to be equal are what's currently making GMM work. This was tested with a bunch of beads files taken by almost everybody in lab.

K-means initialization is actually performed by default (check line 486 in https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/mixture/gmm.py). And we saw that this didn't work properly.

Using equal weights is actually less of a hack that you would think. It's a well supported, documented option in scikit's GMM. And the model I'm using with full covariance matrices is arguably less restrictive than the default, which uses equal, diagonal covariance matrices (http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html).

@@ -607,14 +353,11 @@ def get_transform_fxn(data_beads, peaks_mef, mef_channels,
----------
data_beads : FCSData object
Flow cytometry data, taken from calibration beads.
peaks_mef : array
mef_values : array
Known MEF values of the calibration beads' subpopulations, for
each channel specified in `mef_channels`.
mef_channels : int, or str, or list of int, or list of str
Channels for which to generate transformation functions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think mef_values and mef_channels can be more intuitively combined into a dict where key=channel (int or str) and value = 1D numpy array. See #146.

@castillohair castillohair merged commit 780db36 into develop Dec 8, 2015
@castillohair castillohair deleted the mef-improvements branch December 8, 2015 14:29
if hasattr(populations[0], 'domain'):
high = 0.985*populations[0].domain(0)[-1]
else:
raise TypeError("argument 'high' not specified")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think referencing populations[0] will throw an error if populations is an empty list, which is never checked for. Not sure if we care about that edge case, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants