Improved the performance of the SGD classifier on sparse mutations by reducing the noise #71
Conversation
Nice work! Really cool idea combining PCA and LDA - it is surprising that the ROC curves look so good.
There are some plots (namely, the decision plot and probability plot) that are concerning. I made some comments inline.
I also have a general comment about RIT1:
- It is an oncogene important in the RAS/MAPK pathway
- Therefore, it would be interesting to test if a RIT1 classifier can also classify KRAS/NRAS/HRAS mutations.
- Along the same lines, can a KRAS/NRAS/HRAS classifier predict RIT1?
I think these questions could be an alternative way of improving the performance of a RIT1-specific classifier - but may not work for other genes with low prevalence.
# In[9]:
# Typically, this can only be done where the number of mutations is large enough
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
Looks like with this split there are only 6016 × 0.002874 × 0.1 ≈ 1.7 gold standard positives in the test set. We may want to get a bit creative to overcome this sample size issue.
I decided to change the size of the testing data to 0.2 and set stratify=y, which should make the testing score more reliable.
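For reference, a minimal sketch of a stratified split on synthetic data (the array sizes and positive rate here are illustrative, not the real TCGA matrix):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 samples, 2% positives
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = np.array([1] * 20 + [0] * 980)

# stratify=y keeps the positive rate equal in both splits, so a rare
# class cannot end up with ~0 positives in the test set by chance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_train.sum(), y_test.sum())  # positives in each split: 16 and 4
```

Without stratify, a 2% class can easily land 0-8 positives in a 20% test split; with it, the 20 positives divide exactly 16/4.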
# In[10]:
%%time
scale = StandardScaler()
X_train_scale = scale.fit_transform(X_train)
X_test_scale = scale.transform(X_test)
Any reason why you are fit_transforming X_train but not X_test? Same comment about the PCA implementation below.
When we train the models for feature transformation and dimensionality reduction, I think we should only use information from the training data. An example of this approach is in Raschka 2015.
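The pattern under discussion can be sketched as follows (synthetic data; the shapes are made up): the scaler learns its statistics on the training set only, and those same statistics are reused on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.randn(100, 3) * 5 + 2   # synthetic training data
X_test = rng.randn(20, 3) * 5 + 2     # synthetic held-out data

scaler = StandardScaler()
# Learn mean and std from the training data only...
X_train_scale = scaler.fit_transform(X_train)
# ...and reuse those same statistics on the test data, so no
# information from the test set leaks into the fitted transform
X_test_scale = scaler.transform(X_test)

# The training data is exactly standardized; the test data only approximately
print(X_train_scale.mean(axis=0).round(6))
```

Calling fit_transform on X_test instead would re-estimate the statistics from the test set, a mild form of leakage.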
# In[12]:
%%time
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_pca, y_train)
X_test_lda = lda.transform(X_test_pca)
wow! only 2 components! Can you add a scatter plot here where you color RIT1 mutation status?
Sorry, I misunderstood the component number for LDA. According to the documentation of sklearn.discriminant_analysis.LinearDiscriminantAnalysis, n_components cannot be larger than n_classes - 1. Therefore, with two classes LDA can only provide one-dimensional data, which is probably not very informative. But now I am using PCA only, keeping 30 dimensions.
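The constraint can be seen on toy data (everything here is synthetic): with binary labels, the fitted LDA projects onto a single discriminant axis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(60, 10)
y = rng.randint(0, 2, size=60)   # binary labels, like mutation status

# With 2 classes, n_components is capped at n_classes - 1 = 1
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (60, 1): a single discriminant axis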
# Ignore numpy warning caused by seaborn
warnings.filterwarnings('ignore', 'using a non-integer number instead of an integer')

ax = sns.distplot(predict_df.query("status == 0").decision_function, hist=False, label='Negatives')
It looks like the LDA transform causes the decision function to shift far to the negative. Any reason for this?
# In[24]:
ax = sns.distplot(predict_df.query("status == 0").probability, hist=False, label='Negatives')
ax = sns.distplot(predict_df.query("status == 1").probability, hist=False, label='Positives')
It looks like the positives all have extremely low probability of being positive but the negatives are flat. Looks fishy. Is there an intuition as to why this is happening?
@gwaygenomics thanks a lot for so many important comments and questions! I will try to answer the questions one by one.
@KT12 thank you for the link! That is actually where I learned about this strategy. But now I believe that I need to re-read the post and the linked articles.
Following @dhimmel's suggestion, I am using only PCA to remove the noise and reduce the dimensionality to 30. I renamed the file to RIT1-PCA-htcai.ipynb.

As a side note, I corrected my misunderstanding of LDA. Firstly, I set … Also, I scaled the data both before and after PCA; without the second scaling, the range of … In addition, the plot of …

Please let me know what I should do to further improve the notebook. Thanks in advance! Moreover, I am interested in @gwaygenomics's comments on the pathway. I would like to further explore the connection between RIT1 and KRAS/NRAS/HRAS.
Nice job with some really interesting and helpful results. Made some comments. We can discuss at meetup tomorrow if you can make it.
# In[1]:
import os
import urllib
Can remove the urllib and random imports.
# In[8]:
# Here is the percentage of tumors with RIT1
y.value_counts(True)
Can you also run y.value_counts()? There are so few positives I want to know the exact number... this is cool!
%%time
scale_pre = StandardScaler()
X_train_scale = scale_pre.fit_transform(X_train)
X_test_scale = scale_pre.transform(X_test)

# ## Reducing noise via PCA
Let's get rid of %%time in this section for better diff viewing. And in the "Feature Standardization" section as well.
# In[10]:
%%time
scale_pre = StandardScaler()
X_train_scale = scale_pre.fit_transform(X_train)
X_test_scale = scale_pre.transform(X_test)
I think you can write this all in a pipeline, which will simplify the code and make the process clearer. See the example in 2.TCGA-MLexample.py.
We can discuss at the meetup whether we should do cross-validation the right way or the wrong-but-faster way, as discussed in #70 (comment).
param_grid = {
    'alpha': [10 ** x for x in range(-4, 1)],
    'l1_ratio': [0, 0.2, 0.5, 0.8, 1],
}
I'd more densely sample alpha and reduce l1_ratio to [0, 0.15]. See #56.
# In[17]:
# Cross-validated performance heatmap
This figure shows how hard the problem is. We're struggling to find a nice optimum of parameters. Performance is all over the place.
# In[11]:
%%time
n_components = 30
pca = PCA(n_components=n_components, random_state=0)
X_train_pca = pca.fit_transform(X_train_scale)
X_test_pca = pca.transform(X_test_scale)
It would be nice to know what percentage of the variance 30 components are capturing. It seems like 30 could be removing some important signal.
# Percentage of preserved variance
print('{:.4}'.format(sum(pca.explained_variance_ratio_)))
So 35 components captures ~60% of variance. Can we increase the number of components until we capture at least 90% of variance?
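One way to do this without hand-tuning: PCA accepts a float n_components and keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic correlated data (shapes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated features standing in for the expression matrix
rng = np.random.RandomState(0)
X = rng.randn(300, 40) @ rng.randn(40, 40)

# A float n_components means "keep enough components to capture
# this fraction of the variance" (here 90%)
pca = PCA(n_components=0.9, svd_solver='full', random_state=0)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])                   # number of components kept
print(pca.explained_variance_ratio_.sum())  # at least 0.9
```

The fitted pca.n_components_ then reports how many components were needed, which would directly answer the 90% question here.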
As I mentioned below, increasing the dimensionality immediately leads to over-fitting and an extremely low testing score. It appears that a large proportion of the variation in the dataset is probably noise.
I made the following main changes: …
Great work with this pull request -- this is interesting stuff! Can you re-export …
Here's the plot: … I don't actually think the effect of … I think we may be able to decrease the noise in our performance measurements by playing with the …
Interesting. I'm worried that for some mutations the ~40% of uncaptured variance will be important. But I agree it's more important to actually be able to fit a model than to include all the relevant features.
We could also use … On second thought, perhaps it's even more elegant!
@dhimmel Thanks a lot for your detailed and valuable comments! I will re-export the … In order to locate the optimal range of …

Training a classifier over an extremely imbalanced dataset has always been a thorny issue. I tried under-sampling and over-sampling several weeks ago, but it turned out to be a disaster: the performance was much worse than the baseline SGD classifier. I am happy to explore other methods later.

It is a good idea to try 10-fold cross-validation. My only concern is that there will be only 1 or 2 positive samples in each of the 10 folds. Still, only the experimental result can clarify whether that is a problem.

I agree that the removal of ~40% of the variance is a problem for other mutations. For the roughly 10 mutations I experimented with, the optimal number of principal components is between 30 and 100. Therefore, given sufficient computational resources, a GridSearchCV over …
Great idea -- really insightful! Agree that computation time will be the drawback.

There will be more positives in the training folds, which is ultimately more important. The larger number of testing folds (10 versus 3) will counterbalance the smaller number of positives in each. This is why leave-one-out CV is best, even though there are often 0 positives in a given testing fold.
@dhimmel I used 10-fold GridSearchCV without … On the other hand, …
@htcai interesting. I'd love to see 1 × 10-fold for larger alphas, like you did for … Do you want to switch the notebook to … An alternative would be 10 × 10-fold CV, but that could be a pain to implement. And I don't really think it's fundamentally different from …
I updated the notebook with … It is good to see that the positive and the negative curves are more neatly segregated in the plots of … However, over-fitting seems to be salient.
cv_folds = [[t, c] for t, c in cv_folds]

# In[17]:
%%time
clf = SGDClassifier(random_state=0, class_weight='balanced', loss='log', penalty='elasticnet')
cv = GridSearchCV(estimator=clf, param_grid=param_grid, n_jobs=-1, scoring='roc_auc')
cv.fit(X=X_train_scale, y=y_train)
Move everything besides cv.fit(X=X_train_scale, y=y_train) out of the %%time cell for better py export.
In GridSearchCV you need to specify cv=sss.
# In[16]:
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
If computation time doesn't become prohibitive, I'd love something more thorough such as StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0). n_splits times test_size equals the average number of times each observation will be used for testing in CV. The higher this is, the less variation we should have in our performance estimates. However, we want test_size to be low, so the percent of observations used for training is maximized. Therefore, we should increase n_splits up until the point where it starts taking a long time. Make sense?
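The n_splits × test_size arithmetic can be checked directly on toy data (sizes here are made up): with 100 splits each testing 10% of the samples, every observation is tested 10 times on average.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.array([1] * 10 + [0] * 90)   # 10% positives

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

# Count how often each observation lands in a test fold
counts = np.zeros(len(y))
for _, test_idx in sss.split(X, y):
    counts[test_idx] += 1

# n_splits * test_size = 100 * 0.1 = 10 tests per observation on average
print(counts.mean())
```

Stratification also guarantees each 10-sample test fold contains exactly one positive here, which is why it matters so much for rare mutations.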
Yes, this is very sensible. It does not take a long time to run a 100-fold grid search. Actually, I set n_splits=150. Now the heat map looks much less noisy.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
cv_folds = sss.split(X_train_scale, y_train)
cv_folds = [[t, c] for t, c in cv_folds]
Computing cv_folds yourself may not be necessary.
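A sketch of the simpler alternative on synthetic data (shapes and grid values are illustrative): the splitter object can be handed to GridSearchCV as cv, which calls .split() internally, so materializing the folds into a list is unnecessary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(120, 5)
y = np.array([1] * 12 + [0] * 108)   # rare positive class

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
clf = SGDClassifier(random_state=0, class_weight='balanced')

# Pass the splitter itself as cv; no need for
# cv_folds = [[t, c] for t, c in sss.split(...)]
cv = GridSearchCV(clf, {'alpha': [1e-4, 1e-2]}, scoring='roc_auc', cv=sss)
cv.fit(X, y)
print(cv.best_params_)
```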
Cool, one last thing so we can understand what's going on with the super high alphas. Can you export the coefficients from the best model with …
That is a good idea! I visualized the coefficients of the …
I'm going over these results with @cgreene right now. We think the predicted probabilities should be centered around 0.0029 (21 / (21 + 7285)). So we're trying to diagnose what's going on. Two changes to param_grid will help us figure this out... hopefully.
param_grid = {
    'alpha': [2**x for x in range(-20, 60)],
    'l1_ratio': [0]
}
Can you add 0.15 to this list and rerun?
# In[15]:
param_grid = {
    'alpha': [2**x for x in range(-20, 60)],
Let's reduce the upper bound to 30 at most.
I searched over four values of …
Yeah I'm a little confused by the probabilities. Using a pipeline would help us make sure the right transformations are being applied.
Otherwise these are fantastic results. Really like the new CV plot!
# In[24]:
X_transformed = scale_post.transform(pca.transform(scale_pre.transform(X)))
Did you want to put scale_pre, pca, and scale_post in a pipeline? This would really simplify things and help avoid errors.
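A sketch of what that pipeline could look like, on synthetic stand-in data (shapes and n_components here are illustrative, not the notebook's values): chaining the three fitted transformers replaces the nested transform calls with a single object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(150, 20)       # stand-in for the training expression matrix
X_new = rng.randn(10, 20)    # stand-in for new samples to transform

# scale -> PCA -> scale again, matching the notebook's
# scale_pre / pca / scale_post chain
transformer = make_pipeline(
    StandardScaler(),
    PCA(n_components=5, random_state=0),
    StandardScaler(),
)
transformer.fit(X)

# One call replaces scale_post.transform(pca.transform(scale_pre.transform(X_new)))
X_transformed = transformer.transform(X_new)
print(X_transformed.shape)   # (10, 5)
```

Besides being shorter, the pipeline makes it impossible to apply the three steps in the wrong order, which is exactly the error class being worried about here.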
I finally got it to run with a pipeline (and hopefully in the correct order) including scaling, PCA, and the classifier. Within a small amount of time, I can only search over a smaller space of …
@dhimmel Thanks for your clarification. I guess you are suggesting pipelining the scaling and the PCA while leaving the classifier alone; if this is not the case, we should talk further about the implementation tomorrow. I increased the number of principal components to 50, which is a plausible value if we want a uniform one for all of the sparse genes. The testing score for RIT1 was actually improved compared with 30 and 35.
@htcai I remember that you made some progress last night at the meetup. Let me know when this is ready for me to take a look again.
@dhimmel I am not sure whether you are referring to pipelining the data standardization and PCA. If this is the case, then it is already contained in the latest commit. I tried … In addition, I am currently running …
Great, that's what I was referring to. This PR is ready to merge. Let's do a separate PR for the other dimensionality reduction methods.
In response to #52, I experimented with a variety of methods to improve the performance of the SGD classifier on sparse genes (e.g., RIT1, MAP2K1).
Mainly, I used PCA to reduce the noise and then employed LDA to compress the data to 2 dimensions. This strategy is justified by the apparent overfitting for many genes documented in #52.
Taking RIT1 as the example, the improvement is notable: the test AUROC is 0.837, compared with 0.65 in #52. For several other genes I explored (e.g., MAP2K1, SMAD2), the improvement is mostly between 0.07 and 0.10. Moreover, the total time needed for dimensionality reduction and model training is under half a minute in Ubuntu on my ASUS laptop.
Still, there are a few issues. The optimal n_components of PCA differs across genes. Therefore, it would be great if I could search over multiple values of n_components, which is probably viable using a pipeline. However, Ubuntu (8 GB RAM + 8 GB swap) always runs out of memory when running the pipeline. My MacBook does not have this issue since it uses memory compression, but it never finishes running the pipeline. Hence, I didn't use a pipeline in the file submitted. Also, I am not sure how to interpret the probability plot at the end of the notebook.
Any suggestions and/or comments are welcome. Thanks in advance!