Marginal gain of gene expression data over covariates #67
Conversation
AWESOME! Just two things before I review. Can you remove the space in Then run and track the exported …
… comparison to multiple mutations.
I added a notebook (and script) that extends the comparison of covariates-only to covariates+expression models to multiple mutations. If you look at the bar plot at the bottom, you'll see that the training auroc is always better for the covariates+expression model (positive values) but that the testing auroc drops off dramatically and is lower in every case other than TP53. This is probably a case of overtraining. Thoughts?
Cool, I think there are some weird things going on that make these results hard to interpret. Hopefully, addressing my comments will clear things up!
# In[46]:

%%time
# Train model a: covariates only.
warnings.filterwarnings("ignore")  # ignore deprecation warning for grid_scores_
rows = list()
for m in list(mutations):
    series = pd.Series()
    series['mutation'] = m
    series['symbol'] = mutations[m]
    rows.append(get_aurocs(X['a'], Y[m], cv_pipeline['a'], series))
auroc_df['a'] = pd.DataFrame(rows)
auroc_df['a'].sort_values(['symbol', 'testing_auroc'], ascending=[True, False], inplace=True)
How about more descriptive keys for auroc_df than a, b, and c? Also maybe rename auroc_df to auroc_dfs to make clear that it itself is not a dataframe.
Yes, a and b are not terribly descriptive, but they actually stand for models a and b, which I have used consistently. I did change c though, because that is totally not descriptive -- it's now diff_ba (the difference between models b and a).
Do what you think is best. My opinion is that it's almost always better to be explicit and save the reader from having to remember a "labeling dictionary" of sorts.
Will change to model a and model b
# In[68]:

plot_df = pd.melt(auroc_df['c'], id_vars='symbol',
                  value_vars=['mean_cv_auroc', 'training_auroc', 'testing_auroc'],
                  var_name='kind', value_name='auroc')
grid = sns.factorplot(y='symbol', x='auroc', hue='kind', data=plot_df, kind="bar")
Can you change the x-axis label from auroc to Δ AUROC, so it's clear that it's the change in AUROC?
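One way to do that (a sketch, assuming grid is the FacetGrid returned by factorplot in the snippet above):

```python
# relabel the x-axis of the seaborn FacetGrid
grid.set_xlabels('Δ AUROC')
```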
sure.
cv_pipeline = {}
for k in ['a', 'b']:
    if k == 'a': param_grid['select__k'] = ['all']
    elif k == 'b': param_grid['select__k'] = [2000]
I think you're likely selecting out most of the covariates because they have a low median absolute deviation. This is probably why the "covariate + expression" models are worse than the "covariate only" models.
FeatureUnion may be the correct way to deal with this.
Ah, good catch.
I implemented it two ways: 1) using FeatureUnion -- this seems to work but is a bit clunky; 2) using DataFrameMapper -- cleaner syntax and I think it works, but I'm getting a warning. Let me know which approach you prefer.
Cool... What's the warning for DataFrameMapper -- I didn't see a warning in your notebook. I agree DataFrameMapper is cleaner. Do you think the added cleanliness is worth the extra dependency? Can you modify 2.Marginal-Gain-Multiple-Mutations.ipynb to use your preferred method?
I will go with FeatureUnion. The error from DataFrameMapper (you did not see the warning because I ran the other method) was "Estimator %s modifies parameters in init." from the sklearn base library (http://contrib.scikit-learn.org/imbalanced-learn/_modules/sklearn/base.html). I could never figure that one out, and the DataFrameMapper library was also finicky (e.g. the mapper needed to be at the top of the pipeline). Overall, I think it is not worth the extra dependency, as you suggested. I modified the FeatureUnion method to use a lambda in FunctionTransformer, which makes it a bit more concise.
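A minimal sketch of that FeatureUnion + FunctionTransformer arrangement. Everything here is illustrative rather than the notebook's actual code: the column layout (covariates first, expression after), n_covariates, and this version of the MAD scorer are all assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_selection import SelectKBest

def fs_mad(X, y=None):
    """Score each feature by its median absolute deviation (illustrative version)."""
    X = np.asarray(X)
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

n_covariates = 10  # hypothetical number of covariate columns at the front of X (a numpy array)

features = FeatureUnion([
    # covariates pass through untouched, so MAD-based selection can't drop them
    ('covariates', FunctionTransformer(lambda X: X[:, :n_covariates], validate=False)),
    # expression genes are sliced out and reduced to the 2000 with the highest MAD
    ('expression', Pipeline([
        ('slice', FunctionTransformer(lambda X: X[:, n_covariates:], validate=False)),
        ('select', SelectKBest(fs_mad, k=2000)),
    ])),
])
```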
    'classify__loss': ['log'],
    'classify__penalty': ['elasticnet'],
    'classify__alpha': [10 ** x for x in range(-3, 1)],
    'classify__l1_ratio': [0, 0.2, 0.8, 1],
Let's stick to a single l1_ratio but try more alpha values (see #56):

'classify__alpha': 10.0 ** np.linspace(-3, 1, 10),
'classify__l1_ratio': [0.15],

The above values are from logistic_regression.py#L19-L20.
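For reference, here's what the grid would look like with that change folded in -- a sketch only, assuming the pipeline step is still named classify:

```python
import numpy as np

param_grid = {
    'classify__loss': ['log'],
    'classify__penalty': ['elasticnet'],
    'classify__alpha': 10.0 ** np.linspace(-3, 1, 10),  # 10 alphas from 1e-3 to 10
    'classify__l1_ratio': [0.15],
}
```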
Sounds good.
# In[263]:

# What are the top weighted features for model a and model b?
display(coef_df['b'].head(5))
Can we add n_positives and n_negatives to these dataframes? Would help me diagnose the variability in performance between CV, training, & testing.
For which dataframe? coef_df? I think I'm misunderstanding b/c seems like that wouldn't help--the weights already tell you that. Maybe you mean for auroc_df or something?
My mistake... yeah, I'm referring to auroc_df in the second notebook.
I have modified the code but still need to run it. Once I have done so, I will repost with changes.
Awesome work, especially dealing with the difficulties of parallel feature processing -- it looks like the sklearn design didn't fully anticipate this crucial use case.
This PR is close to ready... now time to apply your solutions to the second notebook.
param_grid = {
    'classify__loss': ['log'],
    'classify__penalty': ['elasticnet'],
    'classify__alpha': [10 ** x for x in range(-3, 1)],
Looks like your notebook only has 1 value here. Reminder to reset when you're ready for a final analysis. I think it's also fine to consolidate to a single notebook later (probably the current 2.).
Will do. Just for testing.
Is it possible to only apply the Imputer to the covariate features? The gene expression doesn't have missing data, so it may speed things up.
It should be possible to apply Imputer to only the covariates. But for some reason this is not working with FeatureUnion. It's as if FunctionTransformer were working for SelectKBest but not Imputer. I will run the code through and then maybe you can have a look at restructuring the code to include this change? Could be I'm missing something.
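One possible restructuring (a sketch, not the notebook's code): give the covariates their own sub-pipeline inside the FeatureUnion so the Imputer only ever sees those columns. n_covariates and fs_mad are the same illustrative assumptions as in the earlier sketch.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, Imputer
from sklearn.feature_selection import SelectKBest

features = FeatureUnion([
    ('covariates', Pipeline([
        ('slice', FunctionTransformer(lambda X: X[:, :n_covariates], validate=False)),
        ('impute', Imputer()),  # imputation touches only the covariate columns
    ])),
    ('expression', Pipeline([
        ('slice', FunctionTransformer(lambda X: X[:, n_covariates:], validate=False)),
        ('select', SelectKBest(fs_mad, k=2000)),  # expression has no missing data
    ])),
])
```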
…k and updating pipeline for feature combination.
I ran the code on a restricted dataset (1000 samples) for all the mutations. The results are in the 1.Marginal-Gain-Multiple-Mutations notebook. As you can see, the testing AUROCs are still lower for the combined model for a number of mutations. To diagnose, I added n_positives and n_negatives as you asked. I also added some statistics on the ranks of the covariate weights with respect to the ranks of all features. Cumulative, median and mean ranks are presented--cumulative rank means the total of the index positions of covariate features in the sorted coef_df dataframe. I can run this on the whole dataset, but I think I will hit the same issue. (Also, note that to make this run, I have set the 1st value of y_test and y_train to 1 just to ensure at least one positive mutation example. Not sure--could be affecting the results a bit.)
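To make "cumulative rank" concrete, a sketch of how those statistics could be computed -- the is_covariate flag and the assumption that coef_df is already sorted by weight are hypothetical, since the actual column names aren't shown here:

```python
import numpy as np

sorted_df = coef_df.reset_index(drop=True)                 # row 0 = highest-weighted feature
ranks = np.flatnonzero(sorted_df['is_covariate'].values)   # index positions of covariate rows
cumulative_rank = int(ranks.sum())                         # total index positions of covariates
median_rank = float(np.median(ranks))
mean_rank = float(ranks.mean())
```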
After writing that, I realized that I did not change the alpha parameter back to test all values. Let me do that and get back to you. Probably a regularization issue.
OK, I re-ran it with all hyperparameters and data. Still, the testing is only substantially better for TP53. Strangely, the training auroc is lower for some mutations--I don't even know how that's possible. Hopefully the updates will make it easier to diagnose.
I made a few additional comments. Additionally, it looks like you may have run a git add --all that added some files we don't want to track. Specifically, some .DS_Store files crept in which should be removed. Also, 1.Marginal-Gain-Multiple-Mutations.py appears to have been uploaded in two directories.
# In[14]:

auroc_dfs['model a'].to_pickle('./covariates_only.pkl')
Any reason for the switch to pickles over TSV here -- TSV is preferable so it's readable outside of Python. Consider combining both tables into a single dataframe for storage by adding a model column and then concatenating.
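Something along these lines (a sketch -- the model labels and file name are just for illustration):

```python
import pandas as pd

# tag each table with its model, stack them, and write a TSV instead of a pickle
combined_df = pd.concat([
    auroc_dfs['model a'].assign(model='covariates only'),
    auroc_dfs['model b'].assign(model='covariates + expression'),
])
combined_df.to_csv('aurocs.tsv', sep='\t', index=False)
```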
Done.
series['mean_cv_auroc'] = cv_score_df.score.max()
series['training_auroc'] = roc_auc_score(y_train, y_pred_train)
series['testing_auroc'] = roc_auc_score(y_test, y_pred_test)
series['n_pos_coeffs'] = n_pos
The new columns n_pos_coeffs and n_neg_coeffs are really useful. They're not actually what I meant by n_positives and n_negatives -- but let's definitely keep them. By n_positives I meant the number of samples with the mutation (sum(y == 1)).
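For example (a sketch, added next to the existing series assignments shown above):

```python
series['n_positives'] = int((y == 1).sum())  # samples carrying the mutation
series['n_negatives'] = int((y == 0).sum())  # samples without it
```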
Gotcha. Updated. As you might expect, the combined model shows the clearest advantage where we have a lot of positive examples to work from (TP53).
    Fit the classifier for the given mutation (y) and output predictions for it
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    y_train[0] = 1
> (Also, note that to make this run, I have set the 1st value of the y_test and y_train to 1 just to ensure at least one positive mutation example. Not sure--could be affecting the results a bit.)

@joshlevy89, I'm concerned about this workaround. I think specifying stratify=y in train_test_split should resolve this problem. If not, I don't think we want to make a workaround -- the failure shouldn't be masked.
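A sketch of the suggested call, mirroring the line in the context above:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)
```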
Don't worry, that was just a workaround during debugging--I took it out when I ran the script (I think you are looking at an old notebook here). In any case, I like your method better.
""" | ||
Fit the classifier for the given mutation (y) and output predictions for it | ||
""" | ||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0) |
Let's increase testing to 25% (test_size=0.25). Our testing AUROCs are potentially super unstable due to the small number of positives.
Yes.
covariates_pipeline = Pipeline(steps=[
    ('imputer', Imputer()),
    ('standardize', StandardScaler()),
    ('select', SelectKBest(fs_mad, 'all')),
Let's just delete this entire step if it's not actually doing any selection.
I assume you just mean the ('select', SelectKBest(fs_mad, 'all')) step. I will take it out (and modify how the coef_df dataframe is built to handle the lack of a 'select' step).
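Roughly this, then (a sketch of the trimmed-down branch, with step names as in the snippet above):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, StandardScaler

covariates_pipeline = Pipeline(steps=[
    ('imputer', Imputer()),             # covariates can contain missing values
    ('standardize', StandardScaler()),  # no SelectKBest: all covariates are kept anyway
])
```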
@dhimmel I have updated this pull request with the requested changes, and added some other stuff as well. We are no longer experiencing fitting failures/turbulence when working with the expression data, probably due to adding the feature selection/pca. 😃 A few notes on what I have done here:
I think that this PR is now ready to be merged.
Hey @joshlevy89 -- thanks for this huge PR! Wanted to let you know it's on my radar, but I may not get to it for a little bit. The next step will be understanding why expression does not improve upon the covariate models -- whether it's fundamentally not useful or because we need to change our modeling.
@dhimmel No problem. I agree that is the next step. Indeed, it seems like the current approaches do not improve upon covariates.
Made some small suggestions to help us narrow in on the situation. Apply to all 3 analyses.
Also fine to proceed ahead with only 2 of 3 notebooks and delete the third -- the PCA one is most important I think. Up to you.
    '2353': 'FOS',  # Transcription factors
}

mutations = {**genes_LungCancer, **genes_TumorSuppressors, **genes_Oncogenes}
Really cool way to combine dicts, didn't realize you could do this!
# In[43]:

%%time
path = os.path.join('..', '..', 'download', 'covariates.tsv')
covariates = pd.read_table(path, index_col=0)
Let's restrict to only the following covariates:

- n_mutations_log1p
- anything starting with acronym_

We've seen these are the most important. This way we can remove imputation, which won't currently work for categorical variables that have been dummied anyway.
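A sketch of that restriction applied to the covariates dataframe read above (the selection code itself is just illustrative):

```python
# keep only the mutation-burden covariate and the dummied disease-acronym columns
keep = [col for col in covariates.columns
        if col == 'n_mutations_log1p' or col.startswith('acronym_')]
covariates = covariates[keep]
```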
We can remove %%time statements from every cell besides .fit, given how annoying it makes the diffing.
Can you rerun the download notebook, so we're working with v6 of the dataset?
@dhimmel applied changes and re-ran. I deleted the marginal-gain-Ret-mad.ipynb example notebook, as it was just to validate that everything was working correctly. I only modified marginal-gain-multiple-genes-pca.ipynb as more changes may be on the way, and we may want to delete marginal-gain-multiple-genes-mad.ipynb later anyway.
@joshlevy89 great work. I think you've shown that the prediction of many mutations is not benefiting from expression data, once covariates are accounted for. However, some mutations still are, such as AKT1, TP53, RB1, NF2, and HRAS. This is really good to know -- our approach may only apply to certain mutations.

I also think your last plots are really informative -- gene expression seems most likely to contribute when the number of positives is higher. So this leads me to believe that our model lacks the necessary power to detect gene expression signatures when the sample size is low, even when using PCA. I think it makes sense to refocus our ML team to look more into fitting models with both covariates and expression.

Thanks for completing this gigantic analysis with lots of back and forth! Up to you whether you want to delete …
One thing that I noticed - many of the genes where the expression seems to add information beyond the covariates seem to be cell cycle genes. The various Ras genes + NF1, where @gwaygenomics has worked a fair amount, seem to be relatively high performing. P53 too, but this is not surprising as it is known to induce a big effect on the transcriptome. RB1, PTEN, AKT are sort of in the cell cycle hit list. This list would be enriched for cell cycle genes anyway as a cancer gene list, but these are the things that stuck out at me.
i agree - this PR is monumental! nice work @joshlevy89

It is a bit jarring to see how well covariates perform alone in many circumstances. Can you remind me what the covariates are? Dummy variable disease type and log(mutation count)? It may be good to show a …

Also, it could be good to eventually show what the coefficients are for some of the models - at the very least to double check that all the signal the gene expression adds is not washed out by giant covariate weights. It would also be interesting to see the weights of the disease type for each model. I suspect that for some gene mutations, especially those that are heavily skewed across cancer types (e.g. NF1 - 11% in GBM (17/149) but 0.6% in THCA (3/485)), the covariate model alone will just pick out disease types rather than individual samples. Another way to test this would be to build models using only a single disease type and then make the comparisons suggested in this PR.

Perhaps these suggestions belong in a separate PR? I do not want to stall merging!
Let's do a separate PR. I think @joshlevy89 was in the process of moving. However, @gwaygenomics raises some great questions. @joshlevy89 you are free to continue tackling them after we merge this PR. But let's save that for a future PR. |
@dhimmel I applied the changes to the marginal-gain-multiple-genes-mad.ipynb notebook and re-ran. It will be good to keep this as a reference, especially to Branka's PR #52. Should be good to merge now. Thanks for the help on getting this out! I think it raises some interesting issues, as pointed out by @cgreene and @gwaygenomics, and it will be interesting to pursue this further.
Great work @joshlevy89! Will merge now. |
Compares a covariates-only model to a covariates-with-gene-expression model in predicting TP53 mutations.