Create the cognoml package to implement an MVP API #51
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -42,11 +42,11 @@ def classify(sample_id, mutation_status, **kwargs): | |
|
||
obs_df = pd.DataFrame.from_items([ | ||
('sample_id', sample_id), | ||
('status', mutation_status) | ||
('status', mutation_status), | ||
]) | ||
|
||
X = read_data() | ||
X = X.loc[obs_df.sample_id, :] | ||
X_whole = read_data() | ||
X = X_whole.loc[obs_df.sample_id, :] | ||
y = obs_df.status | ||
|
||
X_train, X_test, y_train, y_test = train_test_split( | ||
|
@@ -56,26 +56,34 @@ def classify(sample_id, mutation_status, **kwargs): | |
pipeline.fit(X=X_train, y=y_train) | ||
#cv_score_df = grid_scores_to_df(clf_grid.grid_scores_) | ||
|
||
obs_df['predicted_status'] = pipeline.predict(X) | ||
obs_df['predicted_score'] = pipeline.decision_function(X) | ||
#obs_df['predicted_prob'] = pipeline.predict_proba(X) | ||
|
||
is_testing = obs_df.testing.astype(bool) | ||
y_pred_train = obs_df.predicted_score[~is_testing] | ||
y_pred_test = obs_df.predicted_score[is_testing] | ||
predict_df = pd.DataFrame.from_items([ | ||
('sample_id', X_whole.index), | ||
('predicted_status', pipeline.predict(X_whole)), | ||
('predicted_score', pipeline.decision_function(X_whole)), | ||
('predicted_prob', pipeline.predict_proba(X_whole)[:, 1]), | ||
@yl565 what's the best way to see if a pipeline supports
See d52c6a2 for my solution
||
]) | ||
|
||
# obs_df switches to containing non-selected samples | ||
obs_df = obs_df.merge(predict_df, how='right', sort=True) | ||
obs_df['selected'] = obs_df.sample_id.isin(sample_id).astype(int) | ||
for column in 'status', 'testing', 'selected': | ||
obs_df[column] = obs_df[column].fillna(-1).astype(int) | ||
obs_train_df = obs_df.query("testing == 0") | ||
obs_test_df = obs_df.query("testing == 1") | ||
|
||
#y_pred_train = obs_df.query("testing == 0").predicted_score | ||
#y_pred_test = obs_df.query("testing == 1").predicted_score | ||
|
||
dimensions = collections.OrderedDict() | ||
dimensions['observations'] = len(X) | ||
dimensions['observations_selected'] = sum(obs_df.selected == 1) | ||
dimensions['observations_unselected'] = sum(obs_df.selected == 0) | ||
dimensions['features'] = len(X.columns) | ||
dimensions['positives'] = (y == 1).sum() | ||
dimensions['negatives'] = (y == 0).sum() | ||
dimensions['positives'] = sum(obs_df.status == 1) | ||
dimensions['negatives'] = sum(obs_df.status == 0) | ||
dimensions['positive_prevalence'] = y.mean().round(5) | ||
dimensions['training_observations'] = (obs_df.testing == 0).sum() | ||
dimensions['testing_observations'] = (obs_df.testing == 1).sum() | ||
dimensions['training_observations'] = len(obs_train_df) | ||
dimensions['testing_observations'] = len(obs_test_df) | ||
results['dimensions'] = utils.value_map(dimensions, round, ndigits=5) | ||
|
||
obs_train_df = obs_df.query("testing == 0") | ||
obs_test_df = obs_df.query("testing == 1") | ||
|
||
performance = collections.OrderedDict() | ||
for part, df in ('training', obs_train_df), ('testing', obs_test_df): | ||
|
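On the `predicted_prob` line in the diff above: the review question there is cut off, so the following assumes it concerns optional methods such as `predict_proba`. One common duck-typing check is `hasattr`. This is a minimal sketch using hypothetical stand-in classes (`ScoreOnlyModel`, `ProbabilisticModel`) and a hypothetical helper (`predicted_prob_or_none`) rather than a real scikit-learn pipeline, and it is not necessarily the approach taken in d52c6a2.

```python
import math


class ScoreOnlyModel:
    """Hypothetical stand-in for an estimator without predict_proba."""

    def decision_function(self, X):
        # Toy score: the sum of each row's features.
        return [sum(row) for row in X]


class ProbabilisticModel(ScoreOnlyModel):
    """Hypothetical stand-in for an estimator that also offers predict_proba."""

    def predict_proba(self, X):
        # Logistic transform of the score; one [P(0), P(1)] pair per row.
        probs = [1 / (1 + math.exp(-s)) for s in self.decision_function(X)]
        return [[1 - p, p] for p in probs]


def predicted_prob_or_none(model, X):
    """Return positive-class probabilities, or None when unsupported."""
    if hasattr(model, 'predict_proba'):
        return [row[1] for row in model.predict_proba(X)]
    return None


X_demo = [[0.0, 0.0], [1.0, 1.0]]
no_probs = predicted_prob_or_none(ScoreOnlyModel(), X_demo)
with_probs = predicted_prob_or_none(ProbabilisticModel(), X_demo)
```

With a guard like this, the `predicted_prob` column could be added only when the fitted model provides probabilities, instead of being commented out.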
Confirming that you're deciding not to stratify based on disease too?
Ah, stratification by disease could also make sense. Currently, sample/covariate info is not part of this pull request. I think it probably should be added before the first release.
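To make the stratification idea concrete, here is a rough sketch of splitting on a combined (status, disease) label so that each status-by-disease group keeps the same train/test proportion. It is pure Python for illustration only: `stratified_split`, the sample IDs, and the disease labels `LUAD` and `BRCA` are all hypothetical, since sample/covariate info is not part of this pull request.

```python
import collections
import random


def stratified_split(sample_ids, strata, test_fraction=0.1, seed=0):
    """Split sample_ids into (train, test), preserving stratum proportions.

    strata[i] is the stratum label of sample_ids[i], for example a
    (status, disease) tuple.
    """
    rng = random.Random(seed)
    groups = collections.defaultdict(list)
    for sample_id, stratum in zip(sample_ids, strata):
        groups[stratum].append(sample_id)
    train, test = [], []
    # Iterate strata in sorted order so the result is reproducible.
    for stratum in sorted(groups):
        members = groups[stratum]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical data: 40 samples, two diseases, alternating mutation status.
sample_ids = ['s{}'.format(i) for i in range(40)]
status = [i % 2 for i in range(40)]
disease = ['LUAD' if i < 20 else 'BRCA' for i in range(40)]
strata = list(zip(status, disease))
train_ids, test_ids = stratified_split(sample_ids, strata, test_fraction=0.2)
```

With scikit-learn, the analogous move would presumably be to pass a combined label to the `stratify` argument of `train_test_split`.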
Also, I was at a talk by Olivier Elemento. He was building models for a different purpose (predicting immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to check out his work and consider adjusting for burden too.
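One simple reading of "adjusting for burden" is to include it as an extra feature. The sketch below assumes mutation burden means the number of mutated genes per sample and log-transforms it. Both choices are assumptions, the helper `add_mutation_burden` and the toy data are hypothetical, and the method described in the talk may differ.

```python
import math


def add_mutation_burden(features, mutations):
    """Append log1p(mutation count) to each sample's feature vector.

    features: dict of sample_id -> list of feature values
    mutations: dict of sample_id -> iterable of 0/1 gene mutation calls
    """
    adjusted = {}
    for sample_id, values in features.items():
        burden = sum(mutations[sample_id])  # assumed definition of burden
        adjusted[sample_id] = list(values) + [math.log1p(burden)]
    return adjusted


# Hypothetical toy data: two samples, two features, three genes.
features = {'s1': [0.5, 1.2], 's2': [0.1, 0.3]}
mutations = {'s1': [1, 0, 1], 's2': [0, 0, 0]}
adjusted = add_mutation_burden(features, mutations)
```

The classifier would then see burden alongside the other features, so the coefficients on the remaining features are estimated with burden held in the model.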