Create the cognoml package to implement an MVP API #51

Merged (20 commits, Oct 11, 2016)
Changes from 1 commit
44 changes: 26 additions & 18 deletions cognoml/analysis.py
@@ -42,11 +42,11 @@ def classify(sample_id, mutation_status, **kwargs):

     obs_df = pd.DataFrame.from_items([
         ('sample_id', sample_id),
-        ('status', mutation_status)
+        ('status', mutation_status),
     ])

-    X = read_data()
-    X = X.loc[obs_df.sample_id, :]
+    X_whole = read_data()
+    X = X_whole.loc[obs_df.sample_id, :]
     y = obs_df.status

     X_train, X_test, y_train, y_test = train_test_split(

Member:
Confirming that you're deciding not to stratify based on disease too?

Member Author:
Ah, stratification by disease could also make sense. Currently, sample/covariate info is not part of this pull request. I think it probably should be added before the first release.
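
A minimal sketch of what disease-stratified splitting could look like once sample/covariate info is wired in; `covariate_df` and its `disease` column are hypothetical names, the split parameters are illustrative, and the import assumes sklearn >= 0.18:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical: covariate_df holds per-sample covariates indexed by sample_id,
# including a 'disease' column. Stratifying on the combined (status, disease)
# label keeps both the class balance and the disease mix similar between the
# training and testing partitions.
disease = pd.Series(covariate_df.loc[obs_df.sample_id, 'disease'].values,
                    index=obs_df.index, name='disease')
strata = obs_df.status.astype(str) + '|' + disease.astype(str)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=strata)
```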

Member:
Also, I was at a talk by Olivier Elemento. He was building models for a different purpose (predicting immunotherapy responders) but was adjusting for mutation burden as a covariate. We may want to consider checking out his work and adjusting for burden too.
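
And a rough sketch of folding burden in as an extra covariate column; `mutation_burden` is a hypothetical per-sample Series (e.g. total mutation count), not something this pull request provides:

```python
import numpy as np

# Hypothetical: mutation_burden is a pandas Series of mutation counts indexed
# by sample_id. A log transform tames the heavy right tail, and appending the
# column to X lets the model adjust for burden alongside the existing features.
log_burden = np.log10(mutation_burden.loc[X.index] + 1)
X = X.assign(log10_mutation_burden=log_burden)
```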

@@ -56,26 +56,34 @@ def classify(sample_id, mutation_status, **kwargs):
     pipeline.fit(X=X_train, y=y_train)
     #cv_score_df = grid_scores_to_df(clf_grid.grid_scores_)

-    obs_df['predicted_status'] = pipeline.predict(X)
-    obs_df['predicted_score'] = pipeline.decision_function(X)
-    #obs_df['predicted_prob'] = pipeline.predict_proba(X)
-
-    is_testing = obs_df.testing.astype(bool)
-    y_pred_train = obs_df.predicted_score[~is_testing]
-    y_pred_test = obs_df.predicted_score[is_testing]
+    predict_df = pd.DataFrame.from_items([
+        ('sample_id', X_whole.index),
+        ('predicted_status', pipeline.predict(X_whole)),
+        ('predicted_score', pipeline.decision_function(X_whole)),
+        ('predicted_prob', pipeline.predict_proba(X_whole)[:, 1]),

Member Author:
@yl565 what's the best way to check whether a pipeline supports predict_proba? We can upgrade to sklearn 0.18 once it's released, if that will make things easier.

Member Author:
See d52c6a2 for my solution
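
For reference, one common pattern (a sketch only, not necessarily what d52c6a2 does) is to test the pipeline's final step for the method before calling it:

```python
# Not every classifier exposes predict_proba (e.g. SGDClassifier with hinge
# loss does not), so check the final estimator and fall back when it's absent.
final_estimator = pipeline.steps[-1][1]
if hasattr(final_estimator, 'predict_proba'):
    predicted_prob = pipeline.predict_proba(X_whole)[:, 1]
else:
    predicted_prob = None  # could instead rescale pipeline.decision_function(X_whole)
```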

+    ])

+    # obs_df switches to containing non-selected samples
+    obs_df = obs_df.merge(predict_df, how='right', sort=True)
+    obs_df['selected'] = obs_df.sample_id.isin(sample_id).astype(int)
+    for column in 'status', 'testing', 'selected':
+        obs_df[column] = obs_df[column].fillna(-1).astype(int)
+    obs_train_df = obs_df.query("testing == 0")
+    obs_test_df = obs_df.query("testing == 1")

+    #y_pred_train = obs_df.query("testing == 0").predicted_score
+    #y_pred_test = obs_df.query("testing == 1").predicted_score

     dimensions = collections.OrderedDict()
     dimensions['observations'] = len(X)
+    dimensions['observations_selected'] = sum(obs_df.selected == 1)
+    dimensions['observations_unselected'] = sum(obs_df.selected == 0)
     dimensions['features'] = len(X.columns)
-    dimensions['positives'] = (y == 1).sum()
-    dimensions['negatives'] = (y == 0).sum()
+    dimensions['positives'] = sum(obs_df.status == 1)
+    dimensions['negatives'] = sum(obs_df.status == 0)
     dimensions['positive_prevalence'] = y.mean().round(5)
-    dimensions['training_observations'] = (obs_df.testing == 0).sum()
-    dimensions['testing_observations'] = (obs_df.testing == 1).sum()
+    dimensions['training_observations'] = len(obs_train_df)
+    dimensions['testing_observations'] = len(obs_test_df)
     results['dimensions'] = utils.value_map(dimensions, round, ndigits=5)

-    obs_train_df = obs_df.query("testing == 0")
-    obs_test_df = obs_df.query("testing == 1")

     performance = collections.OrderedDict()
     for part, df in ('training', obs_train_df), ('testing', obs_test_df):
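
To make the new merge-and-fill step concrete, here is a toy illustration with made-up sample IDs (plain DataFrame constructors are used here rather than `from_items`):

```python
import pandas as pd

# Two selected samples (one training, one testing) plus predictions that also
# cover an unselected sample 'S3'.
obs_df = pd.DataFrame({'sample_id': ['S1', 'S2'], 'status': [1, 0], 'testing': [0, 1]})
predict_df = pd.DataFrame({'sample_id': ['S1', 'S2', 'S3'], 'predicted_score': [2.1, -0.7, 0.3]})

obs_df = obs_df.merge(predict_df, how='right', sort=True)
obs_df['selected'] = obs_df.sample_id.isin(['S1', 'S2']).astype(int)
for column in 'status', 'testing', 'selected':
    obs_df[column] = obs_df[column].fillna(-1).astype(int)
print(obs_df)  # 'S3' comes through with status == testing == -1 and selected == 0
```

The -1 fill marks fields that do not apply to unselected samples while keeping the columns integer-typed.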
9 changes: 8 additions & 1 deletion cognoml/utils.py
@@ -102,7 +102,7 @@ def model_info(estimator):
     model = collections.OrderedDict()
     model['class'] = type(estimator).__name__
     model['module'] = estimator.__module__
-    model['parameters'] = estimator.get_params()
+    model['parameters'] = sort_dict(estimator.get_params())
     return model

 def get_feature_df(estimator, features):
@@ -112,3 +112,10 @@ def get_feature_df(estimator, features):
         ('coefficient', coefficients),
     ])
     return feature_df
+
+def sort_dict(dictionary):
+    """
+    Return a dictionary as an OrderedDict sorted by keys.
+    """
+    items = sorted(dictionary.items())
+    return collections.OrderedDict(items)
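
For illustration, a quick usage sketch of the new helper; the estimator is arbitrary and the printed keys are just an example:

```python
import collections
from sklearn.linear_model import LogisticRegression

def sort_dict(dictionary):
    """Return a dictionary as an OrderedDict sorted by keys."""
    return collections.OrderedDict(sorted(dictionary.items()))

# get_params() returns a plain dict; sorting the keys makes the serialized
# model['parameters'] section deterministic across runs and Python versions.
params = sort_dict(LogisticRegression().get_params())
print(list(params)[:3])  # e.g. ['C', 'class_weight', 'dual']
```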