Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify API docs for a minimalist design #24

Merged
merged 3 commits into from
Sep 30, 2016
Merged

Conversation

dhimmel
Copy link
Member

@dhimmel dhimmel commented Sep 21, 2016

Motivation

Modify the API to be more simplistic and to align with the current state of cancer-data.

Implementation Notes

This is a work in progress (WIP). Please comment of the diffs for more information on why I'm proposing certain changes.

| entrezid | integer | Primary Key. |
| systematic_name | string | |
| standard_name | string | |
| entrez_gene_id | integer | Primary Key. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually coming from https://pypi.python.org/pypi/django-genes/0.2 do we want to roll our own instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah didn't realize it. Let's stick with django-genes. Just talked @rzelayafavila -- a programmer in the Greene Lab who helped make django-genes. He said we will also need to install django-organisms for the organism table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue to check for Python 3 compatibility of django-genes.

@awm33
Copy link
Member

awm33 commented Sep 22, 2016

@dhimmel Why remove the algorithm classifier field and model? I know we're starting with one algorithm, but are we now only going to have one going forward?

@dhimmel
Copy link
Member Author

dhimmel commented Sep 22, 2016

Why remove the algorithm classifier field and model?

Because I'm not sure exactly how we want to store classifier information. Once the machine learning module matures, we'll have a better idea of how to encode different classifier options. I did think your setup of algorithm (string) and algorithm_parameters (JSON) was versatile. However, the machine learning module in cognoma/machine-learning#51 doesn't include a framework for accepting algorithm_parameters.

cognoma/machine-learning#51 does however output some classifier information such as:

{  
   "class": "SGDClassifier",
   "module": "sklearn.linear_model.stochastic_gradient",
   "parameters": {  
      "alpha": 0.1,
      "average": false,
      "class_weight": "balanced",
      "epsilon": 0.1,
      "eta0": 0.0,
      "fit_intercept": true,
      "l1_ratio": 0.0,
      "learning_rate": "optimal",
      "loss": "log",
      "n_iter": 5,
      "n_jobs": 1,
      "penalty": "elasticnet",
      "power_t": 0.5,
      "random_state": 0,
      "shuffle": true,
      "verbose": 0,
      "warm_start": false
   }
}

@awm33 given these considerations, do you think it makes sense to hold out on making algorithm fields until we know exactly what fields we need and have a way for the machine learning module to accept the input?

| mutation_type | string | |
| gene | object | Gene associated with this mutation. |
| sample_id | string | Foreign Key referencing samples |
| entrez_gene_id | object | Foreign Key referencing a mutated gene for the sample |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want to keep the gene as an optional join? We could also wait on the frontend for this. What that would mean is that you wouldn't have to pull all the joins and lookup the gene id to do things like display the name of the gene

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want to keep the gene as an optional join?

Are you asking whether we should add additional columns here for gene_symbol and gene_description?


| Field | Type | Description |
| ------------- |:-------------:| ----------:|
| taxonomy_id | integer | Primary Key. Taxonomy ID assigned by NCBI. |
| common_name | string | Organism common name, e.g. 'Human' |
| scientific_name | string | Organism scientific/binomial name, e.g. 'Homo sapiens' |

### Sample Mutation Summary (embedded in Gene)
### Disease Types (/diseases)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this disease or disease type? Disease is used in the sample model, but I think we should pick one "disease" or "disease type" and stick with it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use disease rather than disease type. It's shorter and is what's used in samples.tsv

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this setup "disease" is in the long format (e.g. Glioblastoma Multiforme) - alternatively, we could also use "acronym" (e.g. GBM).

Copy link
Member Author

@dhimmel dhimmel Sep 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created an issue to discuss acronyms: cognoma/cancer-data#26 (comment). My worry is that using acronyms as primary identifiers is dangerous. For example, if a Xena disease name changes or a new disease gets added, things will break.

@awm33
Copy link
Member

awm33 commented Sep 24, 2016

@dhimmel I can see removing the parameters, but don't understand why removing the parameters unless we are only using one algorithm going forward.

Would the example JSON above be the classifier output for the JSON results object? Or potentially a subobject along with other result objects?

@dhimmel
Copy link
Member Author

dhimmel commented Sep 26, 2016

I can see removing the parameters, but don't understand why removing the parameters unless we are only using one algorithm going forward.

Basically we don't yet know how to specify an algorithm via parameters yet. So should we keep a field that we're not going to use initially or just add it later. Up to you.

Would the example JSON above be the classifier output for the JSON results object?

The above JSON is a subobject of classifiers.results. See here for an example of the entire classifiers.results object.

@dhimmel
Copy link
Member Author

dhimmel commented Sep 29, 2016

Pinging @awm33. Would love to get this PR merged.

@awm33
Copy link
Member

awm33 commented Sep 30, 2016

@dhimmel I think I worded my response to the algorithm fields weirdly. I meant removing algorithm_parameters but keeping algorithm in order to denote what algorithm was used. I suppose we could backfill it once we support more than one algorithm. I assume they would all be "SGDClassifier" until that happens.

If you think the backfill approoch works, than that's fine to remove. Let's just make sure we don't add an algorithm and not note which one was used.

I'm not assigned as the review, but approved 👍

@dhimmel
Copy link
Member Author

dhimmel commented Sep 30, 2016

Okay I think the backfill plan makes sense, especially since I'm not sure SGDClassifier is the best name.

Will merge.

@dhimmel dhimmel merged commit c320aac into cognoma:master Sep 30, 2016
@dhimmel dhimmel deleted the mvp-api branch September 30, 2016 14:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants