-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support MolculeNet Dataset #165
Conversation
Codecov Report
@@ Coverage Diff @@
## master #165 +/- ##
==========================================
- Coverage 78.35% 76.61% -1.75%
==========================================
Files 80 87 +7
Lines 3423 3707 +284
==========================================
+ Hits 2682 2840 +158
- Misses 741 867 +126 |
Can you take a look? @corochann |
import os | ||
import shutil | ||
|
||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are following chainer's coding guide that we won't use this abbreviation in the library code.
Please use numpy
.
Args: | ||
dataset_name (str): MoleculeNet dataset name. If you want to know the | ||
detail of MoleculeNet, please refer to | ||
`official site <http://moleculenet.ai/datasets-1>`_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When user wants to know what dataset_name
is available for chainer_chemistry, I think user needs to see molnet_config.py
.
So I feel it is nice to comment this fact.
detail of MoleculeNet, please refer to | ||
`official site <http://moleculenet.ai/datasets-1>`_ | ||
preprocessor (BasePreprocessor): Preprocessor. | ||
This sould be chose base on the network to be trained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 typo:
It should be chosen based on the network to be trained.
which is a vector of smiles for each example or `None`. | ||
|
||
""" | ||
assert dataset_name in molnet_default_config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be user friendly to check by if statement and raise ValueError
with message.
Not high priority.
def postprocess_label(label_list): | ||
label_list = np.asarray(label_list) | ||
label_list[np.isnan(label_list)] = -1 | ||
return label_list.astype(np.int32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add else
statement with ValueError
? (it helps IDE for the type checking etc).
from chainer_chemistry.datasets import NumpyTupleDataset | ||
from chainer_chemistry.datasets.molnet.molnet_config import molnet_default_config # NOQA | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, I fell we can put it in the library code... (maybe in next PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this about GraphConvPredictor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
examples/molnet/train_molnet.py
Outdated
print('preprocessing dataset...') | ||
preprocessor = preprocess_method_dict[method]() | ||
# only use first 100 for debug if num_data >= 0 | ||
target_index = numpy.arangs(num_data) if num_data >= 0 else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only use first num_data
for debug if num_data >= 0
numpy.arange ??
|
||
from chainer_chemistry.dataset.preprocessors.atomic_number_preprocessor import AtomicNumberPreprocessor # NOQA | ||
from chainer_chemistry.datasets import molnet | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, only bbbp and clearance dataset test are supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I implemented it, I feel testing one regression and another classification dataset are enough. But if we would like to support dataset, we need to test all dataset.(other PR is OK?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other pr is ok.
"task_type": 'mix', | ||
# pIC50: regression | ||
# Class: classification | ||
"tasks": ["pIC50", "Class"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can separate "bace_pIC50" dataset type and "bace_Class" dataset type.
As long as "url" is same, I guess we can share the download cache :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after the discussion, we conluded that "mix" type should be treated as "mix" - multi-task formulation.
examples/molnet/train_molnet.py
Outdated
elif molnet_default_config[args.dataset]['task_type'] == 'classification': | ||
model = Classifier(model, lossfun=loss_fun, metrics_fun=metrics_fun, | ||
device=args.gpu) | ||
# TODO(motoki): how to support bace dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can split "dataset_type" itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this comment is confusing. At that time dataset_type
of bace dataset is 'mix' (classification and regression).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the MoleculeNet paper, and they are supporting MultiTask network training. It was written that sharing the network weight can reduce validation error. So we should try classification + regression multitask formulation.
I think basic design is ok. Please see comments for minor fix/improvement. Could you submit test shell script for each dataset_type? (in other PR is also fine) I want to execute & test in my environment before merging this PR. |
Should we make existing QM9 and Tox21 dataset deprecated when MoleculeNet is available, as MoleculeNet has both datasets, although detail seems to be different (e.g. data splitting)? |
Can you push |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, thank you!!!
This PR supports MoleculeNet. It is great datasets for chemoinformatics.
The license of MoleculeNet is MIT.
deepchem/deepchem#1193
This PR are going to support all dataset of MoleculeNet except PDBBind that includes protein structure features.
In this PR, these features are out of scope.