Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support MolculeNet Dataset #165

Merged
merged 14 commits into from
Jun 4, 2018
Merged

Support MolculeNet Dataset #165

merged 14 commits into from
Jun 4, 2018

Conversation

mottodora
Copy link
Member

@mottodora mottodora commented May 3, 2018

This PR supports MoleculeNet. It is great datasets for chemoinformatics.
The license of MoleculeNet is MIT.
deepchem/deepchem#1193
This PR are going to support all dataset of MoleculeNet except PDBBind that includes protein structure features.

  • Support one file csv dataset(almost of MoleculeNet
  • Support Kaggle dataset (there are 3 CSV files for train, valid and test dataset)
  • Support Random Splitting (train, valid, test)
  • handling missing values in classification tasks
  • Write Unit Tests
  • Write Documents

In this PR, these features are out of scope.

  • Support PDBBind dataset
  • Transformation
  • Stratified Splitting
  • Support SDF file
  • Index Splitting
  • handling missing values in regression tasks
  • Advanced imputation techniques for missing data

@codecov-io
Copy link

codecov-io commented May 3, 2018

Codecov Report

Merging #165 into master will decrease coverage by 1.74%.
The diff coverage is 29.56%.

@@            Coverage Diff             @@
##           master     #165      +/-   ##
==========================================
- Coverage   78.35%   76.61%   -1.75%     
==========================================
  Files          80       87       +7     
  Lines        3423     3707     +284     
==========================================
+ Hits         2682     2840     +158     
- Misses        741      867     +126

@mottodora mottodora changed the title [WIP] Support MolculeNet [WIP] Support MolculeNet Datasest May 7, 2018
@mottodora mottodora changed the title [WIP] Support MolculeNet Datasest [WIP] Support MolculeNet Dataset May 7, 2018
@mottodora mottodora requested a review from corochann May 7, 2018 10:37
@mottodora mottodora changed the title [WIP] Support MolculeNet Dataset Support MolculeNet Dataset May 7, 2018
@mottodora
Copy link
Member Author

Can you take a look? @corochann

import os
import shutil

import numpy as np
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are following chainer's coding guide that we won't use this abbreviation in the library code.
Please use numpy.

Args:
dataset_name (str): MoleculeNet dataset name. If you want to know the
detail of MoleculeNet, please refer to
`official site <http://moleculenet.ai/datasets-1>`_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When user wants to know what dataset_name is available for chainer_chemistry, I think user needs to see molnet_config.py.
So I feel it is nice to comment this fact.

detail of MoleculeNet, please refer to
`official site <http://moleculenet.ai/datasets-1>`_
preprocessor (BasePreprocessor): Preprocessor.
This sould be chose base on the network to be trained.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 typo:

It should be chosen based on the network to be trained.

which is a vector of smiles for each example or `None`.

"""
assert dataset_name in molnet_default_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be user friendly to check by if statement and raise ValueError with message.
Not high priority.

def postprocess_label(label_list):
label_list = np.asarray(label_list)
label_list[np.isnan(label_list)] = -1
return label_list.astype(np.int32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add else statement with ValueError? (it helps IDE for the type checking etc).

from chainer_chemistry.datasets import NumpyTupleDataset
from chainer_chemistry.datasets.molnet.molnet_config import molnet_default_config # NOQA


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, I fell we can put it in the library code... (maybe in next PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this about GraphConvPredictor ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

print('preprocessing dataset...')
preprocessor = preprocess_method_dict[method]()
# only use first 100 for debug if num_data >= 0
target_index = numpy.arangs(num_data) if num_data >= 0 else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only use first num_data for debug if num_data >= 0
numpy.arange ??


from chainer_chemistry.dataset.preprocessors.atomic_number_preprocessor import AtomicNumberPreprocessor # NOQA
from chainer_chemistry.datasets import molnet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, only bbbp and clearance dataset test are supported?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I implemented it, I feel testing one regression and another classification dataset are enough. But if we would like to support dataset, we need to test all dataset.(other PR is OK?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other pr is ok.

"task_type": 'mix',
# pIC50: regression
# Class: classification
"tasks": ["pIC50", "Class"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can separate "bace_pIC50" dataset type and "bace_Class" dataset type.
As long as "url" is same, I guess we can share the download cache :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the discussion, we conluded that "mix" type should be treated as "mix" - multi-task formulation.

elif molnet_default_config[args.dataset]['task_type'] == 'classification':
model = Classifier(model, lossfun=loss_fun, metrics_fun=metrics_fun,
device=args.gpu)
# TODO(motoki): how to support bace dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can split "dataset_type" itself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this comment is confusing. At that time dataset_type of bace dataset is 'mix' (classification and regression).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the MoleculeNet paper, and they are supporting MultiTask network training. It was written that sharing the network weight can reduce validation error. So we should try classification + regression multitask formulation.

@corochann
Copy link
Member

I think basic design is ok. Please see comments for minor fix/improvement.

Could you submit test shell script for each dataset_type? (in other PR is also fine) I want to execute & test in my environment before merging this PR.

@delta2323
Copy link
Member

Should we make existing QM9 and Tox21 dataset deprecated when MoleculeNet is available, as MoleculeNet has both datasets, although detail seems to be different (e.g. data splitting)?

@corochann
Copy link
Member

Can you push chainer_chemistry/datasets/__init__.py?

Copy link
Member

@corochann corochann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, thank you!!!

@corochann corochann merged commit a8fd4d7 into chainer:master Jun 4, 2018
@mottodora mottodora deleted the molnet branch June 27, 2018 07:23
@mottodora mottodora added this to the 0.4.0 milestone Jul 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants