Support MolculeNet Dataset #165

Merged

corochann merged 14 commits into chainer:master from mottodora:molnet

Jun 4, 2018

Member

mottodora commented May 3, 2018

This PR adds support for MoleculeNet, a great collection of datasets for chemoinformatics.
MoleculeNet is released under the MIT license.
deepchem/deepchem#1193
This PR is going to support all MoleculeNet datasets except PDBBind, which includes protein structure features.

Support single-file CSV datasets (most of MoleculeNet)
Support Kaggle dataset (three CSV files for the train, valid and test datasets)
Support Random Splitting (train, valid, test)
handling missing values in classification tasks
Write Unit Tests
Write Documents

In this PR, these features are out of scope.

Support PDBBind dataset
Transformation
Stratified Splitting
Support SDF file
Index Splitting
handling missing values in regression tasks
Advanced imputation techniques for missing data
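
For illustration, the random train/valid/test splitting listed among the supported features could look roughly like the sketch below. The function name `random_split`, its fraction arguments, and the seed handling are assumptions made for this example and are not taken from the PR's code.

```python
import numpy


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=None):
    """Return shuffled, disjoint index arrays for train, valid and test.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    if not 0.0 <= frac_train + frac_valid <= 1.0:
        raise ValueError('frac_train + frac_valid must be in [0, 1]')
    # A seeded RandomState makes the split reproducible.
    perm = numpy.random.RandomState(seed).permutation(n_samples)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train_index = perm[:n_train]
    valid_index = perm[n_train:n_train + n_valid]
    test_index = perm[n_train + n_valid:]  # whatever remains
    return train_index, valid_index, test_index


train_index, valid_index, test_index = random_split(100, seed=0)
```

The three index arrays can then be used to slice a dataset into subsets. Stratified and index splitting, which are out of scope here, would need a different index construction.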

mottodora added 2 commits

May 3, 2018 20:56


          support simple dataset in MoleculeNet

0979d4e


          add example

205a289

codecov-io commented May 3, 2018

Codecov Report

Merging #165 into master will decrease coverage by 1.74%.
The diff coverage is 29.56%.

@@            Coverage Diff             @@
##           master     #165      +/-   ##
==========================================
- Coverage   78.35%   76.61%   -1.75%     
==========================================
  Files          80       87       +7     
  Lines        3423     3707     +284     
==========================================
+ Hits         2682     2840     +158     
- Misses        741      867     +126

mottodora and others added 5 commits

May 5, 2018 11:42


          Support Random Splitting (train, valid, test)

6ed1d29


          handling missing values in classification tasks

453206e


          fix typo

3afbdb8


          add unit tests

e398cfb


          flake8

e8d2e1c

mottodora changed the title ~~[WIP] Support MolculeNet~~ [WIP] Support MolculeNet Datasest

mottodora changed the title ~~[WIP] Support MolculeNet Datasest~~ [WIP] Support MolculeNet Dataset


          add documents

d28d92c

mottodora requested a review from corochann

May 7, 2018 10:37

mottodora changed the title ~~[WIP] Support MolculeNet Dataset~~ Support MolculeNet Dataset

Member Author

mottodora commented May 7, 2018

Can you take a look? @corochann

corochann reviewed


chainer_chemistry/datasets/molnet/molnet.py Outdated

+              import os
+              import shutil
+              import numpy as np

Member

corochann May 13, 2018

We follow Chainer's coding guidelines, which do not allow this abbreviation in library code.
Please use numpy.

chainer_chemistry/datasets/molnet/molnet.py

+                  Args:
+                      dataset_name (str): MoleculeNet dataset name. If you want to know the
+                          detail of MoleculeNet, please refer to
+                          `official site <http://moleculenet.ai/datasets-1>`_

Member

corochann May 13, 2018

When a user wants to know which dataset_name values are available in chainer_chemistry, I think they need to look at molnet_config.py.
So it would be nice to mention this in the docstring.

chainer_chemistry/datasets/molnet/molnet.py Outdated

+                          detail of MoleculeNet, please refer to
+                          `official site <http://moleculenet.ai/datasets-1>`_
+                      preprocessor (BasePreprocessor): Preprocessor.
+                          This sould be chose base on the network to be trained.

Member

corochann May 13, 2018

3 typos:

It should be chosen based on the network to be trained.

chainer_chemistry/datasets/molnet/molnet.py Outdated

+                      which is a vector of smiles for each example or `None`.
+                  """
+                  assert dataset_name in molnet_default_config

Member

corochann May 13, 2018

It might be more user-friendly to check with an if statement and raise a ValueError with a message.
Not high priority.
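
A minimal sketch of the suggested check follows. The dictionary here is only a stand-in for `molnet_default_config` with two illustrative entries, and the helper name `check_dataset_name` is hypothetical. Listing the available names in the message also addresses the earlier point about discovering valid `dataset_name` values.

```python
# Stand-in for chainer_chemistry's molnet_default_config (illustrative entries).
molnet_default_config = {
    'bbbp': {'task_type': 'classification'},
    'clearance': {'task_type': 'regression'},
}


def check_dataset_name(dataset_name, config=molnet_default_config):
    """Return the config entry, or raise ValueError for an unknown name."""
    if dataset_name not in config:
        raise ValueError(
            'dataset_name {!r} is not supported; available names: {}'
            .format(dataset_name, sorted(config)))
    return config[dataset_name]
```

Unlike `assert`, this check is still performed when Python runs with the `-O` flag, and the caller sees which names are valid.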

chainer_chemistry/datasets/molnet/molnet.py Outdated

+                      def postprocess_label(label_list):
+                          label_list = np.asarray(label_list)
+                          label_list[np.isnan(label_list)] = -1
+                          return label_list.astype(np.int32)

Member

corochann May 13, 2018

Could you add an else statement that raises ValueError? (It helps the IDE with type checking, etc.)
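
One way to read this request is sketched below, assuming the surrounding code selects `postprocess_label` by task type. The factory name `make_postprocess_label` and the regression branch are assumptions made for the example; only the classification body mirrors the diff above.

```python
import numpy


def make_postprocess_label(task_type):
    """Return a label post-processing function for the given task type.

    Hypothetical factory; the structure of the PR's real code may differ.
    """
    if task_type == 'classification':
        def postprocess_label(label_list):
            # Copy into a float array so missing entries (NaN) are detectable.
            label_array = numpy.array(label_list, dtype=numpy.float32)
            # Missing labels become -1, as in the diff above.
            label_array[numpy.isnan(label_array)] = -1
            return label_array.astype(numpy.int32)
    elif task_type == 'regression':
        def postprocess_label(label_list):
            # Missing values in regression tasks are out of scope; only cast.
            return numpy.array(label_list, dtype=numpy.float32)
    else:
        # Without this branch, postprocess_label could be left undefined.
        raise ValueError('unsupported task_type: {!r}'.format(task_type))
    return postprocess_label
```

With the `else` branch in place, every path either defines `postprocess_label` or raises, which is what static analysis tools need to see.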

examples/molnet/train_molnet.py

		from chainer_chemistry.datasets import NumpyTupleDataset
		from chainer_chemistry.datasets.molnet.molnet_config import molnet_default_config # NOQA

Member

corochann May 13, 2018

Now I feel we can put it in the library code... (maybe in the next PR).

Member Author

mottodora May 28, 2018

Is this about GraphConvPredictor ?

Member

corochann May 29, 2018

yes

examples/molnet/train_molnet.py Outdated

+                      print('preprocessing dataset...')
+                      preprocessor = preprocess_method_dict[method]()
+                      # only use first 100 for debug if num_data >= 0
+                      target_index = numpy.arangs(num_data) if num_data >= 0 else None

Member

corochann May 13, 2018

only use first num_data for debug if num_data >= 0
numpy.arange ??
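
The corrected expression, wrapped in a small helper for illustration (the name `make_target_index` is hypothetical and does not appear in the PR):

```python
import numpy


def make_target_index(num_data):
    """Return indices of the first num_data examples, or None to use all."""
    # numpy.arange (not "arangs") yields 0, 1, ..., num_data - 1.
    return numpy.arange(num_data) if num_data >= 0 else None
```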

tests/datasets_tests/molnet_tests/test_molnet.py


		from chainer_chemistry.dataset.preprocessors.atomic_number_preprocessor import AtomicNumberPreprocessor # NOQA
		from chainer_chemistry.datasets import molnet

Member

corochann May 13, 2018

Currently, only the bbbp and clearance datasets are tested?

Member Author

mottodora May 28, 2018

When I implemented it, I felt that testing one regression dataset and one classification dataset was enough. But if we want to support every dataset, we need to test all of them. (Is another PR OK?)

Member

corochann May 29, 2018

Another PR is OK.

chainer_chemistry/datasets/molnet/molnet_config.py Outdated

+                      "task_type": 'mix',
+                      # pIC50: regression
+                      # Class: classification
+                      "tasks": ["pIC50", "Class"],

Member

corochann May 14, 2018

Maybe we can separate this into a "bace_pIC50" dataset type and a "bace_Class" dataset type.
As long as the "url" is the same, I guess we can share the download cache :).

Member Author

mottodora May 28, 2018

👍

Member

corochann Jun 4, 2018

After the discussion, we concluded that the "mix" type should be treated as "mix", i.e. a multi-task formulation.

examples/molnet/train_molnet.py Outdated

+                  elif molnet_default_config[args.dataset]['task_type'] == 'classification':
+                      model = Classifier(model, lossfun=loss_fun, metrics_fun=metrics_fun,
+                                         device=args.gpu)
+                  # TODO(motoki): how to support bace dataset

Member

corochann May 14, 2018

Can we split "dataset_type" itself?

Member Author

mottodora May 28, 2018

Sorry, this comment is confusing. At that time, the dataset_type of the bace dataset was 'mix' (classification and regression).

Member

corochann May 29, 2018

I read the MoleculeNet paper, and it supports multi-task network training. It says that sharing network weights can reduce validation error. So we should try a classification + regression multi-task formulation.
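
As a rough illustration of what a classification + regression multi-task loss for the two bace tasks (pIC50 and Class) might look like, here is a plain NumPy sketch rather than Chainer code. The function name, the two-column output layout, the equal default weighting, and the use of -1 for missing class labels are all assumptions for this example, not decisions made in this PR.

```python
import numpy


def multitask_loss(pred, pic50, cls_label, weight=1.0):
    """Sum of a regression loss and a masked classification loss.

    pred: float array of shape (batch, 2); column 0 is the pIC50
        prediction, column 1 is the logit for Class.
    pic50: float array of shape (batch,).
    cls_label: int array of shape (batch,) with values 0, 1 or -1 (missing).
    """
    # Regression head: mean squared error.
    mse = numpy.mean((pred[:, 0] - pic50) ** 2)

    # Classification head: sigmoid cross entropy over labeled examples only.
    mask = cls_label != -1
    if mask.any():
        x = pred[:, 1][mask]
        t = cls_label[mask].astype(pred.dtype)
        # Numerically stable form: max(x, 0) - x * t + log(1 + exp(-|x|)).
        bce = numpy.mean(
            numpy.maximum(x, 0) - x * t
            + numpy.log1p(numpy.exp(-numpy.abs(x))))
    else:
        bce = 0.0
    return mse + weight * bce
```

Both heads read from the same prediction array, so in a real network the layers below the output would be shared between the two tasks, which is the weight sharing the comment refers to.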

Member

corochann commented May 14, 2018

I think the basic design is OK. Please see the comments for minor fixes/improvements.

Could you submit a test shell script for each dataset_type? (A separate PR is also fine.) I want to run and test it in my environment before merging this PR.

Member

delta2323 commented May 17, 2018

Should we deprecate the existing QM9 and Tox21 datasets once MoleculeNet is available? MoleculeNet has both datasets, although the details seem to differ (e.g. data splitting).

mottodora and others added 6 commits

May 28, 2018 11:57


          apply comments

93baa3a


          apply comments

a9ff803


          Merge branch 'molnet' of https://github.com/mottodora/chainer-chemistry…

4e6f4ce

… into molnet


          fix train_molnet.py

f2d18b4

fix

48fad1b


          add test script

Member

corochann commented Jun 4, 2018

Can you push chainer_chemistry/datasets/__init__.py?

corochann approved these changes


Member

corochann left a comment

LGTM overall, thank you!!!

corochann merged commit a8fd4d7 into chainer:master

corochann mentioned this pull request

Support MoleculeNet experiment TODO list #178

Closed

5 tasks

mottodora deleted the molnet branch

June 27, 2018 07:23

mottodora added this to the 0.4.0 milestone
