Data Explanation

For more detailed information, please refer to the DeepDTA article.

Similarity files

For each dataset, there are two similarity files, drug-drug and target-target similarities.

  • Drug-drug similarities were obtained via PubChem structure clustering.
  • Target-target similarities were obtained via Smith-Waterman (S-W) similarity.

These files were used to reproduce the results of two other methods (Pahikkala et al., 2017; He et al., 2017) and for some of the experiments with the DeepDTA model; please refer to the paper for details.

  • The original Davis data and more explanation can be found here.
  • The original KIBA data and more explanation can be found here.
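As an illustration of how a similarity file might be read, here is a minimal sketch. It assumes each similarity file is a whitespace-delimited square matrix in plain text; that layout, the helper name `load_similarity`, and the demo file name are assumptions for illustration and are not taken from the repository.

```python
import os
import tempfile

import numpy as np


def load_similarity(path):
    """Load a whitespace-delimited square similarity matrix from a text file."""
    sim = np.loadtxt(path)
    if sim.ndim != 2 or sim.shape[0] != sim.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {sim.shape}")
    return sim


# Demo with a tiny made-up 3x3 matrix written to a temporary file.
demo = np.array([[1.0, 0.4, 0.2],
                 [0.4, 1.0, 0.7],
                 [0.2, 0.7, 1.0]])
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_similarity.txt")  # hypothetical file name
    np.savetxt(demo_path, demo)
    sim = load_similarity(demo_path)

print(sim.shape)
```

With real data, the number of rows should match the number of drugs (or targets) in the corresponding dataset.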

Binding affinity files

  • For the Davis dataset, the standard value is Kd in nM. In the article, we used the transformation below:

    pKd = -log10(Kd / 1e9)

  • For the KIBA dataset, the standard value is the KIBA score. There are two versions of the binding affinity txt file: one with the original values and one with the transformed values (more information here). In the article we used the transformed form.

  • nan values indicate there is no experimental value for that drug-target pair.
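A minimal sketch of the Davis transformation, assuming the form pKd = -log10(Kd / 1e9) given in the DeepDTA article, with Kd in nM. The function name `kd_nm_to_pkd` and the small demo matrix are made up for illustration; NaN entries (missing measurements) simply stay NaN.

```python
import numpy as np


def kd_nm_to_pkd(kd_nm):
    """Convert Kd values in nM to pKd = -log10(Kd / 1e9); NaN stays NaN."""
    kd_nm = np.asarray(kd_nm, dtype=float)
    return -np.log10(kd_nm / 1e9)


# Made-up 2x2 affinity matrix in nM; NaN marks a missing drug-target pair.
Y_demo = np.array([[10000.0, np.nan],
                   [1.0, 100.0]])
pkd = kd_nm_to_pkd(Y_demo)
print(pkd)
```

Because the division by 1e9 converts nM to M, weaker binders (larger Kd) map to smaller pKd values.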

Train and test folds

There are two files for each dataset: a train fold file and a test fold file. Both files store positions that identify binding affinity values in the binding affinity matrices given in the text files.

  • Since we performed 5-fold cross-validation, the train fold file contains five different sets of positions.
  • The test set is the same for all five training sets.

Using the folds

  • Load the affinity matrix Y:

    import pickle
    import numpy as np

    # If loading fails under Python 3, try pickle.load(f, encoding='latin1')
    with open("Y", "rb") as f:
        Y = pickle.load(f)
    # Row (drug) and column (protein) indices of all non-NaN entries of Y
    label_row_inds, label_col_inds = np.where(~np.isnan(Y))

  • label_row_inds: drug indices for the corresponding affinity matrix positions (flattened).
    E.g. the 36275th point in the KIBA Y matrix indicates the 364th drug (same order as in the SMILES file):

    label_row_inds[36275]

  • label_col_inds: protein indices for the corresponding affinity matrix positions (flattened).
    E.g. the 36275th point in the KIBA Y matrix indicates the 120th protein (same order as in the protein sequence file):

    label_col_inds[36275]

  • You can then load the fold files as follows:

    import json

    # yourdir: path to the dataset directory (placeholder), with a trailing slash
    with open(yourdir + "folds/test_fold_setting1.txt") as f:
        test_fold = json.load(f)
    with open(yourdir + "folds/train_fold_setting1.txt") as f:
        train_folds = json.load(f)

    test_drug_indices = label_row_inds[test_fold]
    test_protein_indices = label_col_inds[test_fold]

    Remember that train_folds contains an array of 5 lists, each of which corresponds to a training set.
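The steps above can be put together as in the following sketch, which turns a list of fold positions into (drug index, protein index, affinity) triples. It runs on a small synthetic matrix rather than the real files; the helper `fold_to_triples` and the made-up fold lists are illustrative assumptions (the real train fold file holds 5 lists, the demo only 3).

```python
import numpy as np

# Synthetic stand-in for an affinity matrix: 3 drugs x 4 proteins, NaN = no measurement.
Y = np.array([[5.0, np.nan, 7.2, 6.1],
              [np.nan, 8.3, 5.5, np.nan],
              [6.6, 7.0, np.nan, 5.9]])

# Indices of the measured entries, in row-major (flattened) order.
label_row_inds, label_col_inds = np.where(~np.isnan(Y))


def fold_to_triples(Y, row_inds, col_inds, positions):
    """Map fold positions to drug indices, protein indices and affinity values."""
    positions = np.asarray(positions, dtype=int)
    drugs = row_inds[positions]
    proteins = col_inds[positions]
    return drugs, proteins, Y[drugs, proteins]


# Made-up fold lists over the 8 measured entries (hypothetical, for the demo only).
test_fold = [0, 5]
train_folds = [[1, 2], [3, 4], [6, 7]]

test_drugs, test_proteins, test_affinities = fold_to_triples(
    Y, label_row_inds, label_col_inds, test_fold)
train_sets = [fold_to_triples(Y, label_row_inds, label_col_inds, fold)
              for fold in train_folds]

print(test_drugs, test_proteins, test_affinities)
```

The drug and protein indices returned here can then be used to look up the matching SMILES strings and protein sequences, which are stored in the same order.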