Data preprocessing with coarse graining does not seem to work #31

stratisMarkou · 2023-06-26T14:59:55Z

Following the README instructions, I downloaded the CrossDocked dataset, unzipped it, and am trying to run the preprocessing script on it with and without the --ca_only flag.

Running

python process_crossdock.py my_data_dir --no_H

runs without errors, but running

python process_crossdock.py .data --no_H --ca_only

fails, giving the error

KeyError "'R' not in amino acid dict (.data/crossdocked_pocket10/WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0_pocket10.pdb, .data/crossdocked_pocket10/WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0.sdf)" WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0_pocket10.pdb WNK1_HUMAN_202_483_0/5tf9_A_rec_5wdy_a6s_lig_tt_min_0.sdf
#failed: 10: 100%|█████████████| 10/10 [00:00<00:00, 128.31it/s]
Traceback (most recent call last):
  File "/home/stratis/repos/DiffSBDD/process_crossdock.py", line 364, in <module>
    lig_coords = np.concatenate(lig_coords, axis=0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: need at least one array to concatenate

It looks like in the second case the script fails to find certain entries in the amino acid dict and skips all protein-ligand complexes, which leaves lig_coords as an empty list that can't be concatenated. Looking at the dataset_params dictionary, there seem to be two sets of preprocessing parameters, crossdock_full and crossdock. Changing line 24 in the preprocessing script from dataset_info = dataset_params['crossdock_full'] to dataset_info = dataset_params['crossdock'] and running the preprocessing with --ca_only works without any errors, but I'm not sure the resulting data is correctly preprocessed. Is there something wrong with the preprocessing script, or am I doing something wrong on my side?
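
For reference, the final ValueError can be reproduced in isolation. The snippet below is only a minimal sketch, not code from process_crossdock.py: the helper name concat_or_fail is hypothetical, and it assumes the script collects one array per successfully processed complex. It shows that np.concatenate raises exactly this error on an empty list, and how a guard could report the real cause (every complex was skipped) instead.

```python
import numpy as np


def concat_or_fail(arrays, name):
    """Concatenate per-complex arrays along axis 0.

    Hypothetical helper: raises a descriptive error when no complex
    survived preprocessing, instead of NumPy's generic ValueError.
    """
    if len(arrays) == 0:
        raise RuntimeError(
            f"'{name}' is empty: every complex was skipped during preprocessing"
        )
    return np.concatenate(arrays, axis=0)


# Calling np.concatenate on an empty list reproduces the error in the traceback.
try:
    np.concatenate([], axis=0)
    message = None
except ValueError as err:
    message = str(err)

# With at least one array, the helper behaves like np.concatenate.
stacked = concat_or_fail([np.zeros((2, 3)), np.ones((1, 3))], "lig_coords")
```

A guard like this would not fix the underlying KeyError, but it might make the failure easier to diagnose when all complexes are rejected.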


arneschneuing · 2023-07-25T16:30:26Z

Hi Stratis,
I think the process_crossdock.py file is indeed outdated and should be updated. As far as I can tell, your solution should be fine as a temporary fix because dataset_params['crossdock'] contains the correct amino acid types required for the coarse-grained model (maybe @yuanqidu can confirm). We will try to upload a correct version as soon as possible.
Sorry for the inconvenience!
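
As a rough sanity check for the point about amino acid types, one could verify that whichever residue mapping is selected covers all 20 standard one-letter codes before running with --ca_only. This is a sketch under assumptions: the two toy dictionaries below are hypothetical and are not the actual contents of dataset_params, and the name of the key that holds the residue mapping is not shown in this thread.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def missing_residue_codes(aa_encoder):
    """Return the standard one-letter residue codes absent from a mapping."""
    return sorted(set(STANDARD_AA) - set(aa_encoder))


# Hypothetical mappings for illustration only (not taken from the repository).
toy_complete = {aa: i for i, aa in enumerate(STANDARD_AA)}
toy_incomplete = {"C": 0, "N": 1, "O": 2, "S": 3}

complete_missing = missing_residue_codes(toy_complete)
incomplete_missing = missing_residue_codes(toy_incomplete)
```

If a mapping reports 'R' (arginine) as missing, a lookup for that residue would fail with a KeyError similar to the one reported above.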

arneschneuing assigned yuanqidu Jul 25, 2023
