Issues with pre-training on new dataset and dataloading #55

engmubarak48 · 2022-09-03T13:39:34Z

Hi, I hope you are doing well. I enjoyed reading the paper.

I wanted to test how your approach works on other datasets, for example the OGB benchmark datasets. I noticed that your BIO and CHEM datasets are preprocessed differently, and I could not find the code you used to prepare them; I only see the prepared datasets stored in ZIP files. I tried to read your data (the BIO one in particular) with your provided data loaders so that I could infer its structure. With the structure in hand, I could write my own code to preprocess other datasets the same way. Unfortunately, I hit the issue below: I cannot even look at a batch after reading your data, because it was built with earlier versions of PyTorch and PyTorch Geometric. I tried to install the old versions but couldn't find a compatible combination that works. Using old versions is also a hassle: it requires installing new GPU drivers and changing the versions of every other library or dependency.

For instance, if I read your BIO data:

from dataloader import DataLoaderSubstructContext
from loader import BioDataset
from util import ExtractSubstructureContextPair
from argparse import Namespace
import multiprocessing

args = Namespace(
    batch_size=4,
    l1=1,
    center=0,
)
args.num_workers = multiprocessing.cpu_count()

root_unsupervised = '/content/pretrain-gnns/bio/dataset/dataset/unsupervised'
dataset = BioDataset(
    root_unsupervised,
    data_type='unsupervised',
    transform=ExtractSubstructureContextPair(l1=args.l1, center=args.center),
)

loader = DataLoaderSubstructContext(
    dataset,
    batch_size=args.batch_size,
    shuffle=True,
    num_workers=args.num_workers,
)

Getting a batch then throws an error:

batch = next(iter(loader))

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-ef07b0600217> in <module>
----> 1 for step, batch in enumerate(loader):
      2     print('batch: ', batch)
      3     break



3 frames

/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
    459             # instantiate since we don't know how to
    460             raise RuntimeError(msg) from None
--> 461         raise exception
    462 
    463 

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/dataset.py", line 197, in __getitem__
    data = self.get(self.indices()[idx])
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/in_memory_dataset.py", line 89, in get
    decrement=False,
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/separate.py", line 23, in separate
    for batch_store, data_store in zip(batch.stores, data.stores):
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/data.py", line 486, in stores
    return [self._store]
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/data.py", line 424, in __getattr__
    "The 'data' object was created by an older version of PyG. "
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.

The message says to remove 'processed/', which doesn't make sense here, and doing so does not fix the error anyway.
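
The traceback suggests the check is structural: `stores` reads `self._store`, and when that attribute is absent the lookup falls through to `__getattr__`, which raises. Below is a minimal sketch of that mechanism using a hypothetical stand-in class. `LegacyLikeData` is not PyG's `Data`, and `is_legacy` and `legacy_fields` are illustrative helpers, not library functions. It shows how one might still list the fields an old object carries by reading its instance dictionary instead of going through attribute access.

```python
class LegacyLikeData:
    """Hypothetical stand-in for an object unpickled from an older library version."""

    def __getattr__(self, key):
        # __getattr__ only runs when normal lookup fails, e.g. for a missing '_store'.
        if "_store" not in self.__dict__:
            raise RuntimeError("object was created by an older version")
        return getattr(self.__dict__["_store"], key)


def is_legacy(obj):
    """True if the instance lacks the '_store' attribute that newer code expects."""
    return "_store" not in vars(obj)


def legacy_fields(obj):
    """Map each instance attribute name to its type name, bypassing __getattr__."""
    return {name: type(value).__name__ for name, value in vars(obj).items()}


# Simulate unpickling: pickle restores __dict__ without calling __init__.
old = LegacyLikeData.__new__(LegacyLikeData)
old.__dict__.update({"x": [[0.0, 1.0]], "edge_index": [[0], [1]]})

print(is_legacy(old))
print(legacy_fields(old))
```

On a real processed file, calling `vars(...)` on the loaded object might play the same role, but that is an assumption that has not been checked against the BIO files.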

Issue 2: If I use an OGB benchmark dataset, for example 'ogbl-collab', it still cannot be used with your BioDataset or DataLoaderSubstructContext because of structural or preprocessing differences.

root_unsupervised = '/content/pretrain-gnns/bio/ogbl_collab'
dataset = BioDataset(root_unsupervised, data_type='unsupervised', transform = ExtractSubstructureContextPair(l1 = args.l1, center = args.center))

This also throws an error, even though the raw and processed directories are in the path:

NotImplementedError: Must indicate valid location of raw data. No download allowed
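
A likely cause, assuming `BioDataset` follows the usual PyG convention: the base dataset class calls `download()` whenever any name listed in `raw_file_names` is missing under `<root>/raw`, and this repository's `download()` appears to simply raise. A small check like the one below can show which expected files are absent. The file names in the demo are placeholders, not the names `BioDataset` actually expects; those would have to be read from `raw_file_names` in `loader.py`.

```python
import os
import tempfile


def missing_raw_files(root, raw_file_names):
    """Return the expected raw file names that are absent from <root>/raw."""
    raw_dir = os.path.join(root, "raw")
    return [name for name in raw_file_names
            if not os.path.exists(os.path.join(raw_dir, name))]


# Demo on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "raw"))
    open(os.path.join(root, "raw", "present.txt"), "w").close()
    print(missing_raw_files(root, ["present.txt", "absent.txt"]))
```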

I checked the difference with the BIO directory and realized the OGB dataset comes as CSV files compressed as .gz.
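
As a starting point for inspecting those files, a gzip-compressed CSV edge list can be read with the standard library alone. This is a sketch under assumptions: it presumes a headerless two-column file of integer node indices, and the file name `edge.csv.gz` is used only for illustration. It does not produce the format `BioDataset` expects; it only gets the edges into memory.

```python
import csv
import gzip
import os
import tempfile


def read_edge_list(path):
    """Read a headerless, two-column gzipped CSV into a list of (src, dst) int pairs."""
    with gzip.open(path, "rt", newline="") as handle:
        return [(int(src), int(dst)) for src, dst in csv.reader(handle)]


def to_edge_index(edges):
    """Convert (src, dst) pairs to two parallel lists, the [2, num_edges] layout of edge_index."""
    sources = [src for src, _ in edges]
    targets = [dst for _, dst in edges]
    return [sources, targets]


# Demo: write a tiny gzipped edge list, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edge.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle).writerows([(0, 1), (1, 2), (2, 0)])
    edges = read_edge_list(path)

print(edges)
print(to_edge_index(edges))
```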

My question is simple: how can I reproduce your work on other datasets? Is there a link where you already explain this? Or, more simply, it would be nice if you could explain what modifications I should make to the OGB benchmark datasets to adapt them to your data loading code.

Thanks for the work. I look forward to your suggestions.

NB:

I ran the code on Colab Pro.
Related question, but not helpful: https://stackoverflow.com/questions/70325327/how-to-download-an-older-version-of-pytorch-geometric-in-google-colab


sugarlemons · 2022-09-18T05:58:22Z

I faced the same issue and it got me into trouble. I wish the authors would give a solution.

huangzizheng01 · 2022-12-30T16:13:14Z

Got the same issue :(

heamina · 2023-12-18T07:34:49Z

I ran into the same problem. I hope the authors can help, please~🫡

PanPapag · 2024-07-23T21:50:12Z

Same here 🙁
I would really appreciate an answer from the authors.
