Issues with pre-training on new dataset and dataloading #55

Open
engmubarak48 opened this issue Sep 3, 2022 · 4 comments


engmubarak48 commented Sep 3, 2022

Hi, I hope you are doing well. I enjoyed reading the paper.

I wanted to test how your approach works on other datasets, for example the OGB benchmark datasets. I realized that your BIO and CHEM datasets are preprocessed differently, and I could not find the code you used to prepare them; I only see the prepared datasets stored in ZIP files. I tried to read your data (the BIO one in particular) with your provided data loaders so that I could infer its structure and then write my own code to preprocess other datasets the same way. Unfortunately, I hit the issue below: I cannot even look at a batch after loading your data, because it was built with older versions of PyTorch and PyTorch Geometric. I tried to install the old versions but could not find a compatible combination that works. Using old versions is also cumbersome, since it requires installing new GPU drivers and changing the versions of every other library or dependency.

For instance, if I read your BIO data:

from dataloader import DataLoaderSubstructContext
from loader import BioDataset
from util import ExtractSubstructureContextPair
from argparse import Namespace
import multiprocessing

args = Namespace(
    batch_size=4,
    l1=1,
    center=0,
)
args.num_workers = multiprocessing.cpu_count()

root_unsupervised = '/content/pretrain-gnns/bio/dataset/dataset/unsupervised'
dataset = BioDataset(
    root_unsupervised,
    data_type='unsupervised',
    transform=ExtractSubstructureContextPair(l1=args.l1, center=args.center),
)

loader = DataLoaderSubstructContext(dataset, batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers)

Getting a batch then throws an error:

batch = next(iter(loader))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-ef07b0600217> in <module>
----> 1 for step, batch in enumerate(loader):
      2     print('batch: ', batch)
      3     break



3 frames

/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
    459             # instantiate since we don't know how to
    460             raise RuntimeError(msg) from None
--> 461         raise exception
    462 
    463 

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/dataset.py", line 197, in __getitem__
    data = self.get(self.indices()[idx])
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/in_memory_dataset.py", line 89, in get
    decrement=False,
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/separate.py", line 23, in separate
    for batch_store, data_store in zip(batch.stores, data.stores):
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/data.py", line 486, in stores
    return [self._store]
  File "/usr/local/lib/python3.7/dist-packages/torch_geometric/data/data.py", line 424, in __getattr__
    "The 'data' object was created by an older version of PyG. "
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.

The message says to remove 'processed/', which doesn't make sense here, and even if I do that it still does not work.
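For reference, the step suggested by the error message amounts to moving the cached `processed/` folder out of the way so that the dataset class rebuilds it from `raw/`. Below is a minimal sketch using only the standard library. The helper name `set_aside_processed` and the `_old` suffix are hypothetical, not part of this repository or of PyG. It renames the folder instead of deleting it, so the original files can be restored. This can only help when the raw files and the dataset's processing code are both available, which may be why it does not work here.

```python
from pathlib import Path
from typing import Optional


def set_aside_processed(root: str, suffix: str = "_old") -> Optional[Path]:
    """Rename <root>/processed to <root>/processed<suffix>.

    Hypothetical helper: returns the backup path, or None when there is
    no processed/ directory to move.
    """
    processed = Path(root) / "processed"
    if not processed.is_dir():
        return None
    backup = processed.with_name(processed.name + suffix)
    if backup.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(f"backup already exists: {backup}")
    processed.rename(backup)
    return backup
```

Calling `set_aside_processed(root_unsupervised)` before constructing the dataset would trigger a rebuild on the next load, assuming the dataset class can regenerate the processed files.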

Issue 2: If I use an OGB benchmark dataset, for example 'ogbl-collab', it still cannot be used with your BioDataset or DataLoaderSubstructContext because of structural or preprocessing differences.

root_unsupervised = '/content/pretrain-gnns/bio/ogbl_collab'
dataset = BioDataset(root_unsupervised, data_type='unsupervised', transform=ExtractSubstructureContextPair(l1=args.l1, center=args.center))

This also throws an error, even though the raw and processed directories are in the path:

NotImplementedError: Must indicate valid location of raw data. No download allowed
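A note on where this error likely comes from: PyG dataset classes call their `download()` method whenever a file listed in `raw_file_names` is missing from `<root>/raw`, so the message suggests that the file names the dataset class expects were not found, even though a `raw` directory exists. The sketch below is one way to check this with the standard library. The helper name `missing_raw_files` is hypothetical, and the expected file names must be taken from the dataset class itself; the names in the usage comment are placeholders.

```python
from pathlib import Path
from typing import Iterable, List


def missing_raw_files(root: str, expected: Iterable[str]) -> List[str]:
    """Return the expected raw file names that are absent from <root>/raw."""
    raw_dir = Path(root) / "raw"
    return [name for name in expected if not (raw_dir / name).exists()]


# Usage with placeholder names (assumption: replace them with the names
# the dataset class actually lists in raw_file_names):
# missing_raw_files('/content/pretrain-gnns/bio/ogbl_collab', ['name_a', 'name_b'])
```

An empty result would mean the raw files are all in place and the cause lies elsewhere.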

I compared it with the BIO directory and realized that the OGB dataset comes as CSV files compressed as .gz.
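To inspect such a file without any graph library, a gzip-compressed CSV can be read with the standard library alone. This is a minimal sketch under stated assumptions: the file is headerless and holds one `source,target` pair of integer node ids per row, and the function name `read_edge_list` is hypothetical. It only shows the raw structure and does not convert anything into the format the BIO loaders expect.

```python
import csv
import gzip
from typing import List, Tuple


def read_edge_list(path: str) -> List[Tuple[int, int]]:
    """Read a headerless .csv.gz file with one 'src,dst' integer pair per row."""
    edges: List[Tuple[int, int]] = []
    # Mode "rt" decompresses and decodes to text in one step.
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            edges.append((int(row[0]), int(row[1])))
    return edges
```

The resulting list of pairs can then be compared against whatever the processed BIO files contain, once those can be loaded.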

My question is simple: how can I reproduce your work on other datasets? Is there a link where you already explain this? Or, even more simply, it would be nice if you could explain what modifications I should make to the OGB benchmark datasets to adapt them to your data loading code.

Thanks for the work; I look forward to your suggestions.

NB:

@sugarlemons

I faced the same issue, and it got me into trouble. I wish the authors would give a solution.

@huangzizheng01

Got the same issue :(


heamina commented Dec 18, 2023

I ran into the same problem. I hope the authors can help, please 🫡

@PanPapag

Same here 🙁
I would really appreciate an answer from the authors.
