Python stack trace while executing datagen and run #76

Open
harisphnx opened this issue Sep 27, 2024 · 0 comments

I get the following stack trace while executing datagen, but after that datagen continues normally. It does not finish though. I left it running overnight, and by morning it had finished.

[INFO] 2024-09-26T23:13:56.862045 Starting data generation [/root/storage/dlio_benchmark/dlio_benchmark/main.py:157]
[INFO] 2024-09-26T23:13:56.862436 Generating dataset in unet3d_data/train and unet3d_data/valid [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:77]
[INFO] 2024-09-26T23:13:56.862501 Number of files for training dataset: 7000 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2024-09-26T23:13:56.862548 Number of files for validation dataset: 0 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:79]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0001_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0003_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0007_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
    self.data_generator.generate()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
    np.savez(out_path_spec, x=records, y=record_labels)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
    _savez(file, args, kwds, False)
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0005_of_7000.npz'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[INFO] 2024-09-26T23:13:57.200507 Generating NPZ Data: [>------------------------------------------------------------] 0.0% 1 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:57.675216 Generating NPZ Data: [>------------------------------------------------------------] 0.1% 9 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.043502 Generating NPZ Data: [>------------------------------------------------------------] 0.2% 17 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.625244 Generating NPZ Data: [>------------------------------------------------------------] 0.4% 25 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:59.369874 Generating NPZ Data: [>------------------------------------------------------------] 0.5% 33 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:00.568655 Generating NPZ Data: [>------------------------------------------------------------] 0.6% 41 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:02.320880 Generating NPZ Data: [>------------------------------------------------------------] 0.7% 49 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:03.230448 Generating NPZ Data: [>------------------------------------------------------------] 0.8% 57 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:04.555247 Generating NPZ Data: [=>-----------------------------------------------------------] 0.9% 65 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:06.220732 Generating NPZ Data: [=>-----------------------------------------------------------] 1.0% 73 of 7000  [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
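
Every traceback above ends in `FileNotFoundError` on a file under `./unet3d_data/train/`, raised when `np.savez` tries to open the output path. One possible reading (an assumption, not confirmed in this thread) is that some processes reach the write before the `train` directory exists. A minimal sketch of a guard for that case follows; `ensure_parent_dir` is a hypothetical helper and not part of dlio_benchmark, and the demo writes a placeholder file instead of calling `np.savez` so it runs with the standard library only.

```python
import os
import tempfile


def ensure_parent_dir(out_path: str) -> str:
    """Create the parent directory of out_path if it is missing.

    Hypothetical helper, not a dlio_benchmark function. exist_ok=True makes
    the call safe when several processes attempt it at the same time.
    """
    parent = os.path.dirname(out_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return out_path


# Demo: the nested directories do not exist yet, so a plain open() would
# fail with FileNotFoundError, as in the traceback above.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "unet3d_data", "train", "img_0001_of_7000.npz")
    with open(ensure_parent_dir(target), "wb") as fh:
        fh.write(b"placeholder")  # stand-in for np.savez(target, x=..., y=...)
    demo_ok = os.path.isfile(target)
```

In the generator, the equivalent would be to wrap the path passed to `np.savez`, for example `np.savez(ensure_parent_dir(out_path_spec), x=records, y=record_labels)`. Whether this addresses the actual cause here is untested.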

Later, when I execute run, it fails.

[INFO] 2024-09-27T08:00:05.546065 Profiling DLIO /root/storage/resultsdir/trace-0-of-2.pfw [/root/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-09-27T08:00:05.546386 Running DLIO with 2 process(es) [/root/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [9.694530487060547, 9.694538116455078] GB memory [/root/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 179, in initialize
    filenames = self.storage.walk_node(os.path.join(self.args.data_folder, f"{dataset_type}"))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/storage/file_storage.py", line 75, in walk_node
    return os.listdir(self.get_uri(id))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
    benchmark.initialize()
  File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
    x = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 203, in initialize
    raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
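
The run fails first on a missing `./unet3d_data/train` directory and then on "Not enough training dataset is found". A quick way to see what the run step will find, before starting it, is to count the generated files from the same working directory. The sketch below is only an illustration: `count_dataset_files` is a hypothetical helper, and the `.npz` suffix and `train` subfolder are taken from the paths in the logs above.

```python
import os
import tempfile


def count_dataset_files(data_folder: str, dataset_type: str = "train",
                        suffix: str = ".npz") -> int:
    """Count files with the given suffix in data_folder/dataset_type.

    Hypothetical helper. Returns 0 when the directory is missing instead of
    raising FileNotFoundError.
    """
    path = os.path.join(data_folder, dataset_type)
    if not os.path.isdir(path):
        return 0
    return sum(1 for name in os.listdir(path) if name.endswith(suffix))


# Demo on a temporary folder: a missing directory counts as zero files.
with tempfile.TemporaryDirectory() as root:
    missing_count = count_dataset_files(root)
    os.makedirs(os.path.join(root, "train"))
    for i in range(3):
        open(os.path.join(root, "train", f"img_{i}.npz"), "wb").close()
    found_count = count_dataset_files(root)
```

Calling `count_dataset_files("unet3d_data")` from the directory where run is launched and comparing the result with `num_files_train` (7000 here) would show whether the data landed where the run step looks for it; the relative `data_folder` means the result depends on the current working directory.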