I get the following stack trace while executing datagen, but after that datagen continues normally. It does not finish quickly, though: I left it running overnight, and by morning it had finished.
[INFO] 2024-09-26T23:13:56.862045 Starting data generation [/root/storage/dlio_benchmark/dlio_benchmark/main.py:157]
[INFO] 2024-09-26T23:13:56.862436 Generating dataset in unet3d_data/train and unet3d_data/valid [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:77]
[INFO] 2024-09-26T23:13:56.862501 Number of files for training dataset: 7000 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:78]
[INFO] 2024-09-26T23:13:56.862548 Number of files for validation dataset: 0 [/root/storage/dlio_benchmark/dlio_benchmark/data_generator/data_generator.py:79]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0001_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0003_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0007_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=True', '++workload.workflow.train=False', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 158, in initialize
self.data_generator.generate()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/data_generator/npz_generator.py", line 55, in generate
np.savez(out_path_spec, x=records, y=record_labels)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 639, in savez
_savez(file, args, kwds, False)
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 736, in _savez
zipf = zipfile_factory(file, mode="w", compression=compression)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
return zipfile.ZipFile(file, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/zipfile.py", line 1284, in __init__
self.fp = io.open(file, filemode)
^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train/img_0005_of_7000.npz'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[INFO] 2024-09-26T23:13:57.200507 Generating NPZ Data: [>------------------------------------------------------------] 0.0% 1 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:57.675216 Generating NPZ Data: [>------------------------------------------------------------] 0.1% 9 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.043502 Generating NPZ Data: [>------------------------------------------------------------] 0.2% 17 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:58.625244 Generating NPZ Data: [>------------------------------------------------------------] 0.4% 25 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:13:59.369874 Generating NPZ Data: [>------------------------------------------------------------] 0.5% 33 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:00.568655 Generating NPZ Data: [>------------------------------------------------------------] 0.6% 41 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:02.320880 Generating NPZ Data: [>------------------------------------------------------------] 0.7% 49 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:03.230448 Generating NPZ Data: [>------------------------------------------------------------] 0.8% 57 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:04.555247 Generating NPZ Data: [=>-----------------------------------------------------------] 0.9% 65 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
[INFO] 2024-09-26T23:14:06.220732 Generating NPZ Data: [=>-----------------------------------------------------------] 1.0% 73 of 7000 [/root/storage/dlio_benchmark/dlio_benchmark/utils/utility.py:235]
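For reference, the failing call in each trace is the plain file open inside `zipfile.ZipFile`, which raises `FileNotFoundError` when the parent directory of the target file does not exist. One possible reading (an assumption, the logs do not confirm it) is that some processes tried to write before `./unet3d_data/train` had been created. Below is a minimal stdlib-only sketch of that failure and of the usual guard; `open_archive_for_write` is a hypothetical helper for illustration, not part of DLIO or NumPy.

```python
import os
import tempfile
import zipfile


def open_archive_for_write(path):
    """Hypothetical helper: create the parent directory, then open the archive."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True keeps this safe if several processes create it at once.
        os.makedirs(parent, exist_ok=True)
    return zipfile.ZipFile(path, mode="w")


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "unet3d_data", "train", "img_0001_of_7000.npz")

    # Parent directory is missing: the open fails the same way as in the trace.
    try:
        zipfile.ZipFile(target, mode="w").close()
        failed_without_dir = False
    except FileNotFoundError:
        failed_without_dir = True

    # Parent directory is created first: the same open succeeds.
    open_archive_for_write(target).close()
    created = os.path.exists(target)

print(failed_without_dir, created)
```

If this is what happened, the files named in the four traces (`img_0001`, `img_0003`, `img_0005`, `img_0007`) may never have been written, even though the progress log kept counting.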
Later, when I execute run, it fails:
[INFO] 2024-09-27T08:00:05.546065 Profiling DLIO /root/storage/resultsdir/trace-0-of-2.pfw [/root/storage/dlio_benchmark/dlio_benchmark/utils/config.py:189]
[INFO] 2024-09-27T08:00:05.546386 Running DLIO with 2 process(es) [/root/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [9.694530487060547, 9.694538116455078] GB memory [/root/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 179, in initialize
filenames = self.storage.walk_node(os.path.join(self.args.data_folder, f"{dataset_type}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/storage/file_storage.py", line 75, in walk_node
return os.listdir(self.get_uri(id))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './unet3d_data/train'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['workload=unet3d_h100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=7000', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 396, in main
benchmark.initialize()
File "/usr/local/lib/python3.11/dist-packages/dlio_profiler/logger.py", line 184, in wrapper
x = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/storage/dlio_benchmark/dlio_benchmark/main.py", line 203, in initialize
raise Exception(
Exception: Not enough training dataset is found; Please run the code with ++workload.workflow.generate_data=True
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
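Both run errors refer to the relative path `./unet3d_data/train`, which is resolved against the current working directory of the process. A quick check I could run before the benchmark is sketched below; it prints the absolute path that a relative data folder resolves to and counts the `.npz` files in it. The function `count_dataset_files` is a hypothetical helper, not part of DLIO, and the guess that the working directory differs between datagen and run is only an assumption.

```python
import os
import tempfile


def count_dataset_files(data_folder, dataset_type="train", suffix=".npz"):
    """Hypothetical helper: return (absolute_path, file_count).

    file_count is None when the directory does not exist.
    """
    path = os.path.abspath(os.path.join(data_folder, dataset_type))
    if not os.path.isdir(path):
        return path, None
    count = sum(1 for name in os.listdir(path) if name.endswith(suffix))
    return path, count


# Demonstration on a throwaway directory with three empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, "unet3d_data", "train")
    os.makedirs(train_dir)
    for i in range(1, 4):
        open(os.path.join(train_dir, f"img_{i:04d}_of_7000.npz"), "wb").close()

    found_path, found_count = count_dataset_files(os.path.join(root, "unet3d_data"))
    missing_path, missing_count = count_dataset_files(os.path.join(root, "no_such_folder"))

print(found_count, missing_count)
```

Calling `count_dataset_files("unet3d_data")` from the directory where run is launched would show whether the folder is visible there and whether all 7000 files are present.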