All-classes gen4 dataset #4
Thank you for taking the time out of your busy schedule to reply to me. Your response has been very helpful, and I now know how to generate a dataset with labels for all categories. I am currently downloading the raw gen4 data, but I am concerned about how long the generation will take. Could you tell me roughly how long one full generation of the gen4 data takes (about one week, or one month?)? I am using a server with 4x RTX 3090 GPUs. Also, what values should the parameters below be set to:
For the preprocessing you don't need GPUs, only a CPU. It mostly depends on your hardware: how many parallel threads your CPU allows and how fast your memory is. If I remember correctly, it took less than 1 hour on my machine using
Thank you again for your help. It's truly appreciated, and I'm envious of your machine! If my server can complete the data generation in two or three days, that would be great news for me.
Ok, that would be quite slow ;). I suggest you convert a small subset and check that the dataset matches your expectations before pre-processing the full dataset. If you realize it is too slow, let me know and we may find another solution. In the meantime I am closing this issue.
Okay, thank you again for your enthusiastic help. I wish you all the best in your research!
thanks :)
Hi magehrig, I'm back again. I followed your advice and tested on a subset of the data, but the speed is still very slow.
As you can see, generation on the above subset of gen4 took around five hours with -np 10. At first I thought the slow speed was due to my use of a mechanical hard drive, but after migrating the data to an SSD (Samsung 870), the speed did not improve. As for the CPU, the machine I am using is also new, so I don't think the CPU is the problem:
I suspect that the reason for the slow speed is that, at the beginning, I converted the data from dat format to h5 format following the method from issue #2:

```python
import os
import sys
sys.path.append(os.getcwd())

from pathlib import Path

import h5py

import scripts.genx.tools.psee_loader as psee_loader

if __name__ == '__main__':
    path = '/data/gen4_test'
    input_dir = Path(path)
    train_path = input_dir / 'train'
    val_path = input_dir / 'val'
    test_path = input_dir / 'test'
    for split in [train_path, val_path, test_path]:
        for npy_file in split.iterdir():
            if npy_file.suffix != '.npy':
                continue
            # Derive the .dat event file path from the bbox .npy file name.
            dat_path = npy_file.parent / (npy_file.stem.split('bbox')[0] + "td.dat")
            dat = psee_loader.PSEELoader(str(dat_path))
            # Load all events at once.
            eve = dat.load_n_events(dat._ev_count)
            h5_file_path = str(split / dat_path.stem.split('.dat')[0]) + '.h5'
            h5 = h5py.File(h5_file_path, 'w')
            h5.create_dataset('events', data=eve)
            h5.close()
            dat_path.unlink()
            print('finished creating ' + h5_file_path)
```

Do you happen to know what might be the reason for this? :)
Yes, the problem is that you are creating this h5 dataset without chunking. This means the whole data is written as a single contiguous block. As a result, during reading, h5 reads out the whole dataset, which happens very often in a for loop here: RVT/scripts/genx/preprocess_dataset.py, line 515 (commit 4250ee4).
To fix this, you need to write the h5 dataset in chunks. A simple way would be to write all the data at once with create_dataset in your code but specify that you want to use chunking. Another (typically faster) approach is to write the data incrementally with chunking. If you want to do this, you can take some inspiration from my code here, which does it for the final dataset: RVT/scripts/genx/preprocess_dataset.py, line 72 (commit 4250ee4).
Read more about this here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
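To make the two options above concrete, here is a minimal sketch (not the repository's code) of writing a 1-D event array either all at once with chunking, or incrementally into a resizable, chunked dataset. The chunk length of 2**16 events and the gzip compression are illustrative assumptions only; the repository itself uses blosc compression via its _blosc_opts helper.

```python
import h5py
import numpy as np

CHUNK_LEN = 2**16  # illustrative chunk length (events per chunk), not a value from the repo

def write_all_at_once(h5_path: str, events: np.ndarray) -> None:
    # Option 1: a single create_dataset call, but with chunked (and compressed) storage.
    with h5py.File(h5_path, 'w') as h5f:
        h5f.create_dataset('events', data=events,
                           chunks=(min(CHUNK_LEN, events.shape[0]),),
                           compression='gzip', compression_opts=1)

def write_incrementally(h5_path: str, event_batches) -> None:
    # Option 2: append batches to a resizable, chunked dataset
    # (similar in spirit to the repository's H5Writer, but simplified).
    with h5py.File(h5_path, 'w') as h5f:
        ds = None
        for batch in event_batches:
            if ds is None:
                ds = h5f.create_dataset('events', data=batch,
                                        maxshape=(None,), chunks=(CHUNK_LEN,),
                                        compression='gzip', compression_opts=1)
            else:
                start = ds.shape[0]
                ds.resize(start + batch.shape[0], axis=0)
                ds[start:] = batch
```

With chunked storage, reading a slice only pulls in the chunks that overlap the requested range, which is what keeps the windowed reads in the preprocessing loop fast.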
Aha, so that's it. Thank you for your clear guidance. I will try the two methods you suggested and report back. Thanks again!!
Hi magehrig, I'm facing difficulties in solving this problem. I've made some changes to the code:

```python
for split in [train_path, val_path, test_path]:
    for npy_file in split.iterdir():
        if npy_file.suffix != '.npy':
            continue
        dat_path = npy_file.parent / (npy_file.stem.split('bbox')[0] + "td.dat")
        dat = psee_loader.PSEELoader(str(dat_path))
        eve = dat.load_n_events(dat._ev_count)
        h5_file_path = str(split / dat_path.stem.split('.dat')[0]) + '.h5'
        h5 = h5py.File(h5_file_path, 'w')
        # Now with chunking enabled and blosc compression as in the repository.
        h5.create_dataset('events', data=eve, chunks=True,
                          **_blosc_opts(complevel=1, shuffle='byte'))
        h5.close()
        dat_path.unlink()
        print('finished creating ' + h5_file_path)
```

However, these modifications don't seem to help: it still takes a long time to generate frames. When I change chunks=True to chunks=(1,), it significantly increases the time required to convert dat to h5. I carefully reviewed your code for H5Writer and noticed that the chunk shape is set to [1, 20, 360, 720], but I'm unsure how to determine the appropriate chunk parameters when converting a dat file to an h5 file. Currently, I'm completely stuck and would greatly appreciate your guidance and assistance in resolving this problem. Could you please help me with the necessary modifications or provide a basic set of conversion code?
Ok, first we need to figure out whether you can use the h5 data I provide to preprocess the dataset reasonably fast. If that works, it means there is room for optimization in how you create your own h5 dataset. I suggest downloading the 1 Mpx validation or test set h5 files and running the preprocessing scripts to check how fast they run. Assuming that this indeed runs reasonably fast, you can go on and optimize your code. First, I am not sure if your code is compatible with the preprocessing script, because the preprocessing script accesses x, y, t, p individually: RVT/scripts/genx/preprocess_dataset.py, line 174 (commit 5774d5c).

To fix this, you can do the following:

```python
shape = (2**16,)
h5f.create_dataset('events/x', shape=shape, dtype='u2', chunks=shape, **_blosc_opts(complevel=1, shuffle='byte'))
h5f.create_dataset('events/y', shape=shape, dtype='u2', chunks=shape, **_blosc_opts(complevel=1, shuffle='byte'))
h5f.create_dataset('events/p', shape=shape, dtype='u1', chunks=shape, **_blosc_opts(complevel=1, shuffle='byte'))
h5f.create_dataset('events/t', shape=shape, dtype='i8', chunks=shape, **_blosc_opts(complevel=1, shuffle='byte'))
```

Setting chunks=True, as you already did, should however also work reasonably well. If I understood you correctly, the time to generate the h5 files is not the issue; rather, running the pre-processing script on your generated h5 files is slow. Can you confirm this?
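For completeness, here is a minimal sketch of how the per-field layout above might be filled from a full event array loaded with PSEELoader. The field names 'x', 'y', 'p', 't', the chunk length, and the gzip compression are assumptions for illustration; they are not the repository's exact settings, which use blosc compression via _blosc_opts.

```python
import h5py
import numpy as np

CHUNK_LEN = 2**16  # illustrative chunk length along the event axis

def convert_events_to_h5(events: np.ndarray, h5_path: str) -> None:
    """Write a structured event array into per-field, chunked h5 datasets.

    Assumes the structured array has fields 'x', 'y', 'p', 't' (Prophesee dat layout).
    """
    n = events.shape[0]
    chunks = (min(CHUNK_LEN, n),)  # chunks must not exceed the dataset size
    with h5py.File(h5_path, 'w') as h5f:
        h5f.create_dataset('events/x', data=events['x'].astype('u2'),
                           chunks=chunks, compression='gzip', compression_opts=1)
        h5f.create_dataset('events/y', data=events['y'].astype('u2'),
                           chunks=chunks, compression='gzip', compression_opts=1)
        h5f.create_dataset('events/p', data=events['p'].astype('u1'),
                           chunks=chunks, compression='gzip', compression_opts=1)
        h5f.create_dataset('events/t', data=events['t'].astype('i8'),
                           chunks=chunks, compression='gzip', compression_opts=1)
```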
Sorry, I didn't notice before that you already provide the original events as h5 files. In that case, converting dat to h5 is not so important to me; I will directly download the h5 files you provide. I hope the provided h5 files can be used to build frames at a normal speed. If so, my problem will be completely solved!
Sure, the provided h5 files contain all the events and labels of the original dataset, just in a more convenient format. I am a bit confused, since I thought you wanted to convert the h5 files yourself according to the thread in the other issue (#2). I suppose the easiest way forward is just to use the existing h5 files.
I didn't expect that you would provide the whole set of raw data; that is a great help to us. Next, I will directly download the h5 data you provide and then build frames to test whether the speed is normal!
Oh my gosh, this is so fast. Far faster than my previous frame building! This saved me a lot of time. Now my problem has been completely solved. Thank you again for your help; I hope my noisy and tedious questions did not bother you too much. You are right, there is really no need for me to convert dat to h5 myself! I have now deleted all the dat files and replaced them with the h5 files you provided :)
Hi Magehrig,
I would like to express my gratitude for your outstanding work and generosity in sharing your knowledge. I was wondering if you could provide us with an all-classes gen4 dataset for more comprehensive testing purposes and applications in other domains. Although I have attempted to generate the dataset myself, the process is quite time-consuming.