[BUG] Training wall time is abnormally long when sets contain many systems #2229
Comments
Here are the training time reports using the debugging code from @iProzd. Data load time during training matters a lot. Note that for set 1), containing ~50000 systems:
For set 2), containing ~17 systems:
If I/O matters, could you try the HDF5 format?
The issue is that the time recorded by deepmd-kit does not cover all the important parts of a training step...
This new result used modified code from my environment, which covers all the time spent in a training step.
It seems that when the sets contain many systems, I/O takes up a lot of the time...
What is the set size of the systems? Do you have a significantly smaller set size in your case (1) than in (2)?
Yes, as described in the summary, the average set sizes are 1~2 for 1) and ~5000 for 2).
When a set is completely used, deepmd-kit will load a new set from disk, even in the case that there is only one set in the system (deepmd-kit/deepmd/utils/data.py, lines 236 to 237 in 89d0d23).
I guess this introduces the overhead in data loading. @iProzd This overhead should also happen in the memory-friendly data model, shouldn't it?
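As an aside (not from the thread), here is a small illustrative sketch of why many tiny sets are slower to consume than one large set when each set is reloaded from disk; all file names and sizes below are synthetic and only for illustration:

```python
# Illustrative sketch only: loading many tiny .npy sets pays a per-file
# open/parse overhead, while one large set amortizes it. Files are synthetic.
import os
import tempfile
import timeit

import numpy as np

tmp = tempfile.mkdtemp()
big = os.path.join(tmp, "big.npy")
np.save(big, np.random.rand(5000, 9))            # one set of ~5000 frames
small_paths = []
for i in range(100):
    p = os.path.join(tmp, f"small_{i}.npy")
    np.save(p, np.random.rand(2, 9))             # many sets of ~2 frames each
    small_paths.append(p)

t_big = timeit.timeit(lambda: np.load(big), number=100)
t_small = timeit.timeit(lambda: [np.load(p) for p in small_paths], number=100)
print(f"one ~5000-frame set : {t_big:.3f} s for 100 loads")
print(f"100 tiny sets       : {t_small:.3f} s for 100 loads")
```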
Do you run on a supercomputer?
It's on a cloud; the data and working_dir were sent together, so they are thought to be on the same GPU machine.
Could you benchmark the performance of reading a file on your machine? `python -m timeit 'f=open("coord.npy");f.read();f.close()'`
npy can't be .read().
For npy without read/close:
For raw without read/close:
For raw with read and close:
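For reference, a binary-safe way to time .npy reads (a sketch, not from the thread; a small coord.npy is generated here only so the snippet runs standalone):

```python
# Sketch: time a raw binary read and numpy.load of a .npy file.
# The generated coord.npy is a stand-in for the real data file.
import timeit

import numpy as np

np.save("coord.npy", np.random.rand(100, 9))

raw_read = timeit.timeit(
    "f = open('coord.npy', 'rb'); f.read(); f.close()", number=1000
)
npy_load = timeit.timeit(
    "np.load('coord.npy')", setup="import numpy as np", number=1000
)
print(f"raw binary read: {raw_read / 1000 * 1e6:.1f} us per call")
print(f"np.load        : {npy_load / 1000 * 1e6:.1f} us per call")
```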
It seems that I/O is not too expensive... (I got 200 µs on a supercomputer.) We may need to do profiling.
Update: please ignore the previous result (it was measured on a debugging machine instead of a production machine).
This does make sense.

```python
system = dpdata.LabeledSystem("A/B/data1", fmt="deepmd/npy")
system.to_deepmd_hdf5("data.hdf5#/A/B/data1")
```

I see you are using DP-GEN. I plan to support the HDF5 format in DP-GEN (deepmodeling/dpgen#617) but haven't done it yet.
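For many systems, the same conversion can be looped; a minimal sketch, assuming the deepmd/npy systems live under a hypothetical data/ directory (the #/... suffix addresses a group inside the HDF5 file, as in the example above):

```python
# Sketch: batch-convert deepmd/npy systems under data/ into one HDF5 file.
# The data/ layout is assumed for illustration.
import glob

import dpdata

for sys_dir in sorted(glob.glob("data/*")):
    system = dpdata.LabeledSystem(sys_dir, fmt="deepmd/npy")
    system.to_deepmd_hdf5(f"data.hdf5#/{sys_dir}")
```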
Thank you for your help and suggestions, happy Year of the Rabbit~
Could you try deepmodeling/dpgen#1119?
It takes 5 minutes to convert 80000 frames in 50000 systems into a 275 MB HDF5 file. The effect on training time is still being tested.
The HDF5 file seems to worsen the total-time problem. Original npy in 50000 systems, with data statistics before training not skipped:
HDF5, with data statistics before training not skipped:
Please hold the issue, @wanghan-iapcm; the HDF5 test with the data load time for each training batch will be updated later.
HDF5, with the data load time being printed (this version of the code also prints the prolonged training time):
Do you use the same machine to compare them? I don't see why one has more training time with the same data.
Can you submit a PR for this code?
It's the same cloud machine type.
It's from @iProzd.
You may run a short training with cProfile and provide the detailed profiling files in two cases: (1) on the slow machine; (2) on a faster machine. Then we can compare the difference between the two.
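For example, the training could be launched under cProfile (e.g. `python -m cProfile -o train.prof $(which dp) train input.json`; the file names here are assumptions), and the resulting dump inspected with pstats:

```python
# Sketch: summarize a cProfile dump (train.prof is an assumed file name),
# sorted by cumulative time to see where the training loop spends its time.
import pstats

stats = pstats.Stats("train.prof")
stats.sort_stats("cumulative").print_stats(30)
```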
Fix #2229. Train models and prefetch data in parallel to decouple the time when data is produced from the time when data is consumed.
Signed-off-by: Jinzhe Zeng <[email protected]>
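For context, an illustrative sketch of the prefetching idea described in that fix (not the actual PR code): a background thread keeps a small queue of batches filled while the main thread trains on them.

```python
# Illustrative sketch of producer/consumer prefetching, not the PR implementation.
import queue
import threading


def prefetch(batch_iter, maxsize=4):
    """Yield batches from batch_iter while a background thread loads ahead."""
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def worker():
        for batch in batch_iter:
            q.put(batch)          # blocks when the queue is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch


# Usage sketch: wrap any batch generator.
# for batch in prefetch(my_batch_generator()):
#     train_step(batch)
```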
Bug summary
Summary
- effectively the same sets (~80000 frames)
- the same other params in the input
- single GPU
- 1) ~80000 frames in ~50000 systems: the task takes 52 hours
- 2) ~80000 frames in ~17 systems: the task takes 18 hours (`type_mixed` is used to collect the data)
DeePMD-kit Version
DeePMD-kit v2.1.5
TensorFlow Version
2.9.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Discussed with @iProzd previously, and the sets of 1) have been sent to @iProzd.
It was said that I/O should not influence the training time after the data statistics.
~4 hours pass before the training actually starts (data statistics finish and lcurve.out starts to be written).
The "training time" in the logs of both cases is effectively the same; note that disp_freq is 100 times larger for 1).
training time for 1): train_origin.log
training time for 2): train_typeSel.log
Steps to Reproduce
dp train
Further Information, Files, and Links
No response