[BUG] Training wall time is abnormally long when sets contain many systems #2229

Closed
Vibsteamer opened this issue Jan 9, 2023 · 25 comments · Fixed by #2264 or #2534

@Vibsteamer (Contributor)

Bug summary

Summary
Effectively the same data sets (~80000 frames)
the same other parameters in the input
a single GPU

1) ~80000 frames in ~50000 systems: the task takes 52 hours
2) ~80000 frames in ~17 systems: the task takes 18 hours (type_mixed is used to collect the data)

DeePMD-kit Version

DeePMD-kit v2.1.5

TensorFlow Version

2.9.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

This was discussed with @iProzd previously, and the data sets of 1) have been sent to him.
It was said that I/O should not influence the training time after the data statistics step.

It takes ~4 hours before the training actually starts (data statistics; then lcurve.out starts to be written).
The "training time" reported in the logs of both cases is effectively the same; note that disp_freq is 100 times larger for 1).

training time for 1)
train_origin.log

...
DEEPMD INFO    batch 7800000 training time 1580.50 s, testing time 0.00 s
DEEPMD INFO    batch 8000000 training time 1569.11 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 188106.747 s

training time for 2)
train_typeSel.log

...
DEEPMD INFO    batch 7998000 training time 15.41 s, testing time 0.00 s
DEEPMD INFO    batch 8000000 training time 15.60 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 65437.235 s

Steps to Reproduce

dp train

Further Information, Files, and Links

No response

@Vibsteamer Vibsteamer added the bug label Jan 9, 2023
@Vibsteamer Vibsteamer changed the title [BUG] Training wall time is abnormally long when sets contain many frames [BUG] Training wall time is abnormally long when sets contain many systems Jan 10, 2023
@Vibsteamer (Contributor, Author) commented Jan 13, 2023

Here are the training time reports using the debugging code from @iProzd.

The data load time during training matters a lot, and this debugging code gives a longer training time for 2) (~16 s vs ~24 s per 2000 batches, 18 h vs 27 h to finish the task).

Note that disp_freq for 1) is 100 times larger than that for 2).

For set 1), containing ~50000 systems:

DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 40000, decay_rate 0.944061, final lr will be 1.00e-08
DEEPMD INFO    start train time: 0.3218533992767334 s
DEEPMD INFO    batch  200000 training time 2173.19 s, testing time 0.01 s, data load time: 2826.80 s
...
DEEPMD INFO    batch  400000 training time 2177.85 s, testing time 0.01 s, data load time: 2817.95 s

For set 2), containing ~17 systems:

DEEPMD INFO    batch    2000 training time 24.62 s, testing time 0.01 s, data load time: 2.21 s
DEEPMD INFO    batch    4000 training time 23.85 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO    batch    6000 training time 23.94 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO    batch    8000 training time 23.90 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO    batch   10000 training time 23.91 s, testing time 0.01 s, data load time: 0.26 s
...
DEEPMD INFO    finished training
DEEPMD INFO    wall time: 99125.249 s

@njzjz (Member) commented Jan 13, 2023

If I/O matters, could you try the HDF5 format?

@wanghan-iapcm (Collaborator)

If I/O matters, could you try the HDF5 format?

The issue is that the time recorded by DeePMD-kit does not cover all the important parts of a training step...

@iProzd (Collaborator) commented Jan 14, 2023

If I/O matters, could you try the HDF5 format?

The issue is that the time recorded by DeePMD-kit does not cover all the important parts of a training step...

The new result above was produced with modified code from my environment, which already covers all the time spent in a training step.
"data load time" is the newly added part, reporting the I/O time spent before training in each step.

@iProzd (Collaborator) commented Jan 14, 2023

It seems that when the sets contain many systems, I/O takes up a lot of time...
Could you try my recently modified prefetching schedule again on this complex I/O scheme?

@wanghan-iapcm (Collaborator)

Here are the training time reports using the debugging code from @iProzd. […]

What is the set size of the systems? Do you have a significantly smaller set size in case (1) than in (2)?

@Vibsteamer (Contributor, Author)

What is the set size of the systems? Do you have a significantly smaller set size in case (1) than in (2)?

Yes, as described in the summary, the average set size is 1~2 frames for 1) and ~5000 frames for 2):

1) ~80000 frames in ~50000 systems: the task takes 52 hours
2) ~80000 frames in ~17 systems: the task takes 18 hours (type_mixed is used to collect the data)

@wanghan-iapcm (Collaborator)

When a set is completely used, DeePMD-kit will load a new set from disk, even when there is only one set in the system:

if self.iterator + batch_size > set_size:
    self._load_batch_set(self.train_dirs[self.set_count % self.get_numb_set()])

I guess this introduces the overhead in data loading.

@iProzd This overhead should also happen in the data memory-friendly model, shouldn't it?
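A back-of-the-envelope illustration of why this would hurt case 1) so much (a hedged sketch, not DeePMD-kit code; the set sizes are the approximate numbers from this thread):

# Rough estimate of how often the reload branch above fires for the two data layouts.
def reloads_per_step(set_size: int, batch_size: int = 1) -> float:
    # `iterator + batch_size > set_size` triggers a disk reload roughly once per
    # pass over the set, i.e. about once every set_size // batch_size steps.
    return 1.0 / max(set_size // batch_size, 1)

for label, set_size in [("case 1, ~1-2 frames per set", 2), ("case 2, ~5000 frames per set", 5000)]:
    print(f"{label}: ~{reloads_per_step(set_size):.4f} disk reloads per training step")

# Case 1 hits the disk on nearly every step, while case 2 does so only once every
# few thousand steps, which matches the large "data load time" reported above.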

@njzjz (Member) commented Jan 15, 2023

Do you run on a supercomputer?

@Vibsteamer (Contributor, Author)

Do you run on a supercomputer?

It's on a cloud platform; the data and working_dir were sent together, so they should be on the same GPU machine.

@njzjz (Member) commented Jan 15, 2023

Could you benchmark the performance of reading a file on your machine?

python -m timeit 'f=open("coord.npy");f.read();f.close()'

@Vibsteamer (Contributor, Author)

Could you benchmark the performance of reading a file on your machine?

python -m timeit 'f=open("coord.npy");f.read();f.close()'

.npy files can't be .read() in text mode:

Traceback (most recent call last):
  File "<timeit-src>", line 6, in inner
    f=open("energy.npy");f.read();f.close()
  File "/opt/deepmd-kit-2.1.5/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

For .npy without read/close:

python -m timeit 'f=open("coord.npy")'
20000 loops, best of 5: 14.5 usec per loop

For .raw without read/close:

python -m timeit 'f=open("coord.raw")'
20000 loops, best of 5: 14.3 usec per loop

For .raw with read and close:

python -m timeit 'f=open("coord.raw");f.read();f.close()'
20000 loops, best of 5: 17.7 usec per loop
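As an aside, the binary .npy files can still be benchmarked by opening them in binary mode, or by timing np.load directly, which is closer to what the data loader actually does (a small sketch based on the commands above; run it in a directory containing coord.npy):

import timeit
import numpy as np

# Time a raw binary read and a full np.load of the same coord.npy file.
t_read = timeit.timeit('open("coord.npy", "rb").read()', number=2000)
t_load = timeit.timeit('np.load("coord.npy")', globals={"np": np}, number=2000)
print(f"binary read: {t_read / 2000 * 1e6:.1f} us/loop, np.load: {t_load / 2000 * 1e6:.1f} us/loop")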

@njzjz (Member) commented Jan 15, 2023

It seems that I/O is not too expensive... (I got 200us on a supercomputer)

We may need to do profiling.

@Vibsteamer (Contributor, Author) commented Jan 17, 2023

It seems that I/O is not too expensive... (I got 200us on a supercomputer)

We may need to do profiling.

An update:
the single-frame coord.raw read benchmark is worse and varies from 300 µs to 800 µs on different machines (several invocations of the same cloud machine type).

Please ignore the previous result (it was measured on a debugging machine rather than a production machine):

20000 loops, best of 5: 17.7 usec per loop

@njzjz (Member) commented Jan 17, 2023

This does make sense.
You can consider converting all data into a single HDF5 file, which is opened only once.

system = dpdata.LabeledSystem("A/B/data1", fmt="deepmd/npy")
system.to_deepmd_hdf5("data.hdf5#/A/B/data1")

I see you are using DP-GEN. I plan to support the HDF5 format in the DP-GEN (deepmodeling/dpgen#617) but haven't done it yet.
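For a data set split across many directories, a loop like the one below could do the conversion (a sketch based on the two lines above; the data/* layout is a placeholder, and it assumes successive to_deepmd_hdf5 calls with different #/ groups accumulate in the same file):

import glob
import dpdata

# Pack every deepmd/npy system found under ./data/ into a single data.hdf5,
# one HDF5 group per original system directory.
for sys_dir in sorted(glob.glob("data/*")):
    system = dpdata.LabeledSystem(sys_dir, fmt="deepmd/npy")
    system.to_deepmd_hdf5(f"data.hdf5#/{sys_dir}")

# The "systems" entries in input.json can then point at "data.hdf5#/data/..."
# paths instead of tens of thousands of separate directories.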

@Vibsteamer (Contributor, Author)

This does make sense. You can consider converting all data into a single HDF5 file, which is opened only once. […]

Thank you for your help and the suggestion;
happy to know about the coming features.
@iProzd FYI

Happy Year of the Rabbit~

@njzjz (Member) commented Jan 17, 2023

Could you try deepmodeling/dpgen#1119?

@Vibsteamer (Contributor, Author)

Could you try deepmodeling/dpgen#1119?

It takes 5 minutes to convert the 80000 frames in 50000 systems into a 275 MB HDF5 file.

The effect on training time is still being measured.

@Vibsteamer (Contributor, Author) commented Jan 19, 2023

It takes 5 minutes to convert the 80000 frames in 50000 systems into a 275 MB HDF5 file. The effect on training time is still being measured.

The HDF5 file seems to worsen the total-time problem.

Original .npy in 50000 systems, with the data statistics before training not skipped:

DEEPMD INFO    batch  398000 training time 15.25 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 8920.850 s

HDF5, with data statistics before training not skipped:

DEEPMD INFO    batch  398000 training time 18.62 s, testing time 0.02 s
...
DEEPMD INFO    wall time: 19959.018 s

Please keep the issue open, @wanghan-iapcm.

The HDF5 test with the data load time for each training batch will be posted later.

@Vibsteamer (Contributor, Author) commented Jan 19, 2023

The HDF5 file seems to worsen the total-time problem. […]

HDF5, with the data load time printed (this version of the code also reports a longer training time):

DEEPMD INFO    batch  102000 training time 20.83 s, testing time 0.00 s, data load time: 77.26 s
...
DEEPMD INFO    wall time: 20418.568 s

@njzjz (Member) commented Jan 19, 2023

Original .npy in 50000 systems: wall time: 8920.850 s
HDF5: wall time: 19959.018 s

Did you use the same machine to compare them? I don't see why one takes more training time with the same data.

@njzjz (Member) commented Jan 19, 2023

this version of the code also reports a longer training time

Can you submit a PR for this code?

@Vibsteamer (Contributor, Author)

Did you use the same machine to compare them? I don't see why one takes more training time with the same data.

It's the same cloud machine type.
If there is a difference, users would encounter the same difference in their production environment.

@Vibsteamer (Contributor, Author)

this version of the code also reports a longer training time

Can you submit a PR for this code?

It's from @iProzd.

@njzjz (Member) commented Jan 24, 2023

You may run a short training with cProfile and provide the detailed profiling files in two cases: (1) on the slow machine; (2) on a faster machine. Then we can compare the difference between the two.
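For reference, something along these lines should work, since dp is a Python entry point (a hedged sketch; the output file name and the top-30 cutoff are arbitrary):

# Run a short training under cProfile, e.g. from a shell:
#   python -m cProfile -o dp_train.prof $(which dp) train input.json
# Then inspect where the time goes, sorted by cumulative time:
import pstats

stats = pstats.Stats("dp_train.prof")
stats.sort_stats("cumulative").print_stats(30)  # show the 30 most expensive call paths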

@njzjz njzjz linked a pull request May 17, 2023 that will close this issue
wanghan-iapcm pushed a commit that referenced this issue May 19, 2023
Fix #2229.

Train models and prefetch data in parallel to decouple the time when
data is produced from the time when data is consumed.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
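In spirit, the fix is a producer/consumer pattern: a background thread keeps loading the next batch from disk while the GPU trains on the current one, so the data load time no longer adds to the wall time. A generic sketch of the idea (not the actual PR code; get_batch and n_batches are placeholders):

import queue
import threading

def prefetching_batches(get_batch, n_batches, buffer_size=4):
    """Yield training batches while a background thread keeps loading the next ones."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for _ in range(n_batches):
            q.put(get_batch())  # disk I/O happens here, off the training thread
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not done:
        yield batch  # the training step only waits if the producer falls behind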