[BUG] Training wall time is abnormally long when sets contain many systems #2229

Closed
Vibsteamer opened this issue Jan 9, 2023 · 25 comments · Fixed by #2264 or #2534

@Vibsteamer (Contributor)

Bug summary

Summary
Effectively the same data sets (~80000 frames)
the same other parameters in the input
a single GPU

1) ~80000 frames in ~50000 systems: the task takes 52 hours
2) ~80000 frames in ~17 systems: the task takes 18 hours (type_mixed is used to collect the data)

DeePMD-kit Version

DeePMD-kit v2.1.5

TensorFlow Version

2.9.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

This was discussed with @iProzd previously, and the data sets of 1) have been sent to him.
It was said that I/O should not influence the training time after the data statistics step.

It takes ~4 hours before the training actually starts (data statistics; then lcurve.out starts to be written).
The "training time" reported in the logs of both cases is effectively the same; note that disp_freq is 100 times larger for 1).

training time for 1)
train_origin.log

...
DEEPMD INFO    batch 7800000 training time 1580.50 s, testing time 0.00 s
DEEPMD INFO    batch 8000000 training time 1569.11 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 188106.747 s

training time for 2)
train_typeSel.log

...
DEEPMD INFO    batch 7998000 training time 15.41 s, testing time 0.00 s
DEEPMD INFO    batch 8000000 training time 15.60 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 65437.235 s

Steps to Reproduce

dp train

Further Information, Files, and Links

No response

@Vibsteamer Vibsteamer added the bug label Jan 9, 2023
@Vibsteamer Vibsteamer changed the title [BUG] Training wall time is abnormally long when sets contain many frames [BUG] Training wall time is abnormally long when sets contain many systems Jan 10, 2023
@Vibsteamer (Contributor, Author) commented Jan 13, 2023

Here are the training time reports using the debugging code from @iProzd.

The data load time during training matters a lot, and this debugging code gives a longer training time for 2) (~16 s vs ~24 s per 2000 batches, 18 h vs 27 h to finish the task).

Note that disp_freq for 1) is 100 times larger than that for 2).

For set 1), containing ~50000 systems:

DEEPMD INFO    initialize model from scratch
DEEPMD INFO    start training at lr 1.00e-03 (== 1.00e-03), decay_step 40000, decay_rate 0.944061, final lr will be 1.00e-08
DEEPMD INFO    start train time: 0.3218533992767334 s
DEEPMD INFO    batch  200000 training time 2173.19 s, testing time 0.01 s, data load time: 2826.80 s
...
DEEPMD INFO    batch  400000 training time 2177.85 s, testing time 0.01 s, data load time: 2817.95 s

For set 2), containing ~17 systems:

DEEPMD INFO    batch    2000 training time 24.62 s, testing time 0.01 s, data load time: 2.21 s
DEEPMD INFO    batch    4000 training time 23.85 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO    batch    6000 training time 23.94 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO    batch    8000 training time 23.90 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO    batch   10000 training time 23.91 s, testing time 0.01 s, data load time: 0.26 s
...
DEEPMD INFO    finished training
DEEPMD INFO    wall time: 99125.249 s

@njzjz (Member) commented Jan 13, 2023

If I/O matters, could you try the HDF5 format?

@wanghan-iapcm (Collaborator)

If I/O matters, could you try the HDF5 format?

The issue is that the time recorded by DeePMD-kit does not cover all the important parts of a training step...

@iProzd (Collaborator) commented Jan 14, 2023

If I/O matters, could you try the HDF5 format?

The issue is that the time recorded by DeePMD-kit does not cover all the important parts of a training step...

The new result above was produced with modified code from my environment, which already covers all the time spent in a training step.
"data load time" is the newly added part, reporting the I/O time spent before training in each step.

@iProzd (Collaborator) commented Jan 14, 2023

It seems that when the sets contain many systems, I/O takes up a lot of time...
Could you try my recently modified prefetching schedule again on this complex I/O scheme?

@wanghan-iapcm (Collaborator)

Here are the training time reports using the debugging code from @iProzd. […]

What is the set size of the systems? Do you have a significantly smaller set size in case (1) than in (2)?

@Vibsteamer (Contributor, Author)

What is the set size of the systems? Do you have a significantly smaller set size in case (1) than in (2)?

Yes, as described in the summary, the average set size is 1~2 frames for 1) and ~5000 frames for 2):

1) ~80000 frames in ~50000 systems: the task takes 52 hours
2) ~80000 frames in ~17 systems: the task takes 18 hours (type_mixed is used to collect the data)

@wanghan-iapcm (Collaborator)

When a set is completely used, DeePMD-kit will load a new set from disk, even when there is only one set in the system:

if self.iterator + batch_size > set_size:
    self._load_batch_set(self.train_dirs[self.set_count % self.get_numb_set()])

I guess this introduces the overhead in data loading.

@iProzd This overhead should also happen in the data memory-friendly model, shouldn't it?
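A back-of-the-envelope illustration of why this would hurt case 1) so much (a hedged sketch, not DeePMD-kit code; the set sizes are the approximate numbers from this thread):

# Rough estimate of how often the reload branch above fires for the two data layouts.
def reloads_per_step(set_size: int, batch_size: int = 1) -> float:
    # `iterator + batch_size > set_size` triggers a disk reload roughly once per
    # pass over the set, i.e. about once every set_size // batch_size steps.
    return 1.0 / max(set_size // batch_size, 1)

for label, set_size in [("case 1, ~1-2 frames per set", 2), ("case 2, ~5000 frames per set", 5000)]:
    print(f"{label}: ~{reloads_per_step(set_size):.4f} disk reloads per training step")

# Case 1 hits the disk on nearly every step, while case 2 does so only once every
# few thousand steps, which matches the large "data load time" reported above.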

@njzjz (Member) commented Jan 15, 2023

Do you run on a supercomputer?

@Vibsteamer (Contributor, Author)

Do you run on a supercomputer?

It's on a cloud platform; the data and working_dir were sent together, so they should be on the same GPU machine.

@njzjz (Member) commented Jan 15, 2023

Could you benchmark the performance of reading a file on your machine?

python -m timeit 'f=open("coord.npy");f.read();f.close()'

@Vibsteamer (Contributor, Author)

Could you benchmark the performance of reading a file on your machine?

python -m timeit 'f=open("coord.npy");f.read();f.close()'

.npy files can't be .read() in text mode:

Traceback (most recent call last):
  File "<timeit-src>", line 6, in inner
    f=open("energy.npy");f.read();f.close()
  File "/opt/deepmd-kit-2.1.5/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte

For .npy without read/close:

python -m timeit 'f=open("coord.npy")'
20000 loops, best of 5: 14.5 usec per loop

For .raw without read/close:

python -m timeit 'f=open("coord.raw")'
20000 loops, best of 5: 14.3 usec per loop

For .raw with read and close:

python -m timeit 'f=open("coord.raw");f.read();f.close()'
20000 loops, best of 5: 17.7 usec per loop
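As an aside, the binary .npy files can still be benchmarked by opening them in binary mode, or by timing np.load directly, which is closer to what the data loader actually does (a small sketch based on the commands above; run it in a directory containing coord.npy):

import timeit
import numpy as np

# Time a raw binary read and a full np.load of the same coord.npy file.
t_read = timeit.timeit('open("coord.npy", "rb").read()', number=2000)
t_load = timeit.timeit('np.load("coord.npy")', globals={"np": np}, number=2000)
print(f"binary read: {t_read / 2000 * 1e6:.1f} us/loop, np.load: {t_load / 2000 * 1e6:.1f} us/loop")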

@njzjz (Member) commented Jan 15, 2023

It seems that I/O is not too expensive... (I got 200us on a supercomputer)

We may need to do profiling.

@Vibsteamer (Contributor, Author) commented Jan 17, 2023

It seems that I/O is not too expensive... (I got 200us on a supercomputer)

We may need to do profiling.

An update:
the single-frame coord.raw read benchmark is worse and varies from 300 µs to 800 µs on different machines (several invocations of the same cloud machine type).

Please ignore the previous result (it was measured on a debugging machine rather than a production machine):

20000 loops, best of 5: 17.7 usec per loop

@njzjz (Member) commented Jan 17, 2023

This does make sense.
You can consider converting all data into a single HDF5 file, which is opened only once.

system = dpdata.LabeledSystem("A/B/data1", fmt="deepmd/npy")
system.to_deepmd_hdf5("data.hdf5#/A/B/data1")

I see you are using DP-GEN. I plan to support the HDF5 format in the DP-GEN (deepmodeling/dpgen#617) but haven't done it yet.
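For a data set split across many directories, a loop like the one below could do the conversion (a sketch based on the two lines above; the data/* layout is a placeholder, and it assumes successive to_deepmd_hdf5 calls with different #/ groups accumulate in the same file):

import glob
import dpdata

# Pack every deepmd/npy system found under ./data/ into a single data.hdf5,
# one HDF5 group per original system directory.
for sys_dir in sorted(glob.glob("data/*")):
    system = dpdata.LabeledSystem(sys_dir, fmt="deepmd/npy")
    system.to_deepmd_hdf5(f"data.hdf5#/{sys_dir}")

# The "systems" entries in input.json can then point at "data.hdf5#/data/..."
# paths instead of tens of thousands of separate directories.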

@Vibsteamer (Contributor, Author)

This does make sense. You can consider converting all data into a single HDF5 file, which is opened only once. […]

Thank you for your help and the suggestion;
happy to know about the coming features.
@iProzd FYI

Happy Year of the Rabbit~

@njzjz (Member) commented Jan 17, 2023

Could you try deepmodeling/dpgen#1119?

@Vibsteamer (Contributor, Author)

Could you try deepmodeling/dpgen#1119?

It takes 5 minutes to convert the 80000 frames in 50000 systems into a 275 MB HDF5 file.

The effect on training time is still being measured.

@Vibsteamer (Contributor, Author) commented Jan 19, 2023

It takes 5 minutes to convert the 80000 frames in 50000 systems into a 275 MB HDF5 file. The effect on training time is still being measured.

The HDF5 file seems to worsen the total-time problem.

Original .npy in 50000 systems, with the data statistics before training not skipped:

DEEPMD INFO    batch  398000 training time 15.25 s, testing time 0.00 s
...
DEEPMD INFO    wall time: 8920.850 s

HDF5, with data statistics before training not skipped:

DEEPMD INFO    batch  398000 training time 18.62 s, testing time 0.02 s
...
DEEPMD INFO    wall time: 19959.018 s

Please keep the issue open, @wanghan-iapcm.

The HDF5 test with the data load time for each training batch will be posted later.

@Vibsteamer (Contributor, Author) commented Jan 19, 2023

The HDF5 file seems to worsen the total-time problem. […]

HDF5, with the data load time printed (this version of the code also reports a longer training time):

DEEPMD INFO    batch  102000 training time 20.83 s, testing time 0.00 s, data load time: 77.26 s
...
DEEPMD INFO    wall time: 20418.568 s

@njzjz (Member) commented Jan 19, 2023

Original .npy in 50000 systems: wall time: 8920.850 s
HDF5: wall time: 19959.018 s

Did you use the same machine to compare them? I don't see why one takes more training time with the same data.

@njzjz (Member) commented Jan 19, 2023

this version of the code also reports a longer training time

Can you submit a PR for this code?

@Vibsteamer (Contributor, Author)

Did you use the same machine to compare them? I don't see why one takes more training time with the same data.

It's the same cloud machine type.
If there is a difference, users would encounter the same difference in their production environment.

@Vibsteamer (Contributor, Author)

this version of the code also reports a longer training time

Can you submit a PR for this code?

It's from @iProzd.

@njzjz (Member) commented Jan 24, 2023

You may run a short training with cProfile and provide the detailed profiling files in two cases: (1) on the slow machine; (2) on a faster machine. Then we can compare the difference between the two.
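For reference, something along these lines should work, since dp is a Python entry point (a hedged sketch; the output file name and the top-30 cutoff are arbitrary):

# Run a short training under cProfile, e.g. from a shell:
#   python -m cProfile -o dp_train.prof $(which dp) train input.json
# Then inspect where the time goes, sorted by cumulative time:
import pstats

stats = pstats.Stats("dp_train.prof")
stats.sort_stats("cumulative").print_stats(30)  # show the 30 most expensive call paths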

@njzjz njzjz linked a pull request May 17, 2023 that will close this issue
wanghan-iapcm pushed a commit that referenced this issue May 19, 2023
Fix #2229.

Train models and prefetch data in parallel to decouple the time when
data is produced from the time when data is consumed.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
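In spirit, the fix is a producer/consumer pattern: a background thread keeps loading the next batch from disk while the GPU trains on the current one, so the data load time no longer adds to the wall time. A generic sketch of the idea (not the actual PR code; get_batch and n_batches are placeholders):

import queue
import threading

def prefetching_batches(get_batch, n_batches, buffer_size=4):
    """Yield training batches while a background thread keeps loading the next ones."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for _ in range(n_batches):
            q.put(get_batch())  # disk I/O happens here, off the training thread
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not done:
        yield batch  # the training step only waits if the producer falls behind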