
RAM keeps rising during training #518

Closed
tz301 opened this issue Dec 21, 2021 · 22 comments

@tz301

tz301 commented Dec 21, 2021

I have more than 10000h of audio; the features are stored on disk and I use the lazy format "cuts.jsonl.gz".

During training, I use K2SpeechRecognitionDataset and SingleCutSampler with --world-size=4 and --num-workers=1; the memory usage is shown below. OOM happens and I cannot even finish one epoch.

[screenshot: memory usage with --world-size=4]

If I set --world-size=1 (not using DistributedDataParallel) and --num-workers=5, the main process seems to have stable memory, but all the dataloader workers show rising memory.

[screenshots: main process and dataloader worker memory usage with --world-size=1]

@csukuangfj
Copy link
Contributor

How about disabling the caching mechanism in Lhotse?
That is, use

from lhotse import set_caching_enabled
set_caching_enabled(False)

There is some kind of caching while reading features: the reader's read() method is decorated with @dynamic_lru_cache. From lhotse/caching.py (lines 37 to 44 at 5e99fb2):

To disable/enable caching globally in Lhotse, call::
>>> from lhotse import set_caching_enabled
>>> set_caching_enabled(True)   # enable
>>> set_caching_enabled(False)  # disable
Currently it hard-codes the cache size at 512 items
(regardless of the array size).

If the size of your feature file is large, your RAM usage will go up as the training proceeds.

@tz301
Author

tz301 commented Dec 21, 2021

How about disabling the caching mechanism in Lhotse? That is, use

The problem occurs with lhotse==0.11.0, and it seems no cache is used in that version:

def read(
    self,
    key: str,
    left_offset_frames: int = 0,
    right_offset_frames: Optional[int] = None,
) -> np.ndarray:
    with open(self.storage_path / key, "rb") as f:
        arr = lilcom.decompress(f.read())
    return arr[left_offset_frames:right_offset_frames]

I also updated to lhotse==0.12.0 and disabled the caching mechanism at the top of train.py. Memory is still rising.

[screenshot: memory usage after disabling caching]

@csukuangfj
Contributor

Then how about deleting the following line

self.diagnostics.keep(cuts)

It keeps track of every returned CutSet in memory. For a dataset with 10000 hours, it may consume lots of RAM.

@tz301
Author

tz301 commented Dec 21, 2021

Then how about deleting the following line

self.diagnostics.keep(cuts)

It keeps track of every returned CutSet in memory. For a dataset with 10000 hours, it may consume lots of RAM.

It does not seem to be this cache; the slope of the memory growth looks unchanged.

[screenshot: memory usage after removing the diagnostics line]

@pzelasko
Collaborator

Thanks for reporting this! Another group also made me aware of this behaviour, and they suspect the culprit is that Python objects (such as cuts) are shared between processes with copy-on-write semantics, but as soon as they pop out of pickle the memory is copied (and not freed until the process dies), because Python increments the ref counts of these objects. There are some relevant PyTorch issues as well; I'll try to link them here later. I haven't looked into it yet, but it's on my list to figure out.
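
As an illustration of that mechanism, a hypothetical, self-contained sketch (not Lhotse code): a dataset that keeps a large list of Python objects will see each forked dataloader worker's RSS grow as items are touched, because merely reading an object updates its refcount and dirties the shared copy-on-write page, while the same data backed by a numpy array stays shared.

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ListBackedDataset(Dataset):
    # A big list of Python objects: each access in a forked worker bumps
    # refcounts and dirties copy-on-write pages, so worker RSS keeps growing.
    def __init__(self, num_items: int = 1_000_000):
        self.items = [{"id": i, "dur": 1.0} for i in range(num_items)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]["dur"]


class ArrayBackedDataset(Dataset):
    # The same data in one numpy array: no per-item Python objects, so the
    # forked workers keep sharing the parent's pages and RSS stays flat.
    def __init__(self, num_items: int = 1_000_000):
        self.durations = np.ones(num_items, dtype=np.float32)

    def __len__(self):
        return len(self.durations)

    def __getitem__(self, idx):
        return float(self.durations[idx])


if __name__ == "__main__":
    for dataset in (ListBackedDataset(), ArrayBackedDataset()):
        loader = DataLoader(dataset, batch_size=256, num_workers=2)
        for _ in loader:
            pass  # watch per-worker RSS (e.g. in top) while this runs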

@pzelasko
Collaborator

I wrote the following snippet to try to replicate the issue, but so far I do not see the memory growing in either the lazy = True or the lazy = False scenario. The only difference I see between my setup and yours is that you mentioned you have precomputed features; I might need to test this with the HDF5 feature loader in that case, as I checked only reading audio.

Can you confirm this reflects your use-case?

import lhotse
from lhotse import load_manifest, load_manifest_lazy
from lhotse.dataset import BucketingSampler, CutConcatenate
from lhotse.dataset.sampling import DynamicBucketingSampler
from lhotse.dataset.collation import collate_audio
from tqdm import tqdm
import torch

lhotse.set_caching_enabled(False)
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

lazy = True
#lazy = False


class Dset(torch.utils.data.Dataset):
    def __getitem__(self, cuts):
        cuts = CutConcatenate(gap=0.3, duration_factor=2)(cuts)  # to do sth with cuts, trigger refcounts, etc.
        data = collate_audio(cuts)  # audio tensor I/O
        return data


path = "..."
dset = Dset()
nj = 1


if lazy:
    cuts = load_manifest_lazy(path)
    sampler = DynamicBucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
    dloader = torch.utils.data.DataLoader(dset, sampler=sampler, batch_size=None, num_workers=nj)
    for batch in tqdm(dloader):
        pass
else:
    cuts = load_manifest(path)
    sampler = BucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
    dloader = torch.utils.data.DataLoader(dset, sampler=sampler, batch_size=None, num_workers=nj)
    for batch in tqdm(dloader):
        pass

@pzelasko
Collaborator

... some update: when reading audio with 1 process, the mem usage stops at 230MB; when reading features from a single HDF5 file, the mem usage grows until ~700MB and then seems to stop. All Lhotse caches are disabled (even commented out), so I don't know where this memory is going yet. It seems to be HDF5 related; I'm not sure yet whether this memory is used per open HDF5 file or just per process.

@tz301
Author

tz301 commented Dec 22, 2021

Hi, below is an experiment with --world-size=1 and --num-workers=5.

The memory does not always grow; it may plateau for some hours and then grow again, but it never goes down.

I have generated many HDF5 files for the different subsets. For example: aishell1/feats/0.h5, aishell1/feats/1.h5, ..., aidatatang/feats/0.h5, aidatatang/feats/1.h5, ....

Since BucketingSampler cannot be used with such a large dataset for now, I order the cuts offline into duration buckets ([1s, 2s], [2s, 3s], ...), shuffle within each bucket, and finally combine everything into one lazy cuts.jsonl.gz.

I suspect that during some iterations certain feature files have not been read yet, since some subsets contain only long utterances. When such a file is finally read, memory increases; but the memory never decreases. Maybe some per-file cache is never released?

[screenshot: memory usage over many hours]
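
A small sketch of how per-worker memory can be tracked (using psutil; the helper below is hypothetical), called from the Dataset's __getitem__ so it runs inside each dataloader worker:

import os

import psutil


def log_worker_rss(tag: str = "") -> None:
    # Print the resident set size of the current process, in MB.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    print(f"[pid {os.getpid()}] {tag} RSS = {rss_mb:.1f} MB", flush=True)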

@pzelasko
Collaborator

Yes, it is due to HDF5 internal caches 👀

If you call lhotse.close_cached_file_handles() periodically (even after every batch, but it has to be inside the Dataset to make sure it's triggered in the worker process), it will force the HDF5 files to close and re-open when they are read from again. Apparently HDF5 has an internal cache, but I couldn't disable it just by changing the settings on that docs page (maybe you will have more luck than me). If we keep h5py.File objects open, the cache seems to grow over time; when closing the files after every batch, the memory usage seems to stay low.

Can you try to find the rest of the solution (disabling the HDF5 caches properly) from here? Otherwise opening/closing files all the time could incur a perf hit (though maybe not too much; it seems about the same speed in my mini-benchmark).
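
For example, a minimal sketch of placing that call inside the Dataset (the wrapper class here is hypothetical, assuming a K2SpeechRecognitionDataset-based setup), so it runs in the worker process after every batch:

import lhotse
from lhotse import CutSet
from lhotse.dataset import K2SpeechRecognitionDataset


class HandleClosingDataset(K2SpeechRecognitionDataset):
    # Close Lhotse's cached HDF5 file handles after every batch so that the
    # HDF5 internal caches do not keep accumulating inside the worker process.
    def __getitem__(self, cuts: CutSet):
        batch = super().__getitem__(cuts)
        lhotse.close_cached_file_handles()  # runs inside the dataloader worker
        return batch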

@tacaswell

Do you see any dependence on the chunk size in your data?

@pzelasko
Collaborator

I didn't try tweaking it -- but our most common use case is actually storing a binary blob within the HDF5 file, which we get from the lilcom package for lossy array compression. See:

def write(self, key: str, value: np.ndarray) -> str:
    serialized_feats = lilcom.compress(value, tick_power=self.tick_power)
    self.hdf.create_dataset(key, data=np.void(serialized_feats))
    return key

The reads are then performed like this:

arr = lilcom.decompress(self.hdf[key][()].tobytes())

(if I am doing something very stupid with HDF5 here please call it out :)

@tacaswell

This is not letting you take advantage of what hdf5 is good at (it is aware of what an array is, can do tiled/chunked access, take care of the compression for you, nice slicing)! It looks like you are effectively using hdf5 as a low-performance object/blob-store!

I would either sort out whether one of the (lossy) filters that HDF5 supports can replace your compression, or, if you want to stick with this compression algorithm, accept that you have invented a custom binary file format and just drop the blobs onto disk 🤷🏻. If you are relying on attrs, you can probably replace that with a JSON / YAML file.
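
For reference, a small sketch of the kind of usage being suggested (file and dataset names are made up): store the float array directly and let HDF5 handle chunking, compression, and slicing.

import h5py
import numpy as np

feats = np.random.randn(1523, 80).astype(np.float32)  # (num_frames, num_bins)

with h5py.File("feats.h5", "w") as f:
    # HDF5 owns the array: chunked along time with a built-in (lossless) filter.
    f.create_dataset(
        "utt-0001",
        data=feats,
        chunks=(128, feats.shape[1]),
        compression="gzip",
    )

with h5py.File("feats.h5", "r") as f:
    # Slice reads decompress only the chunks that are touched.
    window = f["utt-0001"][100:200]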

@pzelasko
Collaborator

Thanks, I see your points. You're correct.

Assuming I dropped the custom compression algorithm, wouldn't the cache and memory issues still kick in? The major pain point is that the arrays representing speech features have different lengths. I wasn't able to find a way to make HDF5 "datasets" work with unevenly shaped sequences like that, so we would still need to create a separate "dataset" for every utterance.

@tacaswell

That would depend on exactly how the HDF5 cache is implemented (e.g. if it always keeps at least one chunk cached even when it is over the max size, then moving to smaller chunks might let you escape).

It may be worth your time to look into awkward array (https://awkward-array.org/what-is-awkward.html), which among other things handles arrays with a ragged dimension extremely well. It serializes to/from Parquet files ~natively (it is Arrow buffers under the hood).
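
For illustration, a small sketch of how ragged feature arrays might round-trip through awkward + Parquet (file name and shapes are made up):

import awkward as ak
import numpy as np

# Two utterances of different lengths, 80-dim features each: a ragged array.
feats = ak.Array([
    np.random.randn(1523, 80).tolist(),
    np.random.randn(312, 80).tolist(),
])

ak.to_parquet(feats, "feats.parquet")  # Arrow/Parquet stores the raggedness natively
loaded = ak.from_parquet("feats.parquet")
first_utt = ak.to_numpy(loaded[0])     # back to a regular (1523, 80) ndarray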

@pzelasko
Collaborator

Thanks, will check it out!

@wgb14
Contributor

wgb14 commented Dec 31, 2021

Hi, @tz301 @pzelasko, any updates on this issue?

I also found this issue while training on GigaSpeech, but didn't find the right place to call lhotse.close_cached_file_handles() to make sure it's triggered in the worker process.

Then I tried the conversion suggested in k2-fsa/icefall#120 (comment). I set -j 1 because each worker needs 80G of memory, and there are 1000 workers by default.

lhotse copy-feats -t lilcom_chunky -j 1 data/fbank/cuts_XL.jsonl.gz data/fbank_cpy/cuts_XL.jsonl.gz data/fbank_cpy/XL_split_1000

But the conversion is really slow: it took 6 hours to convert one piece, and I have 1000 pieces in total.
Here are the file sizes of the first piece before and after conversion:

data/fbank/XL_split_1000/cuts_XL.0001.jsonl.gz 1.30MB
data/fbank/XL_split_1000/feats_XL_0001.h5      1.67GB
data/fbank_cpy/XL_split_1000/cuts-0.jsonl.gz   1.90MB
data/fbank_cpy/XL_split_1000/feats-0.lca       257GB

You can see that the feature file after conversion is extremely large.

Would it be better to convert pieces first, like

lhotse copy-feats -t lilcom_chunky -j 1 data/fbank/XL_split_1000/cuts_XL.0001.jsonl.gz data/fbank_cpy/XL_split_1000/cuts_XL.0001.jsonl.gz data/fbank_cpy/XL_split_1000

and then combine them, or redo the feature extraction?

@pzelasko
Collaborator

The copied feature file does seem extremely large. I’ll look into it.

It might make sense to just redo the feature extraction. While I try to resolve these, are you sure that on-the-fly features are not a better choice for this? I was able to train GigaSpeech-scale models that way without any of these issues.

@wgb14
Contributor

wgb14 commented Dec 31, 2021

It might make sense to just redo the feature extraction. While I try to resolve these, are you sure that on-the-fly features are not a better choice for this? I was able to train GigaSpeech-scale models that way without any of these issues.

I agree this seems to be the better choice! But I don't want to break the progress we have made, so just to confirm: does on-the-fly feature extraction work well with KaldifeatFbank, shuffling, DynamicBucketingSampler, and a lazily loaded training set (these are what I can recall for now)? Then I could do this just by setting --on-the-fly-feats True.

@pzelasko
Collaborator

Yes, it’ll work with all these things.
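
Roughly, the wiring looks like this (a hedged sketch of what --on-the-fly-feats sets up; the exact parameters live in the icefall recipe and may differ):

import torch
from lhotse import load_manifest_lazy
from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.input_strategies import OnTheFlyFeatures
from lhotse.dataset.sampling import DynamicBucketingSampler
from lhotse.features.kaldifeat import KaldifeatFbank, KaldifeatFbankConfig  # needs kaldifeat installed

cuts = load_manifest_lazy("data/fbank/cuts_XL.jsonl.gz")  # lazy manifest

dataset = K2SpeechRecognitionDataset(
    # Features are computed from audio inside the dataloader workers.
    input_strategy=OnTheFlyFeatures(KaldifeatFbank(KaldifeatFbankConfig())),
)
sampler = DynamicBucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)

for batch in dloader:
    pass  # "inputs", "supervisions", etc., as with precomputed features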

@pzelasko
Collaborator

@wgb14 FYI, I merged #527, which should make HDF5 usable again. I also merged #522, so if you extract the features for a new experiment, they will use the new format. I don't know why you ended up with such a large file after copying; maybe multiple (1000, as you mentioned) HDF5 files were all written there.

@tz301 you should also find that the RAM issues are gone if you use DynamicBucketingSampler (likely the memory usage will be much smaller even when using HDF5 files)

@pzelasko
Collaborator

I'm closing the issue, as I've successfully worked with large manifests and datasets after the addition of dynamic samplers and LilcomChunkyWriter / Reader, and more recently the WebDataset integration. If the growing memory issues persist for anybody, let me know and I can provide some guidance.
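
For anyone landing here later, a hedged sketch of extracting features with the chunky writer (paths and extractor settings are illustrative only):

from lhotse import Fbank, LilcomChunkyWriter, load_manifest

cuts = load_manifest("data/manifests/cuts_train.jsonl.gz")  # illustrative path

cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/fbank/feats_train",
    storage_type=LilcomChunkyWriter,  # chunked lilcom blobs in a single file
    num_jobs=4,
)
cuts.to_file("data/fbank/cuts_train.jsonl.gz")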

@desh2608
Collaborator

@pzelasko might be useful to include "best practices for working with large datasets" when you get around to writing tutorials :)
