
RAM keeps rising during training #518

Closed
tz301 opened this issue Dec 21, 2021 · 22 comments

@tz301

tz301 commented Dec 21, 2021

I have more than 10000h of audio; the features are stored on disk and I use the lazy format "cuts.jsonl.gz".

During training, I use K2SpeechRecognitionDataset and SingleCutSampler with --world-size=4 and --num-workers=1; the memory usage is shown below. OOM happens and I cannot even finish one epoch.

[screenshot: memory usage with --world-size=4]

If I set --world-size=1 (not using DistributedDataParallel) and --num-workers=5, the main process seems to have stable memory, but all the dataloader workers show rising memory.

[screenshots: main process and dataloader worker memory usage with --world-size=1]

@csukuangfj
Copy link
Contributor

How about disabling the caching mechanism in Lhotse?
That is, use

from lhotse import set_caching_enabled
set_caching_enabled(False)

There is some kind of caching while reading features: the reader's read() method is decorated with @dynamic_lru_cache. From lhotse/caching.py (lines 37 to 44 at 5e99fb2):

To disable/enable caching globally in Lhotse, call::
>>> from lhotse import set_caching_enabled
>>> set_caching_enabled(True)   # enable
>>> set_caching_enabled(False)  # disable
Currently it hard-codes the cache size at 512 items
(regardless of the array size).

If the size of your feature file is large, your RAM usage will go up as the training proceeds.

@tz301
Author

tz301 commented Dec 21, 2021

How about disabling the caching mechanism in Lhotse? That is, use

The problem occurs with lhotse==0.11.0, and it seems no cache is used in that version:

def read(
    self,
    key: str,
    left_offset_frames: int = 0,
    right_offset_frames: Optional[int] = None,
) -> np.ndarray:
    with open(self.storage_path / key, "rb") as f:
        arr = lilcom.decompress(f.read())
    return arr[left_offset_frames:right_offset_frames]

I also updated to lhotse==0.12.0 and disabled the caching mechanism at the top of train.py. Memory is still rising.

[screenshot: memory usage after disabling caching]

@csukuangfj
Contributor

Then how about deleting the following line

self.diagnostics.keep(cuts)

It keeps track of every returned CutSet in memory. For a dataset with 10000 hours, it may consume lots of RAM.

@tz301
Author

tz301 commented Dec 21, 2021

Then how about deleting the following line

self.diagnostics.keep(cuts)

It keeps track of every returned CutSet in memory. For a dataset with 10000 hours, it may consume lots of RAM.

It does not seem to be this cache; the slope of the memory growth looks unchanged.

[screenshot: memory usage after removing the diagnostics line]

@pzelasko
Collaborator

Thanks for reporting this! Another group also made me aware of this behaviour, and they suspect the culprit is that Python objects (such as cuts) are shared between processes with copy-on-write semantics, but as soon as they pop out of pickle the memory is copied (and not freed until the process dies), because Python increments the ref counts of these objects. There are some relevant PyTorch issues as well; I'll try to link them here later. I haven't looked into it yet, but it's on my list to figure out.
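
As an illustration of that mechanism, a hypothetical, self-contained sketch (not Lhotse code): a dataset that keeps a large list of Python objects will see each forked dataloader worker's RSS grow as items are touched, because merely reading an object updates its refcount and dirties the shared copy-on-write page, while the same data backed by a numpy array stays shared.

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ListBackedDataset(Dataset):
    # A big list of Python objects: each access in a forked worker bumps
    # refcounts and dirties copy-on-write pages, so worker RSS keeps growing.
    def __init__(self, num_items: int = 1_000_000):
        self.items = [{"id": i, "dur": 1.0} for i in range(num_items)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]["dur"]


class ArrayBackedDataset(Dataset):
    # The same data in one numpy array: no per-item Python objects, so the
    # forked workers keep sharing the parent's pages and RSS stays flat.
    def __init__(self, num_items: int = 1_000_000):
        self.durations = np.ones(num_items, dtype=np.float32)

    def __len__(self):
        return len(self.durations)

    def __getitem__(self, idx):
        return float(self.durations[idx])


if __name__ == "__main__":
    for dataset in (ListBackedDataset(), ArrayBackedDataset()):
        loader = DataLoader(dataset, batch_size=256, num_workers=2)
        for _ in loader:
            pass  # watch per-worker RSS (e.g. in top) while this runs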

@pzelasko
Collaborator

I wrote the following snippet to try to replicate the issue, but so far I do not see the memory growing in either the lazy = True or the lazy = False scenario. The only difference I see between my setup and yours is that you mentioned you have precomputed features; I might need to test this with the HDF5 feature loader in that case, as I checked only reading audio.

Can you confirm this reflects your use-case?

import lhotse
from lhotse import load_manifest, load_manifest_lazy
from lhotse.dataset import BucketingSampler, CutConcatenate
from lhotse.dataset.sampling import DynamicBucketingSampler
from lhotse.dataset.collation import collate_audio
from tqdm import tqdm
import torch

lhotse.set_caching_enabled(False)
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

lazy = True
#lazy = False


class Dset(torch.utils.data.Dataset):
    def __getitem__(self, cuts):
        cuts = CutConcatenate(gap=0.3, duration_factor=2)(cuts)  # to do sth with cuts, trigger refcounts, etc.
        data = collate_audio(cuts)  # audio tensor I/O
        return data


path = "..."
dset = Dset()
nj = 1


if lazy:
    cuts = load_manifest_lazy(path)
    sampler = DynamicBucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
    dloader = torch.utils.data.DataLoader(dset, sampler=sampler, batch_size=None, num_workers=nj)
    for batch in tqdm(dloader):
        pass
else:
    cuts = load_manifest(path)
    sampler = BucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
    dloader = torch.utils.data.DataLoader(dset, sampler=sampler, batch_size=None, num_workers=nj)
    for batch in tqdm(dloader):
        pass

@pzelasko
Collaborator

... some update: when reading audio with 1 process, the mem usage stops at 230MB; when reading features from a single HDF5 file, the mem usage grows until ~700MB and then seems to stop. All Lhotse caches are disabled (even commented out), so I don't know where this memory is going yet. It seems to be HDF5 related; I'm not sure yet whether this memory is used per open HDF5 file or just per process.

@tz301
Author

tz301 commented Dec 22, 2021

Hi, below is an experiment with --world-size=1 and --num-workers=5.

The memory does not always grow; it may plateau for some hours and then grow again, but it never goes down.

I have generated many HDF5 files for the different subsets. For example: aishell1/feats/0.h5, aishell1/feats/1.h5, ..., aidatatang/feats/0.h5, aidatatang/feats/1.h5, ....

Since BucketingSampler cannot be used with such a large dataset for now, I order the cuts offline into duration buckets ([1s, 2s], [2s, 3s], ...), shuffle within each bucket, and finally combine everything into one lazy cuts.jsonl.gz.

I suspect that during some iterations certain feature files have not been read yet, since some subsets contain only long utterances. When such a file is finally read, memory increases; but the memory never decreases. Maybe some per-file cache is never released?

[screenshot: memory usage over many hours]
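
A small sketch of how per-worker memory can be tracked (using psutil; the helper below is hypothetical), called from the Dataset's __getitem__ so it runs inside each dataloader worker:

import os

import psutil


def log_worker_rss(tag: str = "") -> None:
    # Print the resident set size of the current process, in MB.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    print(f"[pid {os.getpid()}] {tag} RSS = {rss_mb:.1f} MB", flush=True)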

@pzelasko
Collaborator

Yes, it is due to HDF5 internal caches 👀

If you call lhotse.close_cached_file_handles() periodically (even after every batch, but it has to be inside the Dataset to make sure it's triggered in the worker process), it will force the HDF5 files to close and re-open when they are read from again. Apparently HDF5 has an internal cache, but I couldn't disable it just by changing the settings on that docs page (maybe you will have more luck than me). If we keep h5py.File objects open, the cache seems to grow over time; when closing the files after every batch, the memory usage seems to stay low.

Can you try to find the rest of the solution (disabling the HDF5 caches properly) from here? Otherwise opening/closing files all the time could incur a perf hit (though maybe not too much; it seems about the same speed in my mini-benchmark).
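
For example, a minimal sketch of placing that call inside the Dataset (the wrapper class here is hypothetical, assuming a K2SpeechRecognitionDataset-based setup), so it runs in the worker process after every batch:

import lhotse
from lhotse import CutSet
from lhotse.dataset import K2SpeechRecognitionDataset


class HandleClosingDataset(K2SpeechRecognitionDataset):
    # Close Lhotse's cached HDF5 file handles after every batch so that the
    # HDF5 internal caches do not keep accumulating inside the worker process.
    def __getitem__(self, cuts: CutSet):
        batch = super().__getitem__(cuts)
        lhotse.close_cached_file_handles()  # runs inside the dataloader worker
        return batch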

@tacaswell

Do you see any dependence on the chunk size in your data?

@pzelasko
Collaborator

I didn't try tweaking it -- but our most common use case is actually storing a binary blob within the HDF5 file, which we get from the lilcom package for lossy array compression. See:

def write(self, key: str, value: np.ndarray) -> str:
    serialized_feats = lilcom.compress(value, tick_power=self.tick_power)
    self.hdf.create_dataset(key, data=np.void(serialized_feats))
    return key

The reads are then performed like this:

arr = lilcom.decompress(self.hdf[key][()].tobytes())

(if I am doing something very stupid with HDF5 here please call it out :)

@tacaswell

This is not letting you take advantage of what hdf5 is good at (it is aware of what an array is, can do tiled/chunked access, take care of the compression for you, nice slicing)! It looks like you are effectively using hdf5 as a low-performance object/blob-store!

I would either sort out whether one of the (lossy) filters that HDF5 supports can replace your compression, or, if you want to stick with this compression algorithm, accept that you have invented a custom binary file format and just drop the blobs onto disk 🤷🏻. If you are relying on attrs, you can probably replace that with a JSON / YAML file.
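
For reference, a small sketch of the kind of usage being suggested (file and dataset names are made up): store the float array directly and let HDF5 handle chunking, compression, and slicing.

import h5py
import numpy as np

feats = np.random.randn(1523, 80).astype(np.float32)  # (num_frames, num_bins)

with h5py.File("feats.h5", "w") as f:
    # HDF5 owns the array: chunked along time with a built-in (lossless) filter.
    f.create_dataset(
        "utt-0001",
        data=feats,
        chunks=(128, feats.shape[1]),
        compression="gzip",
    )

with h5py.File("feats.h5", "r") as f:
    # Slice reads decompress only the chunks that are touched.
    window = f["utt-0001"][100:200]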

@pzelasko
Collaborator

Thanks, I see your points. You're correct.

Assuming I dropped the custom compression algorithm, wouldn't the cache and memory issues still kick in? The major pain point is that the arrays representing speech features have different lengths. I wasn't able to find a way to make HDF5 "datasets" work with unevenly shaped sequences like that, so we would still need to create a separate "dataset" for every utterance.

@tacaswell

That would depend on exactly how the HDF5 cache is implemented (e.g. if it always keeps at least one chunk cached even when it is over the max size, then moving to smaller chunks might let you escape).

It may be worth your time to look into awkward array (https://awkward-array.org/what-is-awkward.html), which among other things handles arrays with a ragged dimension extremely well. It serializes to/from Parquet files ~natively (it is Arrow buffers under the hood).
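
For illustration, a small sketch of how ragged feature arrays might round-trip through awkward + Parquet (file name and shapes are made up):

import awkward as ak
import numpy as np

# Two utterances of different lengths, 80-dim features each: a ragged array.
feats = ak.Array([
    np.random.randn(1523, 80).tolist(),
    np.random.randn(312, 80).tolist(),
])

ak.to_parquet(feats, "feats.parquet")  # Arrow/Parquet stores the raggedness natively
loaded = ak.from_parquet("feats.parquet")
first_utt = ak.to_numpy(loaded[0])     # back to a regular (1523, 80) ndarray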

@pzelasko
Collaborator

Thanks, will check it out!

@wgb14
Contributor

wgb14 commented Dec 31, 2021

Hi, @tz301 @pzelasko, any updates on this issue?

I also found this issue while training on GigaSpeech, but didn't find the right place to call lhotse.close_cached_file_handles() to make sure it's triggered in the worker process.

Then I tried the conversion suggested in k2-fsa/icefall#120 (comment). I set -j 1 because each worker needs 80G of memory, and there are 1000 workers by default.

lhotse copy-feats -t lilcom_chunky -j 1 data/fbank/cuts_XL.jsonl.gz data/fbank_cpy/cuts_XL.jsonl.gz data/fbank_cpy/XL_split_1000

But the conversion is really slow: it took 6 hours to convert one piece, and I have 1000 pieces in total.
Here are the file sizes of the first piece before and after conversion:

data/fbank/XL_split_1000/cuts_XL.0001.jsonl.gz 1.30MB
data/fbank/XL_split_1000/feats_XL_0001.h5      1.67GB
data/fbank_cpy/XL_split_1000/cuts-0.jsonl.gz   1.90MB
data/fbank_cpy/XL_split_1000/feats-0.lca       257GB

You can see that the feature file after conversion is extremely large.

Would it be better to convert pieces first, like

lhotse copy-feats -t lilcom_chunky -j 1 data/fbank/XL_split_1000/cuts_XL.0001.jsonl.gz data/fbank_cpy/XL_split_1000/cuts_XL.0001.jsonl.gz data/fbank_cpy/XL_split_1000

and then combine them, or redo the feature extraction?

@pzelasko
Collaborator

The copied feature file does seem extremely large. I’ll look into it.

It might make sense to just redo the feature extraction. While I try to resolve these, are you sure that on-the-fly features are not a better choice for this? I was able to train GigaSpeech-scale models that way without any of these issues.

@wgb14
Contributor

wgb14 commented Dec 31, 2021

It might make sense to just redo the feature extraction. While I try to resolve these, are you sure that on-the-fly features are not a better choice for this? I was able to train GigaSpeech-scale models that way without any of these issues.

I agree this seems to be the better choice! But I don't want to break the progress we have made, so just to confirm: does on-the-fly feature extraction work well with KaldifeatFbank, shuffling, DynamicBucketingSampler, and a lazily loaded training set (these are what I can recall for now)? Then I could do this just by setting --on-the-fly-feats True.

@pzelasko
Collaborator

Yes, it’ll work with all these things.
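
Roughly, the wiring looks like this (a hedged sketch of what --on-the-fly-feats sets up; the exact parameters live in the icefall recipe and may differ):

import torch
from lhotse import load_manifest_lazy
from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.input_strategies import OnTheFlyFeatures
from lhotse.dataset.sampling import DynamicBucketingSampler
from lhotse.features.kaldifeat import KaldifeatFbank, KaldifeatFbankConfig  # needs kaldifeat installed

cuts = load_manifest_lazy("data/fbank/cuts_XL.jsonl.gz")  # lazy manifest

dataset = K2SpeechRecognitionDataset(
    # Features are computed from audio inside the dataloader workers.
    input_strategy=OnTheFlyFeatures(KaldifeatFbank(KaldifeatFbankConfig())),
)
sampler = DynamicBucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
dloader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)

for batch in dloader:
    pass  # "inputs", "supervisions", etc., as with precomputed features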

@pzelasko
Collaborator

@wgb14 FYI, I merged #527, which should make HDF5 usable again. I also merged #522, so if you extract the features for a new experiment, they will use the new format. I don't know why you ended up with such a large file after copying; maybe multiple (1000, as you mentioned) HDF5 files were all written there.

@tz301 you should also find that the RAM issues are gone if you use DynamicBucketingSampler (likely the memory usage will be much smaller even when using HDF5 files)

@pzelasko
Collaborator

I'm closing the issue, as I've successfully worked with large manifests and datasets after the addition of dynamic samplers and LilcomChunkyWriter / Reader, and more recently the WebDataset integration. If the growing memory issues persist for anybody, let me know and I can provide some guidance.
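
For anyone landing here later, a hedged sketch of extracting features with the chunky writer (paths and extractor settings are illustrative only):

from lhotse import Fbank, LilcomChunkyWriter, load_manifest

cuts = load_manifest("data/manifests/cuts_train.jsonl.gz")  # illustrative path

cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/fbank/feats_train",
    storage_type=LilcomChunkyWriter,  # chunked lilcom blobs in a single file
    num_jobs=4,
)
cuts.to_file("data/fbank/cuts_train.jsonl.gz")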

@desh2608
Collaborator

@pzelasko might be useful to include "best practices for working with large datasets" when you get around to writing tutorials :)
