RAM keeps rising during training #518
How about disabling the caching mechanism in Lhotse?

```python
from lhotse import set_caching_enabled

set_caching_enabled(False)
```

There is some kind of caching while reading features, see lines 243 to 244 and lines 37 to 44 in 5e99fb2. If the size of your feature file is large, your RAM usage will go up as the training proceeds.
That occurs with lhotse==0.11.0; it seems no cache is used in that version (lines 184 to 192 in 6d60f33).
I also updated to lhotse==0.12.0 and disabled the caching mechanism at the top of train.py. Memory is still rising.
Then how about deleting the following line: lhotse/lhotse/dataset/sampling/single_cut.py, line 245 in 5e99fb2. It keeps track of every returned cut.
It doesn't seem to be this cache; the slope of the memory growth is unchanged.
Thanks for reporting this! Another group also made me aware of this behaviour, and they suspect the culprit is that Python objects (such as cuts) are shared between two processes with copy-on-write semantics, but as soon as they pop out of pickle, the memory is copied (and not freed until the process dies) because Python increments the ref count of these objects. There are some relevant PyTorch issues as well; I'll try to link them here later. I haven't looked into it yet, but it's on my list to figure out.
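As an illustration of that mechanism (a toy example, not taken from Lhotse): on Linux with the default fork start method, a DataLoader worker that merely reads Python objects bumps their refcounts, which dirties the pages those objects live on and gradually copies the "shared" memory into the worker. psutil is used here only to report resident memory, and the sizes are illustrative.

```python
import os
import psutil  # only for reporting resident memory
import torch
from torch.utils.data import DataLoader, Dataset


class ListDataset(Dataset):
    def __init__(self, n: int = 2_000_000):
        # Many small Python objects, similar to a large in-memory manifest.
        self.items = [f"utt-{i}" for i in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        if idx % 200_000 == 0:
            rss = psutil.Process(os.getpid()).memory_info().rss / 1e6
            print(f"worker pid={os.getpid()} rss={rss:.0f} MB")
        # Touching the object is enough to increment its refcount
        # and trigger copy-on-write of its memory page.
        return len(self.items[idx])


loader = DataLoader(ListDataset(), batch_size=1024, num_workers=2)
for _ in loader:
    pass
```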
I wrote the following snippet to try and replicate the issue, but so far I do not see the memory growing, in either the lazy or the non-lazy variant. Can you confirm this reflects your use case?

```python
import lhotse
from lhotse import load_manifest, load_manifest_lazy
from lhotse.dataset import BucketingSampler, CutPairsSampler
from lhotse.dataset.sampling import DynamicBucketingSampler
from lhotse.dataset import UnsupervisedWaveformDataset
from lhotse.dataset import CutConcatenate
from lhotse.dataset.collation import collate_audio
from tqdm import tqdm
import torch

lhotse.set_caching_enabled(False)
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

lazy = True
# lazy = False


class Dset(torch.utils.data.Dataset):
    def __getitem__(self, cuts):
        cuts = CutConcatenate(gap=0.3, duration_factor=2)(cuts)  # do sth with cuts, trigger refcounts, etc.
        data = collate_audio(cuts)  # audio tensor I/O
        return data


path = "..."
dset = Dset()
nj = 1

if lazy:
    cuts = load_manifest_lazy(path)
    sampler = DynamicBucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
    dloader = torch.utils.data.DataLoader(dset, sampler=sampler, batch_size=None, num_workers=nj)
    for batch in tqdm(dloader):
        pass
else:
    cuts = load_manifest(path)
    sampler = BucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
    dloader = torch.utils.data.DataLoader(dset, sampler=sampler, batch_size=None, num_workers=nj)
    for batch in tqdm(dloader):
        pass
```
Some update: when reading audio with 1 process, the mem usage stops at 230 MB; when reading features from a single HDF5 file, the mem usage grows until ~700 MB and then seems to stop. All Lhotse caches are disabled (even commented out), so I don't know where this memory is going yet. It seems to be HDF5 related; I'm not sure yet whether this memory usage is per open HDF5 file or just per process.
Hi, below is an experiment with --world-size=1 and --num-workers=5. The memory does not always grow; it may plateau for some hours and then grow again, but it never goes down. I have generated many HDF5 files for different subsets, for example aishell1/feats/0.h5, aishell1/feats/1.h5, ..., aidatatang/feats/0.h5, aidatatang/feats/1.h5, .... Since BucketingSampler cannot be used with such a large dataset now, I order the cuts offline by duration buckets ([1s, 2s], [2s, 3s], ...), shuffle inside each duration bucket, and finally combine them into one lazy cuts.jsonl.gz. I suspect that during some iterations certain feature files have not been read yet, since some subsets only contain long utterances; when such a file is finally read, memory increases. But memory never decreases; maybe some cache of the feature file is never released?
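In code, that offline bucketing workaround could look roughly like the sketch below (paths are illustrative; this is a one-off preprocessing step, so loading the manifest eagerly is acceptable here):

```python
import random
from collections import defaultdict
from lhotse import CutSet, load_manifest

cuts = load_manifest("cuts_all.jsonl.gz")  # illustrative path

# Group cuts into whole-second duration buckets: [1s, 2s), [2s, 3s), ...
buckets = defaultdict(list)
for cut in cuts:
    buckets[int(cut.duration)].append(cut)

# Shuffle within each bucket, then concatenate buckets in duration order.
ordered = []
for dur in sorted(buckets):
    random.shuffle(buckets[dur])
    ordered.extend(buckets[dur])

# Write one combined manifest that can later be read lazily during training.
CutSet.from_cuts(ordered).to_file("cuts.jsonl.gz")
```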
Yes, it is due to HDF5 internal caches 👀 If you add the following function call ... Can you try to find the rest of the solution (disabling HDF5 caches properly) from here? Otherwise, opening/closing files all the time could incur a perf hit (but maybe not too much? It seems to be about the same speed in my mini-benchmark).
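For context, h5py exposes the HDF5 chunk-cache parameters when a file is opened; a minimal sketch of shrinking (effectively disabling) the per-file chunk cache could look like the following. The file name and key are placeholders, and the exact semantics of the rdcc_* parameters are described in the HDF5/h5py docs.

```python
import h5py

# rdcc_nbytes is the per-file chunk cache size and rdcc_nslots the number of
# cache slots; setting them to 0 (or very small values) effectively disables
# chunk caching for this file handle.
with h5py.File("feats.h5", "r", rdcc_nbytes=0, rdcc_nslots=0) as f:
    arr = f["some_utterance_id"][()]  # read one stored array
```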
Do you see any dependence on the chunk size in your data?
I didn't try tweaking it -- but our most common use case is actually storing a binary blob within the HDF5 file, which we get from applying our own compression (see lines 534 to 537 in 4429e88). The reads are then performed like this: line 490 in 4429e88. (If I am doing something very stupid with HDF5 here, please call it out :)
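For readers unfamiliar with the pattern being described, a rough sketch of the blob-in-HDF5 approach looks like this -- the helper names, file name, and keys are placeholders, not Lhotse's actual code:

```python
import numpy as np
import h5py


def write_blob(f: h5py.File, key: str, blob: bytes) -> None:
    # Store the raw compressed bytes as a 1-D uint8 dataset keyed by utterance id.
    f.create_dataset(key, data=np.frombuffer(blob, dtype=np.uint8))


def read_blob(f: h5py.File, key: str) -> bytes:
    # Read the whole dataset back and rebuild the original byte string.
    return f[key][()].tobytes()


with h5py.File("feats.h5", "w") as f:  # illustrative file name
    write_blob(f, "utt-0001", b"compressed-bytes-go-here")

with h5py.File("feats.h5", "r") as f:
    blob = read_blob(f, "utt-0001")
```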
This is not letting you take advantage of what HDF5 is good at (it is aware of what an array is, can do tiled/chunked access, take care of the compression for you, nice slicing)! It looks like you are effectively using HDF5 as a low-performance object/blob store! I would either sort out whether one of the (lossy) filters that HDF5 supports can replace your compression, or, if you want to stick with this compression algorithm, accept that you have invented a custom binary file format and just drop the blobs onto disk 🤷🏻. If you are relying on ...
Thanks, I see your points; you're correct. Assuming I dropped the custom compression algorithm, wouldn't the cache and memory issues still kick in? The major pain point is that the arrays representing speech features have different lengths. I wasn't able to find a way to make HDF5 "datasets" work with uneven shapes like that, so we would still need to create a separate "dataset" for every utterance.
That would depend on exactly how the HDF5 cache is implemented (e.g. if they always keep at least one chunk cached even if it is over the max size, and you move to smaller chunks, you might escape). It may be worth your time to look into awkward array (https://awkward-array.org/what-is-awkward.html), which among other things handles arrays with a ragged dimension extremely well. It serializes to/from Parquet files ~natively (it is Arrow buffers under the hood).
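A small sketch of that idea -- storing variable-length feature arrays as one ragged array and round-tripping through Parquet. File names are illustrative, and it requires the awkward and pyarrow packages.

```python
import awkward as ak
import numpy as np

# Two utterances with different numbers of frames but the same feature dim.
feats = [np.random.rand(137, 80), np.random.rand(512, 80)]

ragged = ak.Array(feats)                 # one ragged dimension, no padding
ak.to_parquet(ragged, "feats.parquet")   # Arrow buffers under the hood

loaded = ak.from_parquet("feats.parquet")
first = ak.to_numpy(loaded[0])           # back to a regular (137, 80) array
```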
Thanks, will check it out!
Hi @tz301 @pzelasko, any updates on this issue? I also ran into it while training on GigaSpeech, but didn't find the right place to call ... Then I tried the conversion from k2-fsa/icefall#120 (comment), setting ...
But the converting process is really slow: it took 6 hours to convert one piece, and I have 1000 pieces in total.
You can also see that the feature file after converting is extremely large. Would it be better to convert the pieces first, like ..., and then combine them, or to redo the feature extraction?
The copied feature file does seem extremely large; I'll look into it. It might make sense to just redo the feature extraction. While I try to resolve these issues, are you sure that on-the-fly features are not a better choice for this? I was able to train GigaSpeech-scale models that way without any of these issues.
I do agree this seems to be a better choice! But I don't want to break the progress we have made, so just to confirm: does on-the-fly extraction work well with features like KaldifeatFbank, shuffling, DynamicBucketingSampler, and a lazily loaded training set (these are what I can recall for now)? So that I can do this just by setting ...
Yes, it’ll work with all these things.
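For anyone reading along, a minimal sketch of wiring those pieces together (assuming current Lhotse APIs -- exact names may differ between versions; KaldifeatFbank requires the kaldifeat package, and the manifest path and batch sizes are placeholders):

```python
import torch
from lhotse import load_manifest_lazy
from lhotse.dataset import K2SpeechRecognitionDataset, OnTheFlyFeatures
from lhotse.dataset.sampling import DynamicBucketingSampler
from lhotse.features.kaldifeat import KaldifeatFbank, KaldifeatFbankConfig

# Lazily loaded training manifest; no precomputed features are needed.
cuts = load_manifest_lazy("cuts.jsonl.gz")

# Features are computed on the fly inside the dataloader workers.
dataset = K2SpeechRecognitionDataset(
    input_strategy=OnTheFlyFeatures(KaldifeatFbank(KaldifeatFbankConfig()))
)
sampler = DynamicBucketingSampler(cuts, max_duration=200, num_buckets=30, shuffle=True)
dloader = torch.utils.data.DataLoader(
    dataset, sampler=sampler, batch_size=None, num_workers=2
)
```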
@wgb14 FYI I merged #527 which should make HDF5 usable again. I also merged #522 so if you extract the features for a new experiment, they will use the new format. I don't know why you ended up having such a large file after copying. Maybe multiple (1000 as you mentioned) HDF5 files were all written there. @tz301 you should also find that the RAM issues are gone if you use DynamicBucketingSampler (likely the memory usage will be much smaller even when using HDF5 files) |
I'm closing the issue, as I've successfully worked with large manifests and datasets after the addition of dynamic samplers and LilcomChunkyWriter / Reader, and recently also the WebDataset integration. If the growing memory issues persist for anybody, LMK and I can provide some guidance.
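For reference, a minimal sketch of extracting features with the LilcomChunkyWriter storage backend mentioned above (paths, the Fbank configuration, and num_jobs are illustrative):

```python
from lhotse import CutSet, Fbank, LilcomChunkyWriter

cuts = CutSet.from_file("cuts.jsonl.gz")

# Compute features and store them in the lilcom "chunky" archive format
# instead of per-utterance HDF5 datasets.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="feats",  # directory/prefix for the chunky archive
    storage_type=LilcomChunkyWriter,
    num_jobs=4,
)

# Save the manifest that now references the stored features.
cuts.to_file("cuts_with_feats.jsonl.gz")
```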
@pzelasko might be useful to include "best practices for working with large datasets" when you get around to writing tutorials :)
I have more than 10000 hours of audio, store features on disk, and use the lazy manifest format "cuts.jsonl.gz".
During training, I use K2SpeechRecognitionDataset and SingleCutSampler with --world-size=4 and --num-workers=1; the memory usage can be seen below. OOM happened and I could not train for one epoch.
If I set --world-size=1 (not using DistributedDataParallel) and --num-workers=5, the main process seems to have stable memory, but all dataloader workers show rising memory.
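A minimal sketch of the data pipeline described above (reconstructed from the description, since the actual train.py is not shown in the thread; the max_duration value and manifest path are illustrative):

```python
import torch
from lhotse import load_manifest_lazy
from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.sampling import SingleCutSampler

cuts = load_manifest_lazy("cuts.jsonl.gz")  # lazy manifest with precomputed features
sampler = SingleCutSampler(cuts, max_duration=200, shuffle=True)
dataset = K2SpeechRecognitionDataset()
dloader = torch.utils.data.DataLoader(
    dataset, sampler=sampler, batch_size=None, num_workers=1
)
```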