LF-MMI GPU OOM #196

wwxm0523 · 2022-01-29T12:11:44Z

I get a GPU OOM error when I train with LF-MMI. My token set size is about 1300. How can I avoid this problem?

csukuangfj · 2022-01-29T13:29:31Z

What's your training command? What's the value of --max-duration?

danpovey · 2022-01-30T02:34:10Z

It would be helpful to see the traceback from when it dies.

wwxm0523 · 2022-01-30T05:51:48Z

This is the error log. (When the number of phones is 220, training runs normally.)
`2022-01-30 05:34:59,582 INFO Loading L.fst
INFO from MMI module:
device: cuda
use pruned_intersect: True
use segment info: True
self.lo Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=256, out_features=1253, bias=True)
)
number of phones 1252
2022-01-30 05:35:05,540 INFO Epoch 0 TRAIN info lr 4e-08
2022-01-30 05:35:05,542 INFO using accumulate grad, new batch size is 4 timeslarger than before
2022-01-30 05:35:06,842 DEBUG TRAIN Batch 0/15013 loss 247.649350 loss_att 77.322586 loss_mmi 110.531494 lr 0.00000004 rank 0
2022-01-30 05:36:13,933 DEBUG TRAIN Batch 100/15013 loss 338.543274 loss_att 106.091759 loss_mmi 123.042969 lr 0.00000104 rank 0
terminate called after throwing an instance of 'c10::CUDAOutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 1.73 GiB (GPU 0; 23.70 GiB total capacity; 19.65 GiB already allocated; 1.06 GiB free; 21.29 GiB reserved in total by PyTorch)
Exception raised from malloc at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:288 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f71382b72f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1bc21 (0x7f7138516c21 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x1c944 (0x7f7138517944 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1cf63 (0x7f7138517f63 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::Allocator::raw_allocate(unsigned long) + 0x2f (0x7f709044afaf in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #5: k2::PytorchCudaContext::Allocate(unsigned long, void**) + 0x5f (0x7f709044b65f in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #6: k2::NewRegion(std::shared_ptrk2::Context, unsigned long) + 0x175 (0x7f709016b015 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #7: k2::Renumbering::ComputeOld2New() + 0x96 (0x7f70901288f6 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #8: k2::Renumbering::Old2New(bool) + 0xc8 (0x7f70902b5b78 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #9: k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int, int) + 0x907 (0x7f70902c7547 in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #10: std::_Function_handler<void (), k2::MultiGraphDenseIntersectPruned::Intersect()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 0x26e (0x7f70902ca58e in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #11: k2::ThreadPool::ProcessTasks() + 0x16d (0x7f709041027d in /opt/conda/lib/python3.8/site-packages/k2-1.9.dev20220119+cuda11.1.torch1.8.1-py3.8-linux-x86_64.egg/libk2context.so)
frame #12: + 0xc9039 (0x7f719dc49039 in /opt/conda/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #13: + 0x76db (0x7f71c00216db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #14: clone + 0x3f (0x7f71bfd4a71f in /lib/x86_64-linux-gnu/libc.so.6)

Killing subprocess 3803024
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
`

danpovey · 2022-01-30T08:16:43Z

Hm, there should be a max_arcs option to MultiGraphDenseIntersectPruned() [I forget the python-level wrapper, probably intersect_dense_pruned()]. Setting that to, e.g. 1000, may resolve the issue. Early in training you can get too many arcs active, and if you are using the "normal" topology (not modified topology), the LF-MMI denominator graph size is quadratic in the number of symbols.
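To make the "quadratic in the number of symbols" point concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified fully connected topology in which every symbol state has an arc to every symbol state; the helper name `dense_topology_arcs` is made up for illustration, and this is not k2's exact arc count. It compares the two phone-set sizes mentioned in this thread (220 and 1252).

```python
def dense_topology_arcs(num_symbols: int) -> int:
    """Arc count under the simplifying assumption that every symbol
    state has one arc to every symbol state (quadratic growth)."""
    return num_symbols * num_symbols


# Phone-set sizes reported in this thread.
small = dense_topology_arcs(220)
large = dense_topology_arcs(1252)

print(f"220 phones:  {small} arcs")
print(f"1252 phones: {large} arcs")
print(f"growth factor: {large / small:.1f}x")
```

Under that assumption, going from 220 to 1252 phones multiplies the graph size by roughly 32, not by the roughly 5.7x increase in the number of phones, which is consistent with the smaller setup fitting in memory while the larger one does not.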


wwxm0523 · 2022-01-31T06:36:21Z

Thanks. Is it max_active_states? Will lowering this parameter hurt training accuracy?

danpovey · 2022-01-31T07:42:53Z

It's better to set max_active_arcs. It may only be present in newer versions of k2. max_active_states is a bit less precise because some states can have many arcs leaving them.
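A toy illustration of why a cap on states is less precise than a cap on arcs. This is only a sketch, not k2's pruning logic (real pruning keeps the best-scoring states rather than the first ones); the function name and the out-degree values are assumptions chosen for illustration.

```python
def arcs_after_state_cap(out_degrees, max_active_states):
    """Total outgoing arcs when at most max_active_states states are kept.

    out_degrees[i] is the number of arcs leaving active state i.
    """
    return sum(out_degrees[:max_active_states])


cap = 100
sparse = [2] * 1000     # each active state has 2 outgoing arcs
dense = [1252] * 1000   # each active state fans out to 1252 symbols

sparse_arcs = arcs_after_state_cap(sparse, cap)
dense_arcs = arcs_after_state_cap(dense, cap)

print(sparse_arcs)  # 200
print(dense_arcs)   # 125200
```

The same state cap leaves very different arc counts depending on the fan-out, so memory use is only loosely bounded by a state limit, while an arc limit bounds it directly.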

