
Training error on the invoice dataset: Out of memory error on GPU 0. Cannot allocate 14.406982MB memory on GPU 0, 10.746094GB memory has been allocated and available memory is only 15.562500MB. #10247

Closed
dizhenx opened this issue Jun 27, 2023 · 9 comments
Labels: expneeded (need extra experiment to fix issue) · good first issue (Good for newcomers) · status/close

@dizhenx commented Jun 27, 2023

I downloaded the official invoice dataset for training. Running `python tools/train.py -c ./fapiao/train_data/ser_vi_layoutxlm.yml -o Global.save_model_dir=./output/kie/` fails with the error below. I have already set both batch_size and num_workers to 1 and it still fails. The single GPU has 11 GB of memory and no other process is using it.

```
Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 14.406982MB memory on GPU 0, 10.746094GB memory has been allocated and available memory is only 15.562500MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:95)
```

Also, why can only one GPU be used? Is there no way to set up multi-GPU training?
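
For reference, a minimal sketch of the two suggestions from the error message, assuming the standard PaddleOCR config keys Train.loader.batch_size_per_card and Eval.loader.batch_size_per_card (the exact key names in this particular yml are an assumption):

```bash
# 1. Check whether another process is holding memory on GPU 0
nvidia-smi

# 2. Re-run training with the loader batch sizes forced to 1 via -o overrides
#    (key names assume the standard PaddleOCR config layout)
python tools/train.py -c ./fapiao/train_data/ser_vi_layoutxlm.yml \
    -o Global.save_model_dir=./output/kie/ \
       Train.loader.batch_size_per_card=1 \
       Eval.loader.batch_size_per_card=1
```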
@dizhenx changed the title to "Training error on the invoice dataset: Out of memory error on GPU 0. Cannot allocate 14.406982MB memory on GPU 0, 10.746094GB memory has been allocated and available memory is only 15.562500MB." on Jun 27, 2023
@shiyutang added the good first issue (Good for newcomers) label on Jun 29, 2023
@livingbody (Contributor) commented

  • Multi-GPU training is possible; use paddle.distributed.launch --gpus '0,1,2,3' ...
  • For reference:

```bash
# Single-machine multi-GPU training; set the GPU IDs to use via the --gpus flag
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/det/det_mv3_db.yml \
     -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained
```
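
Applied to the command from this issue, it would look roughly like the following (a sketch only; the GPU IDs '0,1' are an assumption about the machine):

```bash
# Multi-GPU launch of the SER training from this issue; GPU IDs are illustrative
python3 -m paddle.distributed.launch --gpus '0,1' tools/train.py \
    -c ./fapiao/train_data/ser_vi_layoutxlm.yml \
    -o Global.save_model_dir=./output/kie/
```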

@shiyutang added and then removed the expneeded (need extra experiment to fix issue) label on Jun 30, 2023
@shiyutang (Collaborator) commented

@livingbody A further experiment is needed to check whether GPU memory still exceeds 11 GB even with bs=1.

@shiyutang added the expneeded (need extra experiment to fix issue) label on Jun 30, 2023
@dizhenx (Author) commented Jul 3, 2023

Yes.

@livingbody (Contributor) commented

> (quoting the original issue report above)

Could you share the project with me on AI Studio so I can look into the details?

@livingbody (Contributor) commented

> Yes.

Contact me on WeChat: livingbody

@adamzhg commented Oct 16, 2023

@livingbody @dizhenx Has this problem been solved?
My GPU is a single card with only 8 GB of memory, and I hit a similar error. I followed https://aistudio.baidu.com/projectdetail/4823162 for the invoice data training.

@ericyeyeye commented Dec 21, 2023

> @livingbody @dizhenx Has this problem been solved? My GPU is a single card with only 8 GB of memory, and I hit a similar error. I followed https://aistudio.baidu.com/projectdetail/4823162 for the invoice data training.

Please provide your training yml file so we can understand your settings. In the det model, Eval batch_size_per_card can only be set to 1.
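
For example, a quick way to confirm what the loader batch sizes in the config are actually set to (the path below is the one from the original issue and is only illustrative):

```bash
# Print every batch_size_per_card entry in the config, with line numbers
grep -n "batch_size_per_card" ./fapiao/train_data/ser_vi_layoutxlm.yml
```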

@papersuper commented
Was this solved in the end?

@UserWangZz (Collaborator) commented
This issue has not been updated for a long time, so it is being closed for now. It can be reopened if needed.
