AttributeError: module 'portalocker' has no attribute 'Lock' #4
It seems that something went wrong while loading the checkpoint from `detectron2://ImageNetPretrained/MSRA/R-50.pkl`.
Could you check the version of `portalocker`?

```python
>>> import portalocker
>>> portalocker.__version__
>>> portalocker.Lock
```
Sorry, there must be something wrong with my `portalocker`.
Try to re-install `portalocker` first. A quick check is sketched below.
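For what it's worth, a minimal post-reinstall sanity check (run it in a fresh interpreter; the exact version string will vary with whatever release pip installed):

```python
# Verify that the reinstalled portalocker actually exposes Lock,
# which is what iopath's file_lock() needs (see the traceback below).
import portalocker

print(portalocker.__version__)  # whatever release pip installed
print(portalocker.Lock)         # should print the Lock class instead of
                                # raising AttributeError
```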
How do I disable DDP?
I think this is the last bug before I can train YOLOF...
Comment out the lines that wrap the model with `DistributedDataParallel`. BTW, when you train with only one GPU, you should adjust the learning rate and batch size; refer to this response, and see the sketch below.
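For reference, a minimal sketch of the usual linear-scaling rule; the 64/0.12 defaults here are assumptions, so read the real `SOLVER.IMS_PER_BATCH` and `SOLVER.BASE_LR` from `configs/yolof_R_50_C5_1x.yaml`:

```python
# Linear-scaling rule: shrink the learning rate in proportion to the
# batch size when moving from the multi-GPU config to a single GPU.
default_batch = 64    # assumed IMS_PER_BATCH of the released config
default_lr = 0.12     # assumed BASE_LR paired with that batch size
single_gpu_batch = 8  # whatever fits in your one GPU's memory

scaled_lr = default_lr * single_gpu_batch / default_batch
print(scaled_lr)      # 0.015 -> use as SOLVER.BASE_LR, with
                      # SOLVER.IMS_PER_BATCH set to single_gpu_batch
```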
Added support for training with one GPU in this commit.
Thanks!
I'm really sorry, author; my skills are limited and yet another new bug has appeared. How should I fix this?
Could you give more details about which command you used?
command / yaml: `MODEL:`
Can you try this setting?
I've left the lab and will try it tomorrow. Thanks a lot from the bottom of my heart!
Good morning! When I tried your setting, it still hits the same bug:

```
[03/26 17:38:53 d2.engine.train_loop]: Starting training from iteration 0
/opt/conda/conda-bld/pytorch_1607370169888/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion
ERROR [03/26 17:39:00 d2.engine.train_loop]: Exception during training:
```
Sorry about that. Try to warm up for more iterations, e.g., by increasing `SOLVER.WARMUP_ITERS` (see the sketch below).
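To illustrate what a longer warmup changes, here is a minimal sketch of detectron2-style linear warmup (the formula matches detectron2's linear schedule; 3000 iterations is only an example value):

```python
# The LR multiplier ramps linearly from warmup_factor to 1.0 over
# warmup_iters iterations; raising warmup_iters keeps the LR small
# for longer, which often tames early training instability.
def warmup_multiplier(it: int, warmup_iters: int = 3000,
                      warmup_factor: float = 1.0 / 3000) -> float:
    if it >= warmup_iters:
        return 1.0
    alpha = it / warmup_iters
    return warmup_factor * (1 - alpha) + alpha

print(warmup_multiplier(0))     # ~0.00033: training starts very gently
print(warmup_multiplier(1500))  # ~0.5: halfway through warmup
print(warmup_multiplier(3000))  # 1.0: full BASE_LR from here on
```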
Thanks. It still does not work...
Could you upload your training log file?
```
[03/26 21:44:54] detectron2 INFO: Rank of current process: 0. World size: 2
sys.platform  linux
PyTorch built with:
[03/26 21:44:55] detectron2 INFO: Command line arguments: Namespace(config_file='./configs/yolof_R_50_C5_1x.yaml', dist_url='tcp://127.0.0.1:50159', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=['OUTPUT_DIR', '/hdd2/wh/cw/train/yolof/R_50_C5_1x/'], resume=False)
[03/26 21:44:55] detectron2 INFO: Running with full config:
[03/26 21:45:54 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
-- Process 1 terminated with the following error:
```
command / yaml: `OUTPUT_DIR: '/hdd2/wh/cw/train/yolof/R_50_C5_1x'`
Hello author, my shallow guess is that the error is data-related, perhaps an array index going out of bounds? But my dataset is coco2017, so this error shouldn't occur. I haven't changed anything else, and I also re-upgraded my original detectron2.
From the error, it looks like an index went out of bounds during indexing, but it's strange: I've run this many times before and never hit this problem. A quick look at the log file doesn't show anything obviously wrong either, so it doesn't make much sense... Does this error show up every single time you run it?
```
1. gt_class.size() torch.Size([38000])
2. gt_class.size() torch.Size([42000])
3. gt_class.size() torch.Size([38000])
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [9,0,0] Assertion
```
I don't think this part should go wrong. I'd suggest instrumenting it to see exactly where the error occurs (a sketch follows).
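A minimal, hypothetical sanity check for this kind of device-side assert (`check_labels` is my own helper, not part of YOLOF): the `IndexKernel.cu` assertion usually means an index, here most likely a class label, is out of range for the tensor being indexed, and validating on CPU turns the crash into a readable Python error. Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` also makes the failing kernel report synchronously. Note that some detectron2 heads legitimately use `num_classes` as the background label, so widen the bound if that applies:

```python
import torch

def check_labels(gt_classes: torch.Tensor, num_classes: int) -> None:
    """Raise a clear error if any label falls outside [0, num_classes)."""
    labels = gt_classes.cpu()
    bad = (labels < 0) | (labels >= num_classes)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} label(s) out of range, e.g. "
            f"{labels[bad][:10].tolist()} (num_classes={num_classes})"
        )

# COCO has 80 classes, so valid foreground labels are 0..79.
check_labels(torch.tensor([3, 79]), num_classes=80)  # passes silently
check_labels(torch.tensor([3, 80]), num_classes=80)  # raises ValueError
```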
It still doesn't work. I may have to try another machine; I'm not sure whether my CUDA 9.0 affects this.
Hello author, I switched to a machine with CUDA 10.1 and repeated my previous steps (basically a direct install; the dataset was also transferred directly from the original server).
Separately, there's one small place that I think you could consider revising.
Thanks for sharing your great work. I am sorry that I hit a bug when I run:

```
python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml
```
The bug log is below:

```
[03/26 07:38:03 d2.data.build]: Using training sampler TrainingSampler
[03/26 07:38:03 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[03/26 07:38:10 d2.data.common]: Serialized dataset takes 451.21 MiB
[03/26 07:38:15 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./tools/train_net.py", line 215, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 140, in load
    path = self.path_manager.get_local_path(path)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
    path, force=force, **kwargs
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/utils/file_io.py", line 29, in _get_local_path
    return PathManager.get_local_path(self.S3_DETECTRON2_PREFIX + name, **kwargs)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
    path, force=force, **kwargs
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 755, in _get_local_path
    with file_lock(cached):
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 82, in file_lock
    return portalocker.Lock(path + ".lock", timeout=3600)  # type: ignore
AttributeError: module 'portalocker' has no attribute 'Lock'
```

I would be grateful if you could give me some advice. Thanks.