Training Error #27

Open
philp123 opened this issue Oct 9, 2024 · 3 comments
philp123 commented Oct 9, 2024

(clip4str) root@Lab-PC:/workspace/Project/OCR/CLIP4STR# bash scripts/vl4str_base.sh
abs_root: /home/shuai
model:
  _convert_: all
  img_size:
    - 224
    - 224
  max_label_length: 25
  charset_train: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  charset_test: 0123456789abcdefghijklmnopqrstuvwxyz
  batch_size: 256
  weight_decay: 0.2
  warmup_pct: 0.075
  code_path: ${abs_root}/code/CLIP4STR
  name: vl4str
  _target_: strhub.models.vl_str.system.VL4STR
  patch_size:
    - 16
    - 16
  embed_dim: 512
  enc_num_heads: 12
  enc_mlp_ratio: 4
  enc_depth: 12
  enc_width: 768
  dec_num_heads: 8
  dec_mlp_ratio: 4
  dec_depth: 1
  enc_del_cls: false
  dec_ndim_no_decay: true
  context_length: 16
  use_language_model: true
  image_detach: true
  type_embedding: false
  cross_gt_context: true
  cross_cloze_mask: false
  cross_extra_attn: false
  cross_correct_once: false
  cross_loss_w: 1.0
  itm_loss: false
  itm_loss_weight: 0.1
  cross_token_embeding: false
  fusion_model: false
  image_freeze_nlayer: -1
  text_freeze_nlayer: 6
  image_freeze_layer_divisor: 0
  image_only_fc: false
  use_share_dim: true
  clip_cls_eot_feature: false
  lr: 8.4e-05
  coef_lr: 19.0
  coef_wd: 1.0
  perm_num: 6
  perm_forward: true
  perm_mirrored: true
  dropout: 0.1
  decode_ar: true
  refine_iters: 1
  freeze_backbone: false
  freeze_language_backbone: false
  clip_pretrained: /workspace/Project/OCR/CLIP4STR/pretrained/models--laion--CLIP-ViT-B-16-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
  find_unused_parameters: true
data:
  _target_: strhub.data.module.SceneTextDataModule
  root_dir: /workspace/Database/OCR/CLIP4STR/str_dataset_ub
  output_url: null
  train_dir: real
  batch_size: ${model.batch_size}
  img_size: ${model.img_size}
  charset_train: ${model.charset_train}
  charset_test: ${model.charset_test}
  max_label_length: ${model.max_label_length}
  remove_whitespace: true
  normalize_unicode: true
  augment: true
  num_workers: 8
  openai_meanstd: true
trainer:
  _target_: pytorch_lightning.Trainer
  _convert_: all
  val_check_interval: 2000
  max_epochs: 11
  gradient_clip_val: 20
  gpus: 1
  accumulate_grad_batches: 4
  precision: 16
ckpt_path: null
pretrained: null
swa: false

config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True, clip_cls_eot_feature: False
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

loading checkpoint from /workspace/Project/OCR/CLIP4STR/pretrained/models--laion--CLIP-ViT-B-16-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin
/workspace/Project/OCR/CLIP4STR/strhub/clip/clip.py:139: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(model_path, map_location="cpu")
The dimension of the visual decoder is 512.
   | Name                       | Type              | Params
------------------------------------------------------------------
0  | clip_model                 | CLIP              | 149 M
1  | clip_model.visual          | VisionTransformer | 86.2 M
2  | clip_model.transformer     | Transformer       | 37.8 M
3  | clip_model.token_embedding | Embedding         | 25.3 M
4  | clip_model.ln_final        | LayerNorm         | 1.0 K
5  | visual_decoder             | Decoder           | 4.3 M
6  | visual_decoder.layers      | ModuleList        | 4.2 M
7  | visual_decoder.text_embed  | TokenEmbedding    | 49.7 K
8  | visual_decoder.norm        | LayerNorm         | 1.0 K
9  | visual_decoder.dropout     | Dropout           | 0
10 | visual_decoder.head        | Linear            | 48.7 K
11 | cross_decoder              | Decoder           | 4.3 M
12 | cross_decoder.layers       | ModuleList        | 4.2 M
13 | cross_decoder.text_embed   | TokenEmbedding    | 49.7 K
14 | cross_decoder.norm         | LayerNorm         | 1.0 K
15 | cross_decoder.dropout      | Dropout           | 0
16 | cross_decoder.head         | Linear            | 48.7 K
------------------------------------------------------------------
114 M     Trainable params
44.3 M    Non-trainable params
158 M     Total params
633.025   Total estimated model params size (MB)
[dataset] mean (0.48145466, 0.4578275, 0.40821073), std (0.26862954, 0.26130258, 0.27577711)
/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting Trainer(gpus=1) is deprecated in v1.7 and will be removed in v2.0. Please use Trainer(accelerator='gpu', devices=1) instead.
rank_zero_deprecation(
Using 16bit None Automatic Mixed Precision (AMP)
/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/native_amp.py:47: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
scaler = torch.cuda.amp.GradScaler()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:92: UserWarning: When using Trainer(accumulate_grad_batches != 1) and overriding LightningModule.optimizer_{step,zero_grad}, the hooks will not be called on every batch (rather, they are called on every optimization step).
rank_zero_warn(
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[VL4STR] The length of encoder params with and without weight decay is 76 and 151, respectively.
[VL4STR] The length of decoder params with and without weight decay is 14 and 38, respectively.
Loading train_dataloader to estimate number of stepping batches.
dataset root: /workspace/Database/OCR/CLIP4STR/str_dataset_ub/train/real
lmdb: ArT/train num samples: 28828
lmdb: ArT/val num samples: 3200
lmdb: LSVT/test num samples: 4093
lmdb: LSVT/train num samples: 33199
lmdb: LSVT/val num samples: 4147
lmdb: benchmark/IIIT5k num samples: 2000
lmdb: benchmark/IC15 num samples: 4468
lmdb: benchmark/IC13 num samples: 848
lmdb: benchmark/SVT num samples: 257
lmdb: ReCTS/test num samples: 2467
lmdb: ReCTS/train num samples: 21589
lmdb: ReCTS/val num samples: 2376
lmdb: TextOCR/train num samples: 710994
lmdb: TextOCR/val num samples: 107093
lmdb: OpenVINO/train_5 num samples: 495833
lmdb: OpenVINO/train_2 num samples: 502769
lmdb: OpenVINO/train_f num samples: 470562
lmdb: OpenVINO/train_1 num samples: 443620
lmdb: OpenVINO/validation num samples: 158757
lmdb: RCTW17/test num samples: 1030
lmdb: RCTW17/train num samples: 8225
lmdb: RCTW17/val num samples: 1029
lmdb: MLT19/test num samples: 5669
lmdb: MLT19/train num samples: 45384
lmdb: MLT19/val num samples: 5674
lmdb: COCOv2.0/train num samples: 59733
lmdb: COCOv2.0/val num samples: 13394
lmdb: Union14M-L-LMDB/medium num samples: 218154
lmdb: Union14M-L-LMDB/hard num samples: 145523
lmdb: Union14M-L-LMDB/hell num samples: 479156
lmdb: Union14M-L-LMDB/difficult num samples: 297164
lmdb: Union14M-L-LMDB/simple num samples: 2076687
lmdb: Uber/train num samples: 91732
lmdb: Uber/val num samples: 36188
lmdb: The number of training samples is 6481842
Sanity Checking: 0it [00:00, ?it/s]dataset root: /workspace/Database/OCR/CLIP4STR/str_dataset_ub/val
lmdb: IIIT5k num samples: 2000
lmdb: IC15 num samples: 4467
lmdb: IC13 num samples: 843
lmdb: SVT num samples: 257
lmdb: The number of validation samples is 7567
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/nn/functional.py:5193: UserWarning: Support for mismatched key_padding_mask and attn_mask is deprecated. Use same type for both instead.
warnings.warn(
Epoch 0: 0%| | 0/25680 [00:00<?, ?it/s]Error executing job with overrides: ['+experiment=vl4str', 'model=vl4str', 'dataset=real', 'data.root_dir=/workspace/Database/OCR/CLIP4STR/str_dataset_ub', 'trainer.max_epochs=11', 'trainer.gpus=1', 'model.lr=8.4e-5', 'model.batch_size=256', 'model.clip_pretrained=/workspace/Project/OCR/CLIP4STR/pretrained/models--laion--CLIP-ViT-B-16-DataComp.XL-s13B-b90K/open_clip_pytorch_model.bin', 'trainer.accumulate_grad_batches=4']
Traceback (most recent call last):
  File "/workspace/Project/OCR/CLIP4STR/train.py", line 141, in <module>
    main()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/workspace/Project/OCR/CLIP4STR/train.py", line 100, in main
    trainer.fit(model, datamodule=datamodule, ckpt_path=config.ckpt_path)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 187, in advance
    batch = next(data_fetcher)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 571, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/pytorch_lightning/trainer/supporters.py", line 583, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 350, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/workspace/Project/OCR/CLIP4STR/strhub/data/dataset.py", line 134, in __getitem__
    img = self.transform(img)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 95, in __call__
    img = t(img)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/timm/data/auto_augment.py", line 751, in __call__
    img = op(img)
  File "/root/anaconda3/envs/clip4str/lib/python3.10/site-packages/timm/data/auto_augment.py", line 381, in __call__
    if self.prob < 1.0 and random.random() > self.prob:
TypeError: '<' not supported between instances of 'dict' and 'float'

The error seems to be related to the dataset pipeline, specifically when applying data augmentation.
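
To isolate this outside the Lightning trainer, one option is to call the training transform on a dummy image directly. A minimal sketch, assuming CLIP4STR keeps a parseq-style helper in strhub/data/augment.py (the rand_augment_transform name and its zero-argument call are assumptions; adjust if the repository differs):

# Minimal repro sketch for the augmentation failure, outside the DataLoader.
# Assumption: strhub/data/augment.py exposes a parseq-style rand_augment_transform().
from PIL import Image
from strhub.data.augment import rand_augment_transform

aug = rand_augment_transform()        # builds the timm-based RandAugment pipeline
img = Image.new('RGB', (224, 224))    # dummy image matching model.img_size
out = aug(img)                        # raises the same TypeError on incompatible timm versions
print(type(out))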

philp123 commented Oct 9, 2024

When I set augment: false (in configs/main.yaml), the code runs with "bash scripts/vl4str_base.sh", so something seems to be wrong with the data augmentation.

mzhaoshuai commented Oct 9, 2024

That's weird. I have never run into this problem.

Could the Python version matter, as in #15?

What is the timm version?
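
For reference, the installed timm version can be checked directly, since the failing frame is in timm/data/auto_augment.py:

# Print the timm version installed in the clip4str environment.
import timm
print(timm.__version__)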


anikde commented Oct 21, 2024

Running pip install timm==0.5.4 solved the above error for me.
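
A likely reason pinning timm helps (a sketch based on timm's public API, not verified against the CLIP4STR call site): in timm 0.5.x, auto_augment.rand_augment_ops takes (magnitude, hparams, transforms), whereas newer timm releases insert a prob parameter in second position. A caller that passes hparams positionally then stores the hparams dict in AugmentOp.prob, and the comparison self.prob < 1.0 fails with exactly this TypeError. Calling with keyword arguments is unambiguous on both old and new timm:

# Sketch of the suspected incompatibility; names come from timm's public API,
# and the CLIP4STR call site is assumed to pass hparams positionally (parseq-style).
from timm.data import auto_augment

hparams = {'magnitude_std': 0.5}  # illustrative hyper-parameters

# Positional call: on timm 0.5.4 the 2nd argument is hparams, but on newer timm
# it is prob, so the dict lands in AugmentOp.prob and later breaks at
# `if self.prob < 1.0` inside AugmentOp.__call__.
# ops = auto_augment.rand_augment_ops(2, hparams, transforms=['Rotate'])

# Keyword-argument call that works on both old and new timm:
ops = auto_augment.rand_augment_ops(magnitude=2, hparams=hparams, transforms=['Rotate'])
aug = auto_augment.RandAugment(ops, num_layers=2)

If upgrading timm is preferred over pinning it, switching the project's augmentation helper to keyword arguments as above should have the same effect, assuming that positional call is indeed the failing site.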
