Use use_reentrant in torch.utils.checkpoint #4

javak87 · 2024-10-07T13:24:48Z

I want to run a benchmark on the OpenFold branch.
When I run this code:

        python -u -m torchrun \
            --nproc_per_node=gpu \
            --nnodes=\"$SLURM_JOB_NUM_NODES\" \
            --rdzv_id=\"$SLURM_JOB_ID\" \
            --rdzv_endpoint=\"$MASTER_ADDR\":\"$MASTER_PORT\" \
            --rdzv_backend=c10d \
            /workspace/openfold/train.py \
            --training_dirpath /training_rundir/openfold \
            --pdb_mmcif_chains_filepath /data/pdb_data/pdb_mmcif/processed/chains.csv \
            --pdb_mmcif_dicts_dirpath /data/pdb_data/pdb_mmcif/processed/dicts \
            --pdb_obsolete_filepath /data/pdb_data/pdb_mmcif/processed/obsolete.dat \
            --pdb_alignments_dirpath /data/pdb_data/open_protein_set/processed/pdb_alignments \
            --initialize_parameters_from /data/mlperf_hpc_openfold_resumable_checkpoint_b518be46.pt \
            --seed $SEED \
            --num_train_iters 2000 \
            --val_every_iters 40 \
            --local_batch_size 4 \
            --base_lr 1e-3 \
            --warmup_lr_init 1e-5 \
            --warmup_lr_iters 0 \
            --num_train_dataloader_workers 16 \
            --num_val_dataloader_workers 2 \
            --distributed"

It produced the following output, which ends with repeated warnings:

:::MLLOG {"namespace": "", "time_ms": 1728306460732, "event_type": "POINT_IN_TIME", "key": "swa_decay_rate", "value": 0.9, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 594}}
:::MLLOG {"namespace": "", "time_ms": 1728306461790, "event_type": "POINT_IN_TIME", "key": "model_parameters_count", "value": 93229082, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 601}}
:::MLLOG {"namespace": "", "time_ms": 1728306461791, "event_type": "POINT_IN_TIME", "key": "staging_start", "value": null, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 617}}
:::MLLOG {"namespace": "", "time_ms": 1728306461791, "event_type": "POINT_IN_TIME", "key": "staging_stop", "value": null, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 624, "staging_duration": 4.1606836020946503e-07, "instance": 0}}
:::MLLOG {"namespace": "", "time_ms": 1728306461791, "event_type": "POINT_IN_TIME", "key": "tracked_stats", "value": {"staging_duration": 4.1606836020946503e-07}, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 629, "step": 0, "instance": 0}}
:::MLLOG {"namespace": "", "time_ms": 1728306461857, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 637}}
:::MLLOG {"namespace": "", "time_ms": 1728306461920, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 637}}
:::MLLOG {"namespace": "", "time_ms": 1728306516300, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 594595, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 651}}
:::MLLOG {"namespace": "", "time_ms": 1728306547952, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 180, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 669}}
:::MLLOG {"namespace": "", "time_ms": 1728306548901, "event_type": "POINT_IN_TIME", "key": "initial_training_dataloader_type", "value": "InitialTrainingDataloaderPT", "metadata": {"file": "/workspace/openfold/train.py", "lineno": 715}}
:::MLLOG {"namespace": "", "time_ms": 1728306548902, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "/workspace/openfold/train.py", "lineno": 738, "epoch_num": 640, "instance": 0}}
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:631: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:631: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:631: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:631: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)

Since I want to submit to the closed division, how can I get rid of this warning?
Should I change the code, or should I use another version of PyTorch?
Thanks
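
For context, the warning text itself suggests passing `use_reentrant` explicitly (`use_reentrant=False` is recommended, `use_reentrant=True` preserves the current default) wherever `torch.utils.checkpoint` is called. Because the message is an ordinary Python `UserWarning`, another option that leaves those calls untouched is a message-based filter from the standard `warnings` module. The snippet below is only a minimal sketch of that filter: `emit_like_checkpoint` is a hypothetical stand-in that raises a warning with the same prefix, not PyTorch or OpenFold code, and whether either change is permitted under the closed division rules is not something this sketch settles.

```python
import re
import warnings

# Prefix of the warning message copied from the log above.
WARNING_PREFIX = (
    "torch.utils.checkpoint: the use_reentrant parameter "
    "should be passed explicitly"
)


def silence_checkpoint_warning():
    """Ignore only UserWarnings whose message starts with WARNING_PREFIX."""
    # `message` is a regex matched against the start of the warning text,
    # so the literal prefix is escaped first.
    warnings.filterwarnings(
        "ignore",
        message=re.escape(WARNING_PREFIX),
        category=UserWarning,
    )


def emit_like_checkpoint():
    """Hypothetical stand-in: emits a warning with the same prefix."""
    warnings.warn(WARNING_PREFIX + ".", UserWarning)


def count_warnings(install_filter):
    """Return how many warnings get through, with or without the filter."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        if install_filter:
            silence_checkpoint_warning()
        emit_like_checkpoint()
        # An unrelated warning, to show the filter is targeted.
        warnings.warn("unrelated warning", UserWarning)
    return len(caught)
```

In a real script, `silence_checkpoint_warning()` would be called once near the top of the entry point, before training starts. The filter hides the message only; it does not change which checkpoint variant is used.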
