MODEL_DIR:
- The output logs, evaluation results and checkpoints during training are saved under the MODEL_DIR path defined in "trainval_distributed.py" for each dataset.
- It can be customized to any location. Experiments are separated into folders named by their start date-time, as in the sketch below.
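A minimal sketch of how such per-run folders can be created; the MODEL_DIR value and timestamp format here are illustrative, not the repo's exact code:

```python
import os
from datetime import datetime

# MODEL_DIR can point anywhere; "./output/citypersons" is only an example.
MODEL_DIR = "./output/citypersons"

# Each experiment gets its own folder named by the start date-time.
run_dir = os.path.join(MODEL_DIR, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
os.makedirs(run_dir, exist_ok=True)  # logs, eval results and checkpoints go here
```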
GPU and Batch-Size:
- In each "dist_train.sh" file, CUDA_VISIBLE_DEVICES defines the GPU IDs visible to the process.
- nproc_per_node specifies the number of GPUs actually used by the PyTorch DDP processes.
- NOTE: "trainval_distributed.py" uses the length of CUDA_VISIBLE_DEVICES as the GPU count for DDP, rather than nproc_per_node.
- If nproc_per_node=1 and the length of CUDA_VISIBLE_DEVICES is 1, single-GPU training is used.
- "config.onegpu" means the batch size on each GPU rather than the total batch size, as the sketch below illustrates.
- Training Phase:
- Reproducibility can only be guaranteed in the same environment, with hardware, software and all random seeds fixed.
- Different versions of hardware (e.g. GPUs and CPUs) or software (e.g. CUDA) cause such discrepancies. Please refer to this issue of PyTorch.
- For example, a model may always achieve a better result A on Machine 1, but always a worse result B on Machine 2.
- After various trials, we found 2 x GPU training on CityPersons to be more sensitive to environment changes than 1 x GPU training on Caltech.
- Evaluation Phase:
- Adjusting the "val_begin" epoch will lead to different results; please refer to this issue of PyTorch and this article (in Chinese).
- Training Phase:
- The same random seed does not always work across all machines; try another one.
- Set "self.gen_seed=True" in the config, and a new seed will be printed in the top lines of the training log files.
- If a seed works, fix it in the config: e.g. if 1763 is printed, set "self.seed=1763" and "self.gen_seed=False". A sketch of this workflow follows below.
- For example, to avoid performance gains caused by non-method changes, VLPD was trained on the same machine with the fixed seed 1337 (training logs can be downloaded from BaiduYun or GoogleDrive).
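A minimal sketch of the seed workflow, assuming config fields shaped like the gen_seed/seed options named above (the Config class itself is a placeholder, not the repo's config):

```python
import random
import numpy as np
import torch

class Config:  # placeholder standing in for the repo's config object
    gen_seed = False  # True: draw a fresh seed and print it to the log
    seed = 1337       # fixed seed once a working one is found

config = Config()

if config.gen_seed:
    config.seed = random.randint(0, 2**31 - 1)
    print(f"Generated seed: {config.seed}")  # appears at the top of the log

# Fix every random source so runs on the same machine are repeatable.
random.seed(config.seed)
np.random.seed(config.seed)
torch.manual_seed(config.seed)
torch.cuda.manual_seed_all(config.seed)
```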
- Evaluation Phase:
- To avoid adjusting "val_begin", save the checkpoints you want by adjusting "save_begin" and "save_end", as in the sketch below.
- Then evaluate them offline as described in Evaluation.md, instead of during training.
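A minimal sketch of saving a window of checkpoints for later offline evaluation; the model, directory and epoch values are placeholders, and only save_begin/save_end mirror the config options named above:

```python
import os
import torch
import torch.nn as nn

model = nn.Linear(8, 2)   # placeholder standing in for the actual detector
MODEL_DIR = "./output"    # placeholder output directory
os.makedirs(MODEL_DIR, exist_ok=True)

save_begin, save_end = 100, 150  # assumed epoch window from the config
num_epochs = 160

for epoch in range(1, num_epochs + 1):
    # ... one epoch of training would run here ...
    if save_begin <= epoch <= save_end:
        # Keep each epoch's weights; evaluate these files offline later
        # (see Evaluation.md) instead of validating during training.
        torch.save(model.state_dict(), f"{MODEL_DIR}/ckpt_epoch_{epoch}.pth")
```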
← Go back to README.md