loss render is NaN: the following problem occurred. I did not modify the training parameters. Are the training parameters wrong? #5

Open · manjidada opened this issue Jul 11, 2023 · 4 comments

@manjidada

Traceback (most recent call last):
  File "train.py", line 35, in <module>
    main()
  File "train.py", line 32, in main
    m.train(opt)
  File "G:\work_document\python_work\L2G-NeRF-main\model\nerf.py", line 61, in train
    if self.it%opt.freq.val==0: self.validate(opt,self.it)
  File "G:\work_document\Tensorflow\Miniconda3\envs\L2G-NeRF\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "G:\work_document\python_work\L2G-NeRF-main\model\l2g_nerf.py", line 89, in validate
    super().validate(opt,ep=ep)
  File "G:\work_document\Tensorflow\Miniconda3\envs\L2G-NeRF\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "G:\work_document\python_work\L2G-NeRF-main\model\base.py", line 154, in validate
    loss = self.summarize_loss(opt,var,loss)
  File "G:\work_document\python_work\L2G-NeRF-main\model\base.py", line 139, in summarize_loss
    assert not torch.isnan(loss[key]),"loss {} is NaN".format(key)
AssertionError: loss render is NaN
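
A hedged debugging aid (not from the repo): enabling PyTorch anomaly detection makes backward() raise at the operation that first produced the NaN, rather than failing later at the assert in summarize_loss. The snippet below is a self-contained illustration, not L2G-NeRF code:

import torch

torch.autograd.set_detect_anomaly(True)  # debug only; slows training noticeably

# Minimal demonstration that the anomaly trace names the offending op:
x = torch.tensor([-1.0], requires_grad=True)
loss = torch.sqrt(x).sum()  # forward value is already NaN
loss.backward()             # raises: "Function 'SqrtBackward0' returned nan values ..."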

@manjidada (Author)

python train.py --model=l2g_nerf --yaml=l2g_nerf_blender --group=exp_synthetic --name=l2g_lego --data.scene=lego --data.root=./data/blender/nerf_synthetic --camera.noise_r=0.07 --camera.noise_t=0.5
Process ID: 21456
[train.py] (PyTorch code for training NeRF/BARF/L2G_NeRF)
setting configurations...
loading options/base.yaml...
loading options/nerf_blender.yaml...
loading options/barf_blender.yaml...
loading options/l2g_nerf_blender.yaml...

  • H: 400
  • W: 400
  • arch:
    • density_activ: softplus
    • embedding_dim: 128
    • layers_feat: [None, 256, 256, 256, 256, 256, 256, 256, 256]
    • layers_rgb: [None, 128, 3]
    • layers_warp: [None, 256, 256, 256, 256, 256, 256, 6]
    • posenc:
      • L_3D: 10
      • L_view: 4
    • skip: [4]
    • skip_warp: [4]
    • tf_init: True
  • barf_c2f: [0.1, 0.5]
  • batch_size: None
  • camera:
    • model: perspective
    • ndc: False
    • noise: True
    • noise_r: 0.07
    • noise_t: 0.5
  • cpu: False
  • data:
    • augment:
    • bgcolor: 1
    • center_crop: None
    • dataset: blender
    • image_size: [400, 400]
    • num_workers: 4
    • preload: True
    • root: ./data/blender/nerf_synthetic
    • scene: lego
    • train_sub: None
    • val_on_test: False
    • val_sub: 4
  • device: cuda:0
  • error_map_size: None
  • freq:
    • ckpt: 5000
    • scalar: 200
    • val: 2000
    • vis: 1000
  • gpu: 0
  • group: exp_synthetic
  • load: None
  • loss_weight:
    • global_alignment: 2
    • render: 0
    • render_fine: None
  • max_epoch: None
  • max_iter: 200000
  • model: l2g_nerf
  • name: l2g_lego
  • nerf:
    • density_noise_reg: None
    • depth:
      • param: metric
      • range: [2, 6]
    • fine_sampling: False
    • rand_rays: 1024
    • sample_intvs: 128
    • sample_intvs_fine: None
    • sample_stratified: True
    • setbg_opaque: False
    • view_dep: True
  • optim:
    • algo: Adam
    • lr: 0.0005
    • lr_end: 0.0001
    • lr_pose: 0.001
    • lr_pose_end: 1e-08
    • sched:
      • gamma: None
      • type: ExponentialLR
    • sched_pose:
      • gamma: None
      • type: ExponentialLR
    • test_iter: 100
    • test_photo: True
    • warmup_pose: None
  • output_path: output/exp_synthetic/l2g_lego
  • output_root: output
  • resume: False
  • seed: 0
  • tb:
    • num_images: [4, 8]
  • trimesh:
    • chunk_size: 16384
    • range: [-1.2, 1.2]
    • res: 128
    • thres: 25.0
  • visdom:
    • cam_depth: 0.5
    • port: 8600
    • server: localhost
  • yaml: l2g_nerf_blender
    existing options file found (identical)
    Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
    Loading model from: G:\work_document\Tensorflow\Miniconda3\envs\L2G-NeRF\lib\site-packages\lpips\weights\v0.1\alex.pth
    loading training data...
    number of samples: 100
    loading test data...
    number of samples: 4
    building networks...
    setting up optimizers...
    initializing weights from scratch...
    setting up visualizers...
    visdom port (8600) not open, retry? (y/n) n
    Setting up a new session...

@rover-xingyu (Owner)

Sorry, we cannot reproduce the issue. You could change the random seed to see if it happens again.
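
For reference, the seed appears as a top-level option in the dump above (seed: 0), so it should be overridable in the same dotted-flag style as the original command; --seed=1 below is just an illustrative value:

python train.py --model=l2g_nerf --yaml=l2g_nerf_blender --group=exp_synthetic --name=l2g_lego --data.scene=lego --data.root=./data/blender/nerf_synthetic --camera.noise_r=0.07 --camera.noise_t=0.5 --seed=1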

@Chaphlagical

> Sorry, we cannot reproduce the issue. You could change the random seed to see if it happens again.

@manjidada I got the same issue on the blender dataset. It seems the depth range becomes 0 when entering the iteration loop, as above, which is weird.

Therefore, I force the depth range to be a scalar instead of a torch tensor during validation, and it works for me. Like:

class Graph(nerf.Graph):
    ...
    def forward(self, opt, var, mode=None):
        # rescale the size of the scene conditioned on the optimized poses
        if opt.data.dataset == "blender":
            depth_min, depth_max = opt.nerf.depth.range
            position = camera.Pose().invert(
                self.optimised_training_poses.weight.data.detach().clone().view(-1, 3, 4))[..., -1]
            # scene "diameter": maximum pairwise distance between camera centers
            diameter = ((position[self.idx_grid[..., 0]] -
                        position[self.idx_grid[..., 1]]).norm(dim=-1)).max()
            depth_min_new = (depth_min/(depth_max+depth_min))*diameter
            depth_max_new = (depth_max/(depth_max+depth_min))*diameter
            if mode in ["train"]:
                opt.nerf.depth.range = [
                    depth_min_new, depth_max_new]
            else:
                # force scalars so no tensor leaks into validation
                opt.nerf.depth.range = [
                    depth_min_new.item(), depth_max_new.item()]
        ...
    ...

    @torch.no_grad()
    def validate(self, opt, ep=None):
        pose, pose_GT = self.get_all_training_poses(opt)
        _, self.graph.sim3 = self.prealign_cameras(opt, pose, pose_GT)
        # force scalar
        if torch.is_tensor(opt.nerf.depth.range[0]):
            opt.nerf.depth.range[0] = opt.nerf.depth.range[0].item()
        if torch.is_tensor(opt.nerf.depth.range[1]):
            opt.nerf.depth.range[1] = opt.nerf.depth.range[1].item()
        super().validate(opt, ep=ep)
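
For context, a minimal sketch of the pitfall this works around (stand-in names, not repo code): opt.nerf.depth.range is a plain Python list, so a tensor written into it during training persists into later calls unless converted back with .item():

import torch

depth_range = [2, 6]                      # stands in for opt.nerf.depth.range
depth_range[0] = torch.tensor(2.0) * 0.5  # train-time rescale stores a tensor
print(torch.is_tensor(depth_range[0]))    # True: the tensor would leak into validation
depth_range[0] = depth_range[0].item()    # .item() restores a plain Python float
print(depth_range)                        # [1.0, 6]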

Hoping for an official solution.

@rover-xingyu (Owner)

Thanks for pointing this out. I rescale the size of the blender objects (near/far) conditioned on the optimized poses, as shown here. I guess the depth range becoming 0 is caused by the diameter becoming 0, but that is weird, as the diameter is determined by the maximum distance between two cameras. Does anyone have any idea?
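
A hedged diagnostic sketch (assumed names, not repo code) for checking that hypothesis: compute the maximum pairwise distance between the optimized camera centers right before the failure and see whether it collapses to zero:

import torch

def scene_diameter(position):
    # position: [N, 3] camera centers; returns the max pairwise distance
    diffs = position[:, None, :] - position[None, :, :]  # [N, N, 3]
    return diffs.norm(dim=-1).max()

# If this prints ~0 just before the NaN appears, the optimized camera centers
# have collapsed to a point (or the pose buffer was read before being filled).
# print(scene_diameter(position))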
