during-training valid loss is wrong #444
Comments
Pretty sure that this is indeed a bug, but also validation batches in the …
What would you say the best behaviour should be? The user can supply a list of weights for each head, defaulting to 0 for the mp head and 1 for the Default head? Given that the main impact this has is which checkpoint (and model) to save.
I think the validation losses and RMSEs should be printed separately for each head.
They already are.
This has nothing to do with weights; that's a separate issue. The code claims to print a loss for each head, but actually prints the cumulative loss it is computing as it calculates the total loss by looping over heads. That's all. I'll do a PR for this issue, now that it seems pretty clear (from Slack) that the validation loss not being deterministic is a separate bug.
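To make the described behaviour concrete, here is a minimal, self-contained sketch (illustrative only; the loop structure, logger calls, head names, and numbers are assumptions, not the actual train.py code) of printing the running total instead of the per-head value:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Pretend per-head validation losses coming out of the evaluation loop
# (made-up numbers; head names follow the ones mentioned in this thread).
head_losses = {"mp": 0.10, "Default": 0.25}

valid_loss = 0.0
for head, valid_loss_head in head_losses.items():
    valid_loss += valid_loss_head  # running total over heads, used for the total loss

    # The reported bug: the running total is what gets printed for each head ...
    logger.info("head=%s cumulative_loss_printed=%.3f", head, valid_loss)
    # ... while the per-head value is what should appear in the log:
    logger.info("head=%s per_head_loss=%.3f", head, valid_loss_head)
```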
The point is that checkpoints only get saved if the loss decreases. Now that we have multiple heads, and therefore multiple validation losses, how do we decide when to save a checkpoint? My suggestion was having a …
A fine suggestion, but independent of this issue. I agree that the way the "total" loss, which is used to save checkpoints, is calculated could use further thought, and I don't even mind making that part of the PR I created for this issue. But the issue was really only about how [edited] @LarsSchaaf I think you should perhaps open a new issue, an enhancement request, to make the logic for saving checkpoints based on loss less naive.
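As a rough illustration of the enhancement being discussed (hypothetical code, not the actual MACE implementation; the head_weights dict, head names, and numbers are placeholders, and save_checkpoint stands in for whatever the real checkpoint call is), the checkpoint decision could compare a user-weighted combination of the per-head validation losses against the best value so far:

```python
# Hypothetical sketch of a weighted checkpoint criterion (placeholder names and values).
head_weights = {"mp": 0.0, "Default": 1.0}        # user-supplied weights per head
per_head_losses = {"mp": 0.10, "Default": 0.25}   # per-head validation losses this epoch

best_combined = float("inf")  # would persist across epochs in real training

combined = sum(
    head_weights.get(head, 1.0) * loss for head, loss in per_head_losses.items()
)
if combined < best_combined:
    best_combined = combined
    # save_checkpoint(model, optimizer)  # placeholder for the real checkpoint call
    print(f"saving checkpoint at combined weighted loss {combined:.3f}")
```

Defaulting every weight to 1 would reproduce the current plain-sum behaviour.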
Just to be clear, the user can already provide weights for different heads.
Are those used when printing the validation loss? Or only when computing the gradient of the training loss w.r.t. shared parameters?
Currently they would also be used for printing.
This issue was supposed to be closed by #449, but seems to still be open. Do we want to continue this discussion here, or close it and open a new one having to do with weights?
Looks to me like the validation loss in the log during fitting is actually the sum over all heads so far:
mace/mace/tools/train.py, lines 217 to 233 at e4ac498
Is the solution just that the quantity that should be passed is valid_loss_head?
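As a concrete (made-up) illustration of the symptom: with two heads whose per-head validation losses are 0.10 and 0.25, the current log would report 0.10 for the first head and 0.35 (the running sum) for the second, whereas the second line should read 0.25, i.e. the valid_loss_head value for that head.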