Validation loss #1864
Conversation
We only want to enable grad if we are training.
… to calculate validation loss. Balances the influence of different time steps on training performance (without affecting actual training results)
Not at all - your code is excellent and very well structured. Thank you for taking the time to submit this PR.

However, I haven't run the code yet, so I apologize in advance if it actually works as-is.
I have updated all calls to
Thank you for the update. I'm sorry the code is complicated.
I'm currently checking the training script to see if it still works as before. There was one problem (in two places), so I would appreciate it if you could check it.
…plit check and warning
Divergence is the difference between training and validation loss; it gives the logs a clear value indicating how far apart the two are.
Add metadata recording for validation arguments. Add comments about the validation split for clarity of intention.
Added a divergence value for step and epoch, indicating the difference between training and validation. This makes the gap easier to see without relying on overlapping curves. Maybe a different term would be better, since "divergence" might suggest the model is moving apart and away from convergence. It might also be better to invert the sign so it matches the direction of the loss values. Fixed a number of issues with regularization image datasets and repeats. Fixed some issues with validate every n steps (which is important when using repeats and regularization images).
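(For context, a minimal sketch of how such a divergence value could be logged, assuming it is computed as validation loss minus training loss; the key name and sign convention here are illustrative, not the PR's actual choices.)

```python
# Illustrative only: one way a train/validation "divergence" could be logged.
# Assumes divergence = validation loss minus training loss, so a positive value
# means validation is doing worse than training; the comment above notes the
# sign could just as well be inverted.
def divergence_logs(train_loss: float, val_loss: float) -> dict:
    return {"loss/divergence": val_loss - train_loss}
```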
Bug: if text encoders are cached and validation is enabled, running `process_batch` during validation errors because `batch['input_ids_list']` is None.
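A hypothetical sketch of the kind of guard that avoids this failure mode; the function and batch key names are illustrative, not the actual sd-scripts code:

```python
# Hypothetical sketch of guarding against missing input_ids when text encoder
# outputs are cached; names are illustrative, not the actual sd-scripts layout.
def get_text_conds(batch, encode_fn):
    cached = batch.get("text_encoder_outputs_list")
    if cached is not None:
        # Text encoder outputs were precomputed, so the dataloader may
        # legitimately hand us batch["input_ids_list"] = None. Use the cache.
        return cached
    if batch.get("input_ids_list") is None:
        raise ValueError("Batch has neither cached text encoder outputs nor input_ids_list")
    return encode_fn(batch["input_ids_list"])
```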
Sorry for the delay. I've started testing. I've added a review of what I've noticed so far. Please check it out when you have time.
Shit, I should've looked here before going nuts on my own over the past two days: #1898. Looks like I'm only a little over a year late to the party. I've been working off of this since it was posted: https://github.com/spacepxl/demystifying-sd-finetuning

In my implementation of this, I refactor the training loop by extracting the loss function outside of it. Then within the loop, I run the calculations for test/validation loss with accumulation turned off, and finally run the loss calculation for the actual training sample and continue. Everything happens within a single global step. It's pretty straightforward and doesn't really take advantage of some of the nice PyTorch features, but it's also simple and gives a nice clean result.

It wasn't really clear from the code, but are you running the same noise/timesteps through the loss calculation each time for the validation set? My understanding is that ideally the same samples are used for the test and validation sets each time, and each sample gets the same noise and timesteps for each loss calculation, so that the loss is calculated consistently, with the model being the only variable that changes. To accomplish this, on the first iteration I create and record a state variable from the loss calculation for each test/validation sample, which captures the noise/timesteps/sigmas, and then replay these values for future loss calculations.
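A minimal sketch of that replay idea, assuming a PyTorch diffusion-style loop; the class, batch keys, and scheduler call here are assumptions for illustration, not this PR's code:

```python
import torch

# Illustrative sketch (not this PR's code): cache noise and timesteps per
# validation sample on the first pass, then replay them on later passes so
# the model weights are the only thing that changes between validation runs.
class ValidationNoiseCache:
    def __init__(self, num_train_timesteps: int = 1000):
        self.num_train_timesteps = num_train_timesteps
        self.cache = {}  # sample key -> (noise, timesteps)

    def get(self, key: str, latents: torch.Tensor):
        if key not in self.cache:
            noise = torch.randn_like(latents)
            timesteps = torch.randint(
                0, self.num_train_timesteps, (latents.shape[0],), device=latents.device
            )
            # store on CPU so the cache does not hold GPU memory
            self.cache[key] = (noise.cpu(), timesteps.cpu())
        noise, timesteps = self.cache[key]
        return noise.to(latents.device), timesteps.to(latents.device)

# usage inside a validation loop (keys and batch layout are assumptions):
# noise, timesteps = val_cache.get(batch["image_key"], latents)
# noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
```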
There was some code in previous versions to allow a distribution of timesteps to be set (instead of random), which I think is aligned with what you're suggesting. It could be a single timestep, or we can cache a random timestep and use the same one. This was removed to allow us to merge this PR; we can approach it in a new PR, as it would involve some different systems.

I think having more static or stable options might be a good idea for limited datasets, because the variability of a limited dataset might skew the results. A validation dataset may have only a few items, like 2-10, so having it pick a couple of poor timesteps could cause the loss calculation to not highlight the right things. The current idea is to take a distribution of timesteps and average them, like 50, 250, 500, 600, 900. This could also be applied to regular training to smooth out the variance from random timesteps, but with a significantly increased training time. Stabilizing the noise of the initial latents may be worth experimenting with, but storing those latents could cause training to take up more memory, scaling with the validation size.

We also need to consider how limited the datasets we're working with are, training time, and what validation is doing for us in training. As the datasets get larger this variance becomes less of a concern (more examples), and smoothing out the loss for the charts might be enough to highlight it. For loss we also average over the other steps in an epoch, which smooths out the "current" loss, and the same for validation, but because of the limited datasets I think having more samples from timesteps can help smooth it out with less data.
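As an illustration of the averaging idea above (a sketch only, assuming an epsilon-prediction objective and a diffusers-style `add_noise()`; not this PR's implementation):

```python
import torch
import torch.nn.functional as F

# Sketch only: average validation loss over a fixed set of timesteps instead
# of a random draw. Assumes an epsilon-prediction model and a scheduler with
# add_noise(); the model call signature is a placeholder.
FIXED_TIMESTEPS = [50, 250, 500, 600, 900]

@torch.no_grad()
def fixed_timestep_val_loss(model, noise_scheduler, latents, cond, noise):
    losses = []
    for t in FIXED_TIMESTEPS:
        timesteps = torch.full((latents.shape[0],), t, device=latents.device, dtype=torch.long)
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        pred = model(noisy_latents, timesteps, cond)
        losses.append(F.mse_loss(pred, noise))
    return torch.stack(losses).mean()
```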
Why not just reset the seed for each validation run so that the same noise and timesteps are used each time in the same way? That has all the benefits of consistent validation loss reporting with no memory overhead.

To be honest, it doesn't really matter too much which timesteps you pick, as long as they and the noise stay the same for each sample, to isolate the impact of the model updates. If you look at the images in my PR, you can see how smooth the loss looks when you do this with just randomly selecting 2-4 timesteps and noise samples per batch.

I would say that while my PR is not as polished, it touches a lot less of the codebase, just latching on to the training loop, so it may be more compatible while still achieving a very stable test and validation loss curve.
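For reference, a minimal sketch of the reseeding suggestion, assuming PyTorch is the only source of randomness in the validation path; this is not code from either PR:

```python
import torch

# Sketch: draw validation noise and timesteps from a dedicated generator that
# is re-seeded before every validation run, so each run sees identical
# randomness with no extra memory for cached tensors.
def make_val_generator(validation_seed: int) -> torch.Generator:
    gen = torch.Generator()  # CPU generator; tensors are moved to device afterwards
    gen.manual_seed(validation_seed)
    return gen

# at the start of each validation pass:
# gen = make_val_generator(args.validation_seed)
# noise = torch.randn(latents.shape, generator=gen).to(latents.device)
# timesteps = torch.randint(0, 1000, (latents.shape[0],), generator=gen).to(latents.device)
```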
This PR was about to be merged. I wish I had worked on it a little sooner... It shouldn't be too difficult to use the same timesteps for each validation. We may consider addressing this in a separate PR in the future.
Thank you again for this great work!
Is this only implemented for LoRA training?
Currently only available for LoRA training.
Related #1856 #1858 #1165 #914
Original implementation by @rockerBOO
Timestep validation implementation by @gesen2egee
Updated implementation for sd3/flux by @hinablue
I went through and tried to merge the different PRs together. I probably messed up some things in the process.

One thing I wanted to note is that `process_batch` was made to limit duplication of the code for validation and training, to keep them consistent. I implemented the timestep processing so it could work for both. I noted that it was using only debiased_estimation in the other PRs, but I didn't know why it was like that.

I did not update `train_db.py` appropriately to match my goal of a unified `process_batch`, as I do not have a good way to test them. I will try to get them into an acceptable state and we can refine it.

I'm posting this a little early so others can view and give me feedback. I am still working on some issues with the code, so let me know before you dive in to fix anything. Open to commits to this PR, which can be posted to this branch on my fork.
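To make the intent of a shared path concrete, here is a rough, hypothetical sketch; this is not the actual `process_batch` signature from the PR, just an illustration of training and validation flowing through one function:

```python
import torch
import torch.nn.functional as F

# Hypothetical illustration of a single code path shared by training and
# validation; batch keys, model call, and defaults are placeholders.
def process_batch(batch, model, noise_scheduler, is_train: bool, timesteps=None):
    latents, cond = batch["latents"], batch["cond"]
    # Only enable grad when training (see the review note near the top).
    with torch.set_grad_enabled(is_train):
        noise = torch.randn_like(latents)
        if timesteps is None:
            timesteps = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        pred = model(noisy_latents, timesteps, cond)
        loss = F.mse_loss(pred, noise)
    return loss
```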
Testing
- `--network_train_text_encoder_only`
- `--network_train_unet_only`
Parameters
Validation dataset is for dreambooth datasets (text/image pairs) and will split the dataset into 2 parts, train_dataset and validation_dataset depending on the split.
- `--validation_seed` : Validation seed for shuffling the validation dataset; the training `--seed` is used otherwise / 検証データセットをシャッフルするための検証シード、それ以外の場合はトレーニングの `--seed` を使用する
- `--validation_split` : Split for validation images out of the training dataset / 学習画像から検証画像に分割する割合
- `--validate_every_n_steps` : Run validation on the validation dataset every N steps. By default, validation will only occur every epoch if a validation dataset is available / 検証データセットの検証をNステップごとに実行します。デフォルトでは、検証データセットが利用可能な場合にのみ、検証はエポックごとに実行されます
- `--validate_every_n_epochs` : Run validation every N epochs. By default, validation will run every epoch if a validation dataset is available / 検証データセットをNエポックごとに実行します。デフォルトでは、検証データセットが利用可能な場合、検証はエポックごとに実行されます
- `--max_validation_steps` : Max number of validation dataset items processed. By default, validation will run over the entire validation dataset / 処理される検証データセット項目の最大数。デフォルトでは、検証は検証データセット全体を実行します

`validation_seed` and `validation_split` can be set inside `dataset_config.toml`.

I'm open to feedback about this approach and whether anything needs to be fixed in the code to be accurate.
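As a reference for the split behavior described under Parameters, a minimal sketch of a seeded, deterministic train/validation split (illustrative only; the PR's actual split logic lives in the dataset code):

```python
import random

# Illustrative sketch: deterministically split image/caption items into
# train and validation subsets using validation_split and validation_seed.
def split_train_val(items, validation_split: float, validation_seed: int):
    if validation_split <= 0.0:
        return items, []
    indices = list(range(len(items)))
    random.Random(validation_seed).shuffle(indices)
    n_val = int(len(items) * validation_split)
    val_idx = set(indices[:n_val])
    train = [it for i, it in enumerate(items) if i not in val_idx]
    val = [it for i, it in enumerate(items) if i in val_idx]
    return train, val

# e.g. train_items, val_items = split_train_val(all_items, validation_split=0.1, validation_seed=47)
```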