Here you will find detailed information on how to train different kinds of models:
In some cases, like SRGAN and ESRGAN, the recommendation is to use a PSNR-oriented pretrained SR model to initialize the parameters for better quality. According to the SRGAN author's paper and some testing, this also stabilizes the GAN training and allows for faster convergence, but it may not be necessary in all cases. As an example with ESRGAN, these could be the steps to follow:
- Prepare datasets, usually the DIV2K dataset. More details are in `codes/data`.
- Optional: If the intention is to replicate the original paper, here you would prepare the PSNR-oriented pretrained model. You can also use the original `RRDB_PSNR_x4.pth` as the pretrained model for that purpose, otherwise any existing model will work as a pretrained model.
- Modify one of the configuration template files, for example `options/sr/train_sr.yml` or `options/sr/train_sr.json`. Note that the `crop_size` variable in the case of Super-Resolution refers to the crop size of the target images. For example, in a 4x SR case, a `crop_size` of 128 pixels means an LR crop size of `128/4 = 32` pixels.
- Run command: `python train.py -opt options/sr/train_sr.yml` or `python train.py -opt options/sr/train_sr.json`
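For reference, below is a minimal sketch of the options discussed in these steps. The `scale` key and the `datasets`/`path` nesting are assumptions based on the usual template layout, so check `options/sr/train_sr.yml` for the exact structure:

```yaml
# minimal sketch, not a complete options file
scale: 4             # assumed key for the upscaling factor
datasets:
  train:
    crop_size: 128   # HR crop size; the LR crop will be 128/4 = 32 pixels
path:
  pretrain_model_G: RRDB_PSNR_x4.pth  # optional PSNR-oriented pretrained model
```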
Note that while you can train PPON using the regular `train.py` file and the same steps as other SR models, these additional options have to be set in the training options file (using example values):
Select the `ppon` model type:

```yaml
model: ppon
```

Set the `ppon` Generator network:

```yaml
which_model_G: ppon
mode: CNA
nf: 64
nb: 24
in_nc: 3
out_nc: 3
group: 1
```
You first need to configure the losses (type, weights, etc.) as you would normally:

```yaml
pixel_criterion: l1
pixel_weight: 1
feature_criterion: l1
feature_weight: 1
ssim_type: ms-ssim
ssim_weight: 1e-2
ms_criterion: multiscale-l1
ms_weight: 1e-2
gan_type: vanilla
gan_weight: 8e-3
```
And then pick which of the configured losses will be used for each stage. The names are matched against the loss names as they are logged during training, so `pixel_criterion` corresponds to `pix`, `feature_criterion` to `fea` and `cx_type: contextual` to `contextual`, for example:

```yaml
p1_losses: [pix] # from the paper: l1 pixel_weight: 1
p2_losses: [pix-multiscale, ms-ssim] # from the paper: multiscale_weight: 1, ms-ssim_weight: 1
p3_losses: [fea] # from the paper: VGG feature_weight: 1, gan_weight: 0.005
ppon_stages: [1000, 2000] # the first value is where phase 2 (structure) starts and the second is where phase 3 (features) starts
```
The same losses can be used in multiple stages (they can be repeated). Take into consideration that the first stage is the one with most of the network's capacity, and the other two stages depend on it.
The Discriminator is enabled only on the last phase at the moment, following the paper. You can configure any of the losses in any of the phases; I recommend testing "contextual" (`cx`) if possible, especially on phases 2 and 3.
You may also want to adjust your scheduler and coordinate the `ppon_stages` to match its steps; the original paper used "StepLR_Restart".
Lastly, you can control which phase you want to train with the `ppon_stages` option. For example, if you set it to `[0, 0]` it will start on phase 3 from the beginning of the training session, while `[0, 1000000]` will start in phase 2, and phase 3 will begin after 1000000 iterations. Similarly, with `[1000000, 1000000]`, only phase 1 will be trained for 1000000 iterations.
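For clarity, those `ppon_stages` variants translate to the following option lines (values taken directly from the examples above):

```yaml
# train phase 3 from the start of the session
ppon_stages: [0, 0]

# start in phase 2, switch to phase 3 after 1000000 iterations
# ppon_stages: [0, 1000000]

# train only phase 1 for 1000000 iterations
# ppon_stages: [1000000, 1000000]
```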
SRFlow allows the use of any differentiable architecture for the LR encoding network, since it does not need to be invertible. By default, SRFlow uses an RRDB (ESRGAN) network for this purpose. In the original work, a pretrained ESRGAN model is loaded and, according to the paper, the remaining flow network is trained for half the training time, with the RRDB module only unfrozen after that period. The option `train_RRDB_delay: 0.5` does that automatically, but you can lower it to start earlier if required. Besides these main differences, the training process is similar to other SR networks.
- Prepare datasets, usually the DIV2K dataset. More details are in `codes/data`.
- Optional: If the intention is to replicate the original paper, here you would use an ESRGAN pretrained model; the original paper used the ESRGAN modified architecture model for this purpose. You can also use the original `RRDB_PSNR_x4.pth` as the pretrained model, otherwise any existing model will work. In `options/srflow/train_srflow.yml` set `path.pretrain_model_G: RRDB_ESRGAN_x4_mod_arch.pth` (or any ESRGAN model) and `path.load_submodule: true` for this purpose. If using an SRFlow model as pretrained, only setting `pretrain_model_G` is required.
- Modify the configuration file, `options/srflow/train_srflow.yml`, as needed.
- Run command: `python train.py -opt options/srflow/train_srflow.yml`
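As a reference, the relevant options could look like the sketch below. The exact placement of `train_RRDB_delay` in the file is an assumption, so verify it against the template:

```yaml
# sketch of the SRFlow-specific options; see options/srflow/train_srflow.yml
path:
  pretrain_model_G: RRDB_ESRGAN_x4_mod_arch.pth  # or any ESRGAN model
  load_submodule: true    # only needed when loading an ESRGAN model into the RRDB submodule
train_RRDB_delay: 0.5     # unfreeze the RRDB module after 50% of the training time (placement assumed)
```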
Notes:
- While SRFlow only needs the NLL (negative log-likelihood) loss to train, it is possible to add any of the losses (except GAN) from the regular training template and they will work. They will operate on the deterministic version of the super-resolved image with temperature τ = 0.
- SRFlow is more memory intensive than ESRGAN, especially if using the regular losses that need to calculate the reconstructed SR image from the latent space `z` (with `reverse=True`).
- To remain stable, SRFlow needs a large batch size; `batch=1` produces NaN results. If real batch sizes > 1 are not possible on the hardware, using a virtual batch can solve this stability issue.
- During validation and inference it's known that reconstructed images can output NaN values, which are reduced with more training. More details are discussed here.
- During validation, as many images as set in `heats: [0.0, 0.5, 0.75, 1.0]` times `n_sample: 3` will be generated. This example means 3 random samples for each of the heat values configured there, 12 images in total for each validation image.
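In the options file, those validation sampling settings would look like this sketch (the nesting under a validation section is an assumption; check the template for the exact location):

```yaml
# validation sampling options (example values from above)
heats: [0.0, 0.5, 0.75, 1.0]  # temperature values to sample from
n_sample: 3                   # samples per heat value: 4 heats x 3 samples = 12 images per validation image
```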
Note: these are the instructions from the original repository and the whole process is in need of being updated, but it should work if you want to experiment.
Pretraining is also important. Use a PSNR-oriented pretrained SR model (trained on DIV2K) to initialize the SFTGAN model.
- First prepare the segmentation probability maps for training data: run `test_seg.py`. We provide a pretrained segmentation model for 7 outdoor categories in Pretrained models. Use Xiaoxiao Li's codes to train the segmentation model and transfer it to a PyTorch model.
- Put the images and segmentation probability maps in a folder as described in `codes/data`.
- Transfer the pretrained model parameters to the SFTGAN model.
  - First train with `debug` mode and obtain a saved model.
  - Run `transfer_params_sft.py` to initialize the model.
  - We provide an initialized model named `sft_net_ini.pth` in Pretrained models.
- Modify the configuration file in `options/sr/train_sftgan.json`.
- Run command: `python train.py -opt options/sr/train_sftgan.json`
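A minimal sketch of the generator path setting for this case, assuming the SFTGAN template follows the same `path`/`pretrain_model_G` pattern as the other options files (shown as YAML for brevity, while this template itself is JSON):

```yaml
# sketch only; adapt to the JSON layout of options/sr/train_sftgan.json
path:
  pretrain_model_G: sft_net_ini.pth  # the initialized model from transfer_params_sft.py (or the provided one)
```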
Restoration models (deblurring, denoising, etc.) are fundamentally the same as Super-Resolution models, with the difference that they usually operate without scaling the images (1x scale). The steps to train them are the same as for super-resolution models; just make sure that the network supports operating at 1x scale.
Super-Resolution and restoration are tasks that can be done simultaneously, in which case the low-quality input data is not only a scaled-down version of the high-quality target, but also contains one or more degradations.
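A minimal sketch of a 1x restoration setup, assuming the template exposes a `scale` option and the same `crop_size` key as the SR templates (both assumptions):

```yaml
# 1x restoration sketch; option names assumed to match the SR templates
scale: 1           # no upscaling: input and target have the same size
crop_size: 128     # at 1x scale, the input crop is also 128 pixels
```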
Images can be resized and cropped in different ways using the `preprocess` option:

- The default option `resize_and_crop` resizes the image to be of size (`load_size`, `load_size`) and does a random crop of size (`crop_size`, `crop_size`).
- `crop` skips the resizing step and only performs random cropping, in the same way as the SR cases.
- `center_crop` will always do the same center crop of size (`center_crop_size`, `center_crop_size`) on all images.
- `scale_width` resizes the image to have width `crop_size` while keeping the aspect ratio.
- `scale_width_and_crop` first resizes the image to have width `load_size` and then does random cropping of size (`crop_size`, `crop_size`).
- `none` tries to skip all these preprocessing steps. However, if the image size is not a multiple of some number that depends on the number of downsamplings of the generator, you will get an error because the size of the output image may be different from the size of the input image. Therefore, the `none` option still tries to adjust the image size to be a multiple of 4. You might need a bigger adjustment if you change the generator architecture. Please see `dataops/augmentations.py` to see how all of these were implemented.
- Note: options can be concatenated using `_and_`, like `center_crop_and_resize` or `resize_and_crop`.
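For example, the default behavior corresponds to options like the following (the specific sizes are illustrative, not defaults from this repository):

```yaml
# illustrative preprocessing configuration
preprocess: resize_and_crop
load_size: 286   # resize to 286x286 first (example value)
crop_size: 256   # then take a random 256x256 crop (example value)
```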
Since the generator architecture in CycleGAN involves a series of downsampling / upsampling operations, the size of the input and output image may not match if the input image size is not a multiple of 4. As a result, you may get a runtime error because the L1 identity loss cannot be enforced with images of different size. Therefore, we slightly resize the image to become a multiple of 4 even with the `preprocess: none` option, as explained above. For the same reason, `crop_size` needs to be a multiple of 4.
CycleGAN is quite memory-intensive, as four networks (two generators and two discriminators) need to be loaded on one GPU, so a large image cannot be entirely loaded. In this case, we recommend training with cropped images. For example, to generate 1024px results, you can train with `preprocess: scale_width_and_crop`, `load_size: 1024`, `crop_size: 360`, and test with `preprocess: scale_width`, `load_size: 1024`. This makes sure the training and test will be at the same scale. At test time, you can afford higher resolution because you don't need to load all the networks.
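In options-file form, that example would look roughly like this (training and test configurations shown together for comparison):

```yaml
# training configuration (from the 1024px example above)
preprocess: scale_width_and_crop
load_size: 1024
crop_size: 360

# test configuration
# preprocess: scale_width
# load_size: 1024
```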
Both pix2pix and CycleGAN can work with rectangular images. To make them work, you need to use different preprocessing flags. Let's say you are working with `360x256` images. During training, you can specify `preprocess: crop` and `crop_size: 256`. This will allow your model to be trained on randomly cropped `256x256` patches. During test time, you can apply the model on `360x256` images with the flag `preprocess: none`.
There are practical restrictions regarding image sizes for each generator architecture. For `unet256`, it only supports images whose width and height are divisible by 256. For `unet128`, the width and height need to be divisible by 128. For `resnet_6blocks` and `resnet_9blocks`, the width and height need to be divisible by 4.
WBC also uses a UNet architecture (`wbcunet`) and, while it's different from CycleGAN's, it also expects images of size `256x256`.
For all experiments in the original pix2pix and CycleGAN papers, the batch size was set to 1. If there is room in memory, you can use a higher batch size with batch norm or instance norm. (Note that the default `batchnorm` does not work well with multi-GPU training; you may consider using synchronized batchnorm instead.) But please be aware that it can impact the training. In particular, even with Instance Normalization, different batch sizes can lead to different results. Moreover, increasing `crop_size` may be a good alternative to increasing the `batch_size`.
For WBC the batch size was set to 16, and some adjustments to the losses may be necessary if it is modified.
The identity loss can regularize the generator to be close to an identity mapping when fed with real samples from the target domain: if something already looks like it belongs to the target domain, the generator should preserve the image without making additional changes. The generator trained with this loss will often be more conservative with unknown content, meaning that if you want to give the model more liberty to change the images, the identity loss should be disabled.
This Distill blog discussed one of the potential causes of checkerboard artifacts. You can fix that issue by switching from a `deconv` ("deconvolution") to an `upconv` (regular upsampling followed by regular convolution). Currently the network parameters for both pix2pix and CycleGAN allow you to make this change (using the `upsample_mode` option), but here's an alternative reference implementation using `ReflectionPad2d`:
```python
# bilinear upsampling followed by a padded 3x3 convolution (upconv),
# replacing a strided transposed convolution (deconv)
nn.Upsample(scale_factor=2, mode='bilinear'),
nn.ReflectionPad2d(1),
nn.Conv2d(ngf * mult, int(ngf * mult / 2), kernel_size=3, stride=1, padding=0),
```
Sometimes the checkerboard artifacts will go away if you train long enough, so you can also try training your model a bit longer.
The training process for White-box Cartoonization expects one input dataset (real photos) and one target dataset (cartoons/anime). In both cases, one directory containing `landscape` (or scenery) images and another with `faces` are used and, by default, the code will automatically sample landscape images with a probability of 4/5 and faces with a probability of 1/5 when using the `concat_unaligned` dataset mode. The distribution weights can be changed with the `sampler_weights` option and, if only one directory is to be used (for example, only landscape), then the regular `unaligned` dataset mode can be used instead.
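A sketch of those dataset options is shown below. The exact key name for the mode and the `[landscape, faces]` weight order are assumptions, so confirm them against the WBC template:

```yaml
# WBC dataset sketch (key names and weight order assumed)
dataset_mode: concat_unaligned   # or 'unaligned' when using a single directory
sampler_weights: [4, 1]          # assumed format: landscape vs. faces sampling weights (4/5 and 1/5)
```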
Besides the regular loss weights from other models, due to the multiple representations used in WBC, it is also possible to select which of the configured losses will be used for each representation, as well as an independent scale that each representation will have during training. This allows for very fine-grained configuration flexibility when training custom models.
Each representation uses different image pairs to calculate the respective losses, namely:

- Surface: the edge-preserving filtered model output vs. the edge-preserving filtered target cartoon image.
- Texture: the random grayscale model output vs. the random grayscale target cartoon image.
- Structure: the model output vs. the superpixel segmented model output.
- Content: the input photo image vs. the model output.
- Regularization: if using only the TV regularization, then only the model output; otherwise it will calculate losses between the model output and the target cartoon images (for use with losses that can operate with unaligned images, like Contextual Loss).
The representation losses are configured in the options file with:

- `surf_losses`
- `text_losses`
- `struct_losses`
- `cont_losses`
- `reg_losses`
- `idt_losses`

And similarly, the representation scales with:

- `surface_scale`
- `texture_scale`
- `struct_scale`
- `content_scale`
- `reg_scale`
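As an illustration only, such a configuration could look like the sketch below. The loss-name tokens follow the convention described for PPON above (`pix`, `fea`, `contextual`), but the specific combinations and values here are assumptions, not values from the paper:

```yaml
# illustrative WBC representation configuration (loss combinations and scales are assumptions)
surf_losses: [pix]
text_losses: [pix]
struct_losses: [contextual]
cont_losses: [fea]
reg_losses: [tv]          # assumed token for the TV regularization

surface_scale: 1
texture_scale: 1
struct_scale: 1
content_scale: 1
reg_scale: 1
```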
Depending on each case you may want to tweak the balance of the losses, weights and scales to achieve the desired results. Note that when starting training, even if a pretrained model is used, the cartoonization effect may not be apparent while the discriminators are being trained, so a safer strategy is to start training with the default configuration for about 5000 iterations before tweaking the balance, so the effects can be properly evaluated.
Something important to keep in mind is that the Structure representation calculates the superpixel targets from the model outputs, so it can potentially create a feedback loop if its scale is too high in comparison to the Content representation (which maintains the semantic consistency with the input); the two have to be balanced.
A valid alternative to tweaking every representation is to train multiple models from the same pretrained model, each focused on one representation, and then interpolate the resulting models to fine-tune the results. The guided filter used during inference also provides additional control over the details that are preserved in the final results.
TBD
TBD
When resuming training, just set the `resume_state` option in the configuration file under `path`, like: `resume_state: "../experiments/debug_001_RRDB_PSNR_x4_DIV2K/training_state/200.state"`.
To fine-tune a pretrained model, just set the path in `pretrain_model_G` to your pretrained generator model (`pretrain_model_G_A` and `pretrain_model_G_B` in the case of CycleGAN) and it will train with your current configuration. The program will initialize the training from iteration 0.
You can also use pretrained Discriminator networks with the corresponding `pretrain_model_D` option, which is particularly useful in cases where you would like to evaluate transfer learning, either by using the Discriminator as is or by freezing layers with the `FreezeD` option. It can also be useful in combination with feature matching, to use the discriminator feature maps to calculate a feature loss.
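Putting it together, a fine-tuning setup could declare the pretrained paths like this (the file locations are purely illustrative):

```yaml
# illustrative fine-tuning paths
path:
  pretrain_model_G: ../experiments/pretrained_models/RRDB_PSNR_x4.pth
  pretrain_model_D: ../experiments/pretrained_models/discriminator.pth  # optional, e.g. for transfer learning
  # resume_state: ../experiments/debug_001_RRDB_PSNR_x4_DIV2K/training_state/200.state  # alternatively, resume a full state
```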