In this project, we explore advancements in Generative Adversarial Networks (GANs) for high-fidelity image editing by re-implementing the paper \textit{Image2StyleGAN++: How to Edit the Embedded Images?}, presented at CVPR 2020. The authors, Rameen Abdal, Yipeng Qin, and Peter Wonka, expand upon their previous work, Image2StyleGAN, by introducing several significant enhancements.
The core method of the paper embeds images into the latent space of a pre-trained StyleGAN. This is achieved through an embedding algorithm that optimizes not only the latent code but also the noise maps and activation tensors, producing localized, high-fidelity edits on images.
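The embedding itself is an optimization loop: starting from an initial W+ code and noise maps, gradient descent minimizes a reconstruction loss between the generator output and the target image. Below is a minimal sketch of such a loop, assuming a generator G that accepts a W+ code plus per-layer noise maps and a perceptual-loss module percept (both hypothetical interfaces); the paper actually alternates between optimizing the latent code and the noise, which this joint loop simplifies.

\begin{verbatim}
import torch
import torch.nn.functional as F

def embed_image(G, target, percept, num_steps=1000, lr=0.01):
    # W+ code: one 512-d style vector per layer (18 layers for a
    # 1024x1024 StyleGAN; both numbers are assumptions of this sketch).
    w_plus = torch.zeros(1, 18, 512, requires_grad=True)
    # Per-layer noise maps, also optimized (the noise optimization step).
    noise = [torch.randn(1, 1, 2 ** (i // 2 + 2), 2 ** (i // 2 + 2),
                         requires_grad=True) for i in range(18)]
    opt = torch.optim.Adam([w_plus] + noise, lr=lr)

    for _ in range(num_steps):
        opt.zero_grad()
        recon = G(w_plus, noise)   # assumed generator call signature
        loss = F.mse_loss(recon, target) + percept(recon, target)
        loss.backward()
        opt.step()
    return w_plus.detach(), [n.detach() for n in noise]
\end{verbatim}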
The main contributions of the paper are threefold. First, the introduction of noise optimization to better reconstruct high-frequency detail in images, significantly improving output quality. Second, the extension of the global W+ latent space embedding to support local modifications and more controlled edits. Third, the combination of latent space embedding with direct manipulation of activation tensors, enabling both global and local edits and providing a powerful framework for image editing applications such as inpainting, style transfer, and feature modification.
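To make the third contribution concrete, the kind of activation-tensor manipulation the paper describes can be illustrated with a toy blend: run the early generator layers for two different embeddings, then combine the resulting feature maps with a spatial mask before finishing synthesis. The sketch below uses illustrative shapes, not the paper's actual layer sizes.

\begin{verbatim}
import torch

# Intermediate activations from two forward passes through the early
# generator layers (channel count and resolution are illustrative).
feat_a = torch.randn(1, 512, 32, 32)
feat_b = torch.randn(1, 512, 32, 32)

# Binary spatial mask: left half taken from A, right half from B.
mask = torch.zeros(1, 1, 32, 32)
mask[..., :16] = 1.0

# Masked blend; the result would be fed through the remaining generator
# layers to synthesize the locally edited image.
feat_blend = mask * feat_a + (1 - mask) * feat_b
\end{verbatim}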
We aimed to reproduce two specific results. The first is the merging of two halves of two different images into a single blended face, as seen below; this is significant because it produces a smooth blend rather than the hard seam obtained by manually cropping two images and pushing them together. The second is raising the PSNR of reconstructed images from roughly 19-22 dB to 39-45 dB. This matters because a higher peak signal-to-noise ratio means the embedded image is reconstructed with much higher fidelity than before.
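For reference, the peak signal-to-noise ratio between a target image $I$ and its reconstruction $\hat{I}$, both with 8-bit pixel values, is defined via the mean squared error:
\[
\mathrm{PSNR}(I,\hat{I}) = 10\,\log_{10}\!\left(\frac{255^2}{\mathrm{MSE}(I,\hat{I})}\right),
\qquad
\mathrm{MSE}(I,\hat{I}) = \frac{1}{N}\sum_{p=1}^{N}\left(I_p - \hat{I}_p\right)^2,
\]
so going from roughly 20 dB to 40 dB corresponds to about a 100-fold reduction in mean squared error.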
For this re-implementation, we started from the Image2StyleGAN setup and a pre-trained StyleGAN model, which we loaded from a pickle file. The main changes to the pipeline were the loss function and the addition of noise optimization. Where the original paper used only an MSE loss and a perceptual loss, we extended the objective to four masked terms so that style could be transferred between two images. The new term is a style loss computed from the features of the third block of a VGG-16 network pre-trained on ImageNet, applied to the style image, while the perceptual and MSE terms are applied to the other image. We also had to implement the masks themselves: spatial filters that specify which regions of the output are taken from image one and which from image two. Three separate masks were needed because the loss terms operate at different resolutions, but they are all meant to cover roughly the same region of the image. This was trickiest for the perceptual loss, because its feature maps come in several sizes, so the same matrix could not be reused; instead we kept a few masks of different sizes representing the same region. These masks are passed to the loss function as hyperparameters alongside the two images. To blend two images smoothly, as in the scribble-editing result, a blurred (soft) mask can be used. Some of the mask shapes in the paper were complicated to reproduce as filters, so we chose the one that combines two images along a vertical split. A rough sketch of this masked loss is given below.
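The sketch below shows one plausible form of such a masked loss, assuming a VGG-16 feature extractor for the perceptual and style terms and per-resolution masks that the caller has already resized to match each feature map. Layer indices, weights, and function names are illustrative, not the paper's exact values.

\begin{verbatim}
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# VGG-16 pre-trained on ImageNet; we tap a few conv layers for the
# perceptual term and one block for the style term (indices illustrative).
vgg = vgg16(pretrained=True).features.eval()
PERCEPT_LAYERS = [3, 8, 15, 22]
STYLE_LAYER = 15

def vgg_features(x, layers):
    feats, out = [], x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in layers:
            feats.append(out)
        if i == max(layers):
            break
    return feats

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def masked_loss(recon, img1, img2, mask_px, masks_feat, style_weight=1.0):
    # recon, img1, img2: (1, 3, H, W); mask_px: (1, 1, H, W);
    # masks_feat: one mask per perceptual layer, resized to that feature map.
    # Pixel-space MSE: image 1 inside the mask, image 2 outside it.
    loss = F.mse_loss(mask_px * recon, mask_px * img1) \
         + F.mse_loss((1 - mask_px) * recon, (1 - mask_px) * img2)

    # Masked perceptual term against image 1.
    for fr, f1, m in zip(vgg_features(recon, PERCEPT_LAYERS),
                         vgg_features(img1, PERCEPT_LAYERS), masks_feat):
        loss = loss + F.mse_loss(m * fr, m * f1)

    # Style term (Gram matrices) against image 2.
    s_recon = vgg_features(recon, [STYLE_LAYER])[0]
    s_img2 = vgg_features(img2, [STYLE_LAYER])[0]
    loss = loss + style_weight * F.mse_loss(gram(s_recon), gram(s_img2))
    return loss
\end{verbatim}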
We ran everything in a Jupyter notebook, so reproducing our results is as simple as running the notebook. We used PyTorch, NumPy, and pickle to load the pre-trained StyleGAN network.
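A minimal sketch of the loading step is shown below. It assumes the StyleGAN2-ADA PyTorch pickle format, where the pickle holds a dictionary with a 'G_ema' generator and the official repository's dnnlib and torch_utils packages must be importable for unpickling; the exact code depends on which pre-trained pickle is used.

\begin{verbatim}
import pickle
import torch

# Assumes the official stylegan2-ada-pytorch repo is on sys.path so the
# pickled classes (dnnlib, torch_utils) can be resolved during unpickling.
with open('pretrained_stylegan.pkl', 'rb') as f:   # path to the pickle file
    G = pickle.load(f)['G_ema'].eval()             # EMA generator snapshot

device = 'cuda' if torch.cuda.is_available() else 'cpu'
G = G.to(device)

# Sanity check: sample a random z, map it to a W+ code, synthesize an image.
z = torch.randn(1, G.z_dim, device=device)
w_plus = G.mapping(z, None)                        # shape (1, num_ws, w_dim)
img = G.synthesis(w_plus, noise_mode='const')
\end{verbatim}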
We were able to reproduce the results by running on a single V100 GPU for 20-30 minutes.
The results of the combined image are clearly very similar to those in the paper: our output is a seamless blend between the two faces, much like the published result. We could not use the exact Joker and Ryan Gosling images from the paper, because the underlying dataset contains millions of images and finding those specific ones would have been like finding a needle in a haystack. To avoid combing through the whole dataset, we chose two other images and applied the same filters. The fact that we could reproduce the effect shows that the method is reimplementable and that the paper's contribution holds up. For comparison, we are not Photoshop experts, but we also combined the two halves manually as best we could; there is a little whitespace at the seam, yet for the purpose of comparing each half against the corresponding half of the generated image, we believe it shows the method does a markedly better job than a manual cut-and-paste.

In addition, below we show the results of our noise optimization: we take an image, embed it, and then feed the resulting embedding and noise vectors back into the pre-trained model. We achieved PSNR values similar to those reported in the paper, and we were also able to reconstruct the picture of the car with a PSNR around 40 dB, as shown below. By achieving image blending and significantly higher PSNR values, our project validates the authors' advances in noise optimization and latent space manipulation. These results not only replicate the high-fidelity editing capabilities presented in the paper but also suggest broader applications for GANs in digital media, art restoration, and beyond. Future investigations could explore the impact of different hyperparameters and mask designs, potentially making these editing tools more accessible and tailored to specific user needs.

Our reimplementation also highlighted several broader lessons about image editing with GANs through the Image2StyleGAN++ framework. A key takeaway is the efficacy of leveraging pre-trained models for image editing tasks: this approach allows significant gains in image quality without developing new models from scratch. By fine-tuning the system on specific images, we achieved results that reinforce the capability of advanced GAN architectures to produce high-quality, detailed images. The project also had its challenges. The lack of detailed documentation on hyperparameters and mask configurations in the original paper made it difficult to replicate specific results, which underscores the need for thorough reporting to enable true reproducibility and to let other researchers build on existing work without unnecessary guesswork.

Several future directions follow from these experiences. Further research could explore the impact of different masking techniques on the quality and style of edited images; experimenting with various shapes, sizes, and configurations of masks could deepen our understanding of how localized changes affect the overall aesthetics and realism of synthetic images. Developing a more intuitive interface for adjusting masks and hyperparameters could also democratize access to advanced image editing, making these powerful tools usable by non-experts. Finally, extending the framework to other types of data, such as video or three-dimensional models, could significantly broaden its applicability.
Such expansions could revolutionize fields such as film production, video games, and virtual reality, where realistic and customizable visual content is crucial. Overall, our reimplementation of Image2StyleGAN++ not only validated its effectiveness but also opened avenues for further exploration and innovation in image editing technology. These findings encourage ongoing experimentation and adaptation in the evolving landscape of artificial intelligence and computer vision.
References:
Abdal, R., Qin, Y., and Wonka, P. Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? ICCV 2019. https://arxiv.org/abs/1904.03189
Abdal, R., Qin, Y., and Wonka, P. Image2StyleGAN++: How to Edit the Embedded Images? CVPR 2020. https://arxiv.org/abs/1911.11544
Guarnizo, O. Review: Image2StyleGAN: Embedding an Image into StyleGAN. Medium. https://oscar-guarnizo.medium.com/review-image2stylegan-embedding-an-image-into-stylegan-c7989e345271