Hi, excellent work! A question here:
I noticed that at svd_interpolate_single_img_traj.py: 1082, the mask is reshaped directly from (576, 1024) to (72, 8, 128, 8) and then to (72, 128, 64).
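To make the shapes concrete, here is a minimal NumPy sketch of that kind of reshape. It is not the repository's code: the function name `mask_to_latent_patches` is hypothetical, and the transpose between the two reshapes is my assumption about how the 8×8 axes get grouped, since only the shapes are given above.

```python
import numpy as np

def mask_to_latent_patches(mask, factor=8):
    """Group a pixel mask into one flat patch per latent cell.

    Hypothetical helper: `factor=8` follows from 576 -> 72 and 1024 -> 128.
    """
    h, w = mask.shape
    assert h % factor == 0 and w % factor == 0
    # (576, 1024) -> (72, 8, 128, 8): split each axis into (cell, offset).
    blocks = mask.reshape(h // factor, factor, w // factor, factor)
    # Assumed step: bring the two offset axes together -> (72, 128, 8, 8).
    blocks = blocks.transpose(0, 2, 1, 3)
    # Flatten each 8x8 patch -> (72, 128, 64).
    return blocks.reshape(h // factor, w // factor, factor * factor)

# Example mask: the top half of the frame is "known".
mask = np.zeros((576, 1024), dtype=np.float32)
mask[:288, :] = 1.0

patches = mask_to_latent_patches(mask)
print(patches.shape)
```

With this grouping, `patches[i, j]` holds the 64 mask values of the pixel block `mask[8*i:8*i+8, 8*j:8*j+8]`, i.e. the block that sits at latent position (i, j).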
As far as I understand, this reshaped mask is then used in latent space for Eq. (14) and the equations that follow. But why should the mask line up with the latents produced by the VAE? In pixel space, the known regions of the warped images are exactly the regions where the original mask is 1. How can we be sure that this correspondence still holds after the images are encoded into latent space?
Thanks for your interest in our project. The VAE uses a CNN to map the image from pixel space to latent space. Convolutional layers capture spatial structure, so the latent representation retains a lower-resolution version of the original image's spatial layout. The encoder therefore preserves the spatial relationships between different parts of the image, even though the resolution is reduced.
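A toy illustration of this spatial correspondence, as a rough sketch only: `toy_encoder` below is a hypothetical stand-in that averages each 8×8 pixel block, not the actual VAE. A real encoder stacks convolutions whose receptive fields extend beyond a single 8×8 patch, so the alignment it gives is approximate, especially near mask boundaries.

```python
import numpy as np

def toy_encoder(image, factor=8):
    """Hypothetical stand-in for a strided encoder: mean over each 8x8 block."""
    h, w = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))  # (576, 1024) -> (72, 128)

rng = np.random.default_rng(0)
image = rng.random((576, 1024))

# Perturb exactly one 8x8 pixel block: block row 10, block column 20.
perturbed = image.copy()
perturbed[80:88, 160:168] += 1.0

# Find which latent cells changed.
diff = np.abs(toy_encoder(perturbed) - toy_encoder(image))
changed = np.argwhere(diff > 1e-9)
print(changed)
```

In this toy setting only the latent cell at (10, 20) changes, which is the sense in which a downsampled mask can be indexed at the same positions as the latents.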