
C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

Juntao Zhang¹, Yuehuai Liu¹, Yu-Wing Tai², Chi-Keung Tang¹
¹HKUST, ²Dartmouth College

arXiv

Abstract

We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture that takes conditions from multiple modalities and synthesizes multimodal content simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multiple modalities to the same semantic latent space using modality-specific encoders based on contrastive training. It then generates multimodal outputs from the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. With this system design, our model offers an improved solution for joint-modality generation by learning and explaining multimodal conditions instead of simply taking linear interpolations in the latent space. Meanwhile, because conditions are aligned to a unified latent space, C3Net requires only one trainable Control C3-UNet to work on multimodal semantic information. Furthermore, our model employs unimodal pretraining in the condition alignment stage, outperforming non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound-condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with prior and contemporary state-of-the-art multimodal generation. Our code and tri-modal dataset will be released.
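For intuition, below is a minimal PyTorch sketch of the two stages described above: modality-specific encoders that project conditions into a shared semantic latent space, and a trainable combiner standing in for Control C3-UNet that fuses the aligned latents rather than linearly interpolating them. All module names, dimensions, and layer choices are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the C3Net idea, NOT the released implementation.
# Module names, sizes, and the combination rule are illustrative assumptions.

LATENT_DIM = 768  # assumed size of the shared semantic latent space


class ModalityEncoder(nn.Module):
    """Toy modality-specific encoder mapping a raw feature vector of one
    modality into the shared semantic latent space. In C3Net these encoders
    are trained contrastively; here we only show the projection interface."""

    def __init__(self, in_dim: int, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so conditions from different modalities are comparable.
        return nn.functional.normalize(self.proj(x), dim=-1)


class ControlC3UNetSketch(nn.Module):
    """Stand-in for the trainable Control C3-UNet: it consumes the aligned
    condition latents and produces a single learned compound condition,
    instead of a plain linear interpolation of the latents."""

    def __init__(self, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, cond_latents: torch.Tensor) -> torch.Tensor:
        # cond_latents: (batch, num_modalities, latent_dim)
        mixed, _ = self.attn(cond_latents, cond_latents, cond_latents)
        # Pool across modalities into one compound condition vector.
        return self.mlp(mixed.mean(dim=1))


if __name__ == "__main__":
    # Hypothetical per-modality feature sizes (e.g., from pretrained backbones).
    text_enc = ModalityEncoder(512)
    image_enc = ModalityEncoder(1024)
    audio_enc = ModalityEncoder(768)
    control = ControlC3UNetSketch()

    batch = 2
    latents = torch.stack(
        [
            text_enc(torch.randn(batch, 512)),
            image_enc(torch.randn(batch, 1024)),
            audio_enc(torch.randn(batch, 768)),
        ],
        dim=1,
    )  # (batch, 3 modalities, LATENT_DIM)

    compound_condition = control(latents)
    print(compound_condition.shape)  # torch.Size([2, 768])
```

The combiner stage is the point the abstract makes: the compound condition is learned from the aligned latents rather than obtained by averaging them.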

Pipeline

Details

Compound Multimodal Conditioned Synthesis

Download weights

All weights should be placed under the ./checkpoint directory. You can download them from weights, which include the weights of Control C3-UNet.
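A small sanity check like the following can confirm the files landed where the code expects them; the checkpoint file name below is a placeholder, not the actual name in the release.

```python
from pathlib import Path

# Sanity check that downloaded weights are in place under ./checkpoint.
# "control_c3_unet.pth" is a hypothetical file name used for illustration;
# substitute the actual file names from the weights release.
CKPT_DIR = Path("./checkpoint")
expected = ["control_c3_unet.pth"]

missing = [name for name in expected if not (CKPT_DIR / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing checkpoints under {CKPT_DIR}: {missing}")
print("All expected checkpoints found.")
```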

Acknowledgements

We would like to thank the contributors to the CoDi and SSAST repositories for their open research and exploration.

Citation

@INPROCEEDINGS{C3Net2024,
      title={C3Net: Compound Conditioned ControlNet for Multimodal Content Generation}, 
      author={Juntao Zhang and Yuehuai Liu and Yu-Wing Tai and Chi-Keung Tang},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2024}
}


