
C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

Juntao Zhang¹, Yuehuai Liu¹, Yu-Wing Tai², Chi-Keung Tang¹
¹HKUST, ²Dartmouth College

arXiv

Abstract

We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture that takes conditions from multiple modalities and synthesizes multimodal content simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multiple modalities to the same semantic latent space using modality-specific encoders based on contrastive training. It then generates multimodal outputs from the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. With this system design, our model offers an improved solution for joint-modality generation by learning and explaining multimodal conditions instead of simply taking linear interpolations in the latent space. Meanwhile, because conditions are aligned to a unified latent space, C3Net requires only one trainable Control C3-UNet to work on multimodal semantic information. Furthermore, our model employs unimodal pretraining in the condition alignment stage, outperforming non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound-condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with prior and contemporary state-of-the-art multimodal generation. Our code and tri-modal dataset will be released.
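For intuition, below is a minimal PyTorch sketch of the two stages described above: modality-specific encoders that project conditions into a shared semantic latent space, and a trainable combiner standing in for Control C3-UNet that fuses the aligned latents rather than linearly interpolating them. All module names, dimensions, and layer choices are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the C3Net idea, NOT the released implementation.
# Module names, sizes, and the combination rule are illustrative assumptions.

LATENT_DIM = 768  # assumed size of the shared semantic latent space


class ModalityEncoder(nn.Module):
    """Toy modality-specific encoder mapping a raw feature vector of one
    modality into the shared semantic latent space. In C3Net these encoders
    are trained contrastively; here we only show the projection interface."""

    def __init__(self, in_dim: int, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so conditions from different modalities are comparable.
        return nn.functional.normalize(self.proj(x), dim=-1)


class ControlC3UNetSketch(nn.Module):
    """Stand-in for the trainable Control C3-UNet: it consumes the aligned
    condition latents and produces a single learned compound condition,
    instead of a plain linear interpolation of the latents."""

    def __init__(self, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, cond_latents: torch.Tensor) -> torch.Tensor:
        # cond_latents: (batch, num_modalities, latent_dim)
        mixed, _ = self.attn(cond_latents, cond_latents, cond_latents)
        # Pool across modalities into one compound condition vector.
        return self.mlp(mixed.mean(dim=1))


if __name__ == "__main__":
    # Hypothetical per-modality feature sizes (e.g., from pretrained backbones).
    text_enc = ModalityEncoder(512)
    image_enc = ModalityEncoder(1024)
    audio_enc = ModalityEncoder(768)
    control = ControlC3UNetSketch()

    batch = 2
    latents = torch.stack(
        [
            text_enc(torch.randn(batch, 512)),
            image_enc(torch.randn(batch, 1024)),
            audio_enc(torch.randn(batch, 768)),
        ],
        dim=1,
    )  # (batch, 3 modalities, LATENT_DIM)

    compound_condition = control(latents)
    print(compound_condition.shape)  # torch.Size([2, 768])
```

The combiner stage is the point the abstract makes: the compound condition is learned from the aligned latents rather than obtained by averaging them.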

Pipeline

Details

Compound Multimodal Conditioned Synthesis

Download weights

All weights should be placed under the ./checkpoint directory. You can download them from weights, which include the weights of Control C3-UNet.
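A small sanity check like the following can confirm the files landed where the code expects them; the checkpoint file name below is a placeholder, not the actual name in the release.

```python
from pathlib import Path

# Sanity check that downloaded weights are in place under ./checkpoint.
# "control_c3_unet.pth" is a hypothetical file name used for illustration;
# substitute the actual file names from the weights release.
CKPT_DIR = Path("./checkpoint")
expected = ["control_c3_unet.pth"]

missing = [name for name in expected if not (CKPT_DIR / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing checkpoints under {CKPT_DIR}: {missing}")
print("All expected checkpoints found.")
```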

Acknowledgements

We would like to thank the contributors to the CoDi and SSAST repositories for their open research and exploration.

Citation

@INPROCEEDINGS{C3Net2024,
      title={C3Net: Compound Conditioned ControlNet for Multimodal Content Generation}, 
      author={Juntao Zhang and Yuehuai Liu and Yu-Wing Tai and Chi-Keung Tang},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2024}
}


