UniFormerV2 is a generic paradigm for building a powerful family of video networks by arming pre-trained ViTs with efficient UniFormer designs. It achieves state-of-the-art recognition performance on 8 popular video benchmarks: the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400.
This is an unofficial Keras implementation of UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer. The official PyTorch code is here.
- [24-10-2023]: The Kinetics-400 test dataset is available on Kaggle, link.
- [20-10-2023]: Fine-tuning on GPU(s) and TPU-VM is supported, colab.
- [19-10-2023]: UFV2 checkpoints for HACS are available, link.
- [19-10-2023]: UFV2 checkpoints for ActivityNet are available, link.
- [18-10-2023]: UFV2 checkpoints for Moments in Time are available, link.
- [18-10-2023]: UFV2 checkpoints for K710 are available, link.
- [17-10-2023]: UFV2 checkpoints for SSV2 are available, link.
- [17-10-2023]: UFV2 checkpoints for Kinetics-600/700 are available, link.
- [16-10-2023]: UFV2 checkpoints for Kinetics-400 are available, link.
- [15-10-2023]: The Keras code for UniFormerV2 (UFV2) is available.
git clone https://github.com/innat/UniFormerV2.git
cd UniFormerV2
pip install -e .
The UniFormerV2 checkpoints are available in both SavedModel and H5 formats for a total of 8 datasets, i.e. Kinetics-400/600/700/710, Something Something V2, Moments in Time V1, ActivityNet and HACS. The model comes in two variants, base and large. Each variant may have further variations for different input sizes and numbers of input frames. That gives around 35 checkpoints for UniFormerV2. Check this release and the model zoo page for details. Also check model_configs.py for an overview of the available model configs. Following are some highlights.
Inference
>>> import numpy as np
>>> import tensorflow as tf
>>> from uniformerv2 import UniFormerV2
>>> model = UniFormerV2(name='K400_B16_8x224')
>>> model.load_weights('TFUniFormerV2_K400_B16_8x224.h5')
>>> # read_video, frame_sampling and label_map_inv must be defined beforehand
>>> container = read_video('sample.mp4')
>>> frames = frame_sampling(container, num_frames=8)
>>> y = model(frames)  # logits, one row per clip
>>> y.shape
TensorShape([1, 400])
>>> probabilities = tf.nn.softmax(y)
>>> probabilities = probabilities.numpy().squeeze(0)
>>> # map each class name to its probability, most likely class first
>>> confidences = {
...     label_map_inv[i]: float(probabilities[i])
...     for i in np.argsort(probabilities)[::-1]
... }
>>> confidences
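The `frame_sampling` helper is not defined in the snippet above. As a rough illustration of what such a step might do, here is a minimal sketch of uniform temporal sampling in plain Python. The function name `uniform_frame_indices` and its exact behavior are assumptions made for this example, not the repository's implementation. It splits the clip into `num_frames` equal segments and takes the center frame of each.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Return num_frames indices spread evenly over a clip of total_frames.

    Hypothetical helper for illustration only.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Length of each temporal segment, possibly fractional.
    segment = total_frames / num_frames
    # Take the center of every segment; short clips simply repeat frames.
    return [int(segment * i + segment / 2) for i in range(num_frames)]
```

For an 8-frame model and an 80-frame clip, this selects frames 5, 15, ..., 75. For a clip shorter than `num_frames`, some indices repeat.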
Classification results on a sample from Kinetics-400.
[Table: sample video and its top-5 predictions]
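The inference snippet builds a dictionary over all 400 classes, while the results are reported as a top-5 list. A minimal sketch of that reduction, written in plain Python so it runs without TensorFlow, is given below. The function name `top_k_confidences` is hypothetical, and `labels` stands in for `label_map_inv` from the snippet.

```python
import math


def top_k_confidences(logits, labels, k=5):
    """Softmax the logits and return the k most probable (label, prob) pairs.

    Hypothetical helper for illustration only.
    """
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Order class indices by descending probability and keep the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(labels[i], probs[i]) for i in order]
```

With the model output `y` from above, the input would be a single row of logits, for example `y.numpy()[0].tolist()`.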
Fine Tune
Each UniFormerV2 checkpoint returns logits. We can simply add a custom classifier on top of it. A sample is shown below. See the Colab notebook linked above for more details.
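from tensorflow import keras
from tensorflow.keras import layers
from uniformerv2 import UniFormerV2

# Load a pretrained model, e.g. an ActivityNet checkpoint.
model_name = 'ANET_L14_16x224'
uniformer_v2 = UniFormerV2(name=model_name)
uniformer_v2.load_weights(f'TFUniFormerV2_{model_name}.h5')
uniformer_v2.trainable = False  # freeze the pretrained weights

# Downstream model: a new linear classifier on top of the frozen model.
# class_folders (one entry per target class) must be defined beforehand.
model = keras.Sequential([
    uniformer_v2,
    layers.Dense(
        len(class_folders), dtype='float32', activation=None
    )
])
model.compile(...)
model.fit(...)
model.predict(...)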
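The fine-tuning sample uses `class_folders` without defining it, and the inference sample uses `label_map_inv` the same way. Assuming `class_folders` is a list of directory names with one directory per class, the two lookups could be derived as in the sketch below. The function name `build_label_maps` is hypothetical and is not part of the uniformerv2 package.

```python
def build_label_maps(class_folders):
    """Map class names to integer ids and back, in sorted name order.

    Hypothetical helper for illustration only.
    """
    # Sort so that ids are stable across runs and machines.
    names = sorted(set(class_folders))
    label_map = {name: idx for idx, name in enumerate(names)}
    label_map_inv = {idx: name for name, idx in label_map.items()}
    return label_map, label_map_inv
```

The length of `label_map` then matches the number of units in the final `Dense` layer.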
The UniFormerV2 checkpoints are listed in MODEL_ZOO.md.
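The two checkpoint names used in this README, `K400_B16_8x224` and `ANET_L14_16x224`, appear to encode the dataset, the model variant, the number of input frames and the input size, and the weight files are named `TFUniFormerV2_{name}.h5`. The sketch below parses names of that shape. The pattern is inferred from these two examples only, so other entries in the model zoo may not match it, and `parse_checkpoint_name` is a hypothetical helper rather than part of the package.

```python
import re

# Assumed layout: <dataset>_<variant>_<frames>x<size>, e.g. K400_B16_8x224.
_NAME_PATTERN = re.compile(
    r"^(?P<dataset>[A-Za-z0-9]+)_(?P<variant>[BL]\d+)_(?P<frames>\d+)x(?P<size>\d+)$"
)


def parse_checkpoint_name(name):
    """Split a checkpoint name into its parts and derive the H5 file name."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    return {
        "dataset": match["dataset"],
        "variant": match["variant"],  # B... or L..., presumably base or large
        "num_frames": int(match["frames"]),
        "input_size": int(match["size"]),
        "h5_file": f"TFUniFormerV2_{name}.h5",
    }
```

For example, `parse_checkpoint_name('ANET_L14_16x224')` reports 16 input frames at size 224 and the file name `TFUniFormerV2_ANET_L14_16x224.h5`.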
- Custom fine-tuning code.
- Publish on TF-Hub.
- Support Keras V3 to enable multi-framework backends.
If you use this UniFormerV2 implementation in your research, please cite it using the metadata from our CITATION.cff file.
@misc{li2022uniformerv2,
title={UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer},
author={Kunchang Li and Yali Wang and Yinan He and Yizhuo Li and Yi Wang and Limin Wang and Yu Qiao},
year={2022},
eprint={2211.09552},
archivePrefix={arXiv},
primaryClass={cs.CV}
}