diff --git a/MODEL_ZOO.md b/MODEL_ZOO.md
index cbc1d96..2e0debc 100644
--- a/MODEL_ZOO.md
+++ b/MODEL_ZOO.md
@@ -1,37 +1,42 @@
 # Video Swin Transformer Model Zoo
 
-Video Swin in `keras` can be used with multiple backends, i.e. `tensorflow`, `torch`, and `jax`.
+Video Swin in `keras` can be used with multiple backends, i.e. `tensorflow`, `torch`, and `jax`. The input shape is expected to be `channel_last`, i.e. `(depth, height, width, channel)`.
 
 ## Note
 
+When evaluating the video model for the classification task, multiple clips are sampled from each video and multiple spatial crops are taken from each clip (a minimal sketch follows this list).
+
 - `#Frame = #input_frame x #clip x #crop`. The frame interval is `2` to evaluate on benchmark dataset.
-- `#input_frame` means how many frames are input for model during the test phase.
+- `#input_frame` means how many frames are fed to the model during the test phase. For Video Swin, it is `32`.
 - `#crop` means spatial crops (e.g., 3 for left/right/center crop).
 - `#clip` means temporal clips (e.g., 5 means repeted temporal sampling five clips with different start indices).
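+
+A minimal sketch of this multi-view evaluation, assuming `model` is any VideoSwin classifier and `video` is an already decoded array of shape `(total_frames, height, width, 3)`; both names (and the helper itself) are illustrative placeholders, not part of the package API:
+
+```python
+import numpy as np
+import keras
+
+def multi_view_logits(model, video, num_clips=4, num_crops=3,
+                      num_frames=32, frame_interval=2, crop_size=224):
+    """Average predictions over `num_clips` temporal clips x `num_crops` spatial crops."""
+    total, height, width, _ = video.shape
+    span = num_frames * frame_interval
+    clip_starts = np.linspace(0, max(total - span, 0), num_clips).astype(int)
+    y0 = max((height - crop_size) // 2, 0)
+    x_starts = np.linspace(0, max(width - crop_size, 0), num_crops).astype(int)
+    logits = []
+    for start in clip_starts:
+        # sample 32 frames with a frame interval of 2, wrapping on short videos
+        frame_idx = (start + np.arange(num_frames) * frame_interval) % total
+        clip = video[frame_idx]
+        for x0 in x_starts:  # e.g. left / center / right crops
+            view = clip[:, y0:y0 + crop_size, x0:x0 + crop_size, :].astype("float32")
+            logits.append(keras.ops.convert_to_numpy(model(view[None, ...])))
+    return np.mean(np.concatenate(logits, axis=0), axis=0)
+```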
 
-### Kinetics 400
-In the training phase, the video swin mdoels are initialized with the pretrained weights of image swin models. In that case, `IN` referes to **ImageNet**.
+# Checkpoints
+
+In the training phase, the video swin models are initialized with the pretrained weights of the image swin models; here, `IN` refers to **ImageNet**. In the following tables, the `keras` checkpoints are complete models, so the `keras.saving.load_model` API can be used directly. In contrast, the `h5` checkpoints contain only the weights.
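+
+For example, a minimal loading sketch for either format (the file names are placeholders for the files linked in the tables below):
+
+```python
+import keras
+from videoswin import VideoSwinT
+
+# `keras` checkpoint: the complete model (architecture + weights) in one file.
+model = keras.saving.load_model("model.keras")
+
+# `h5` checkpoint: weights only, so build the architecture first, then load.
+model = VideoSwinT(num_classes=400, include_rescaling=False, activation=None)
+_ = model(keras.ops.ones((1, 32, 224, 224, 3)))  # build the variables
+model.load_weights("model.weights.h5")
+```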
+
+### Kinetics 400
 
-| Backbone | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints |
-| :---: | :---: | :---: | :---: | :---: | :---: |
-| Swin-T | IN-1K | 32x4x3 | 78.8 | 93.6 | [SavedModel](https://github.com/innat/VideoSwin/releases/download/v1.1/TFVideoSwinT_K400_IN1K_P244_W877_32x224.zip)/[h5](https://github.com/innat/VideoSwin/releases/download/v1.0/TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5) |
-| Swin-S | IN-1K | 32x4x3 | 80.6 | 94.5 | [SavedModel](https://github.com/innat/VideoSwin/releases/download/v1.1/TFVideoSwinS_K400_IN1K_P244_W877_32x224.zip)/[h5](https://github.com/innat/VideoSwin/releases/download/v1.0/TFVideoSwinS_K400_IN1K_P244_W877_32x224.h5) |
-| Swin-B | IN-1K | 32x4x3 | 80.6 | 94.6 | [SavedModel](https://github.com/innat/VideoSwin/releases/download/v1.1/TFVideoSwinB_K400_IN1K_P244_W877_32x224.zip)/[h5](https://github.com/innat/VideoSwin/releases/download/v1.0/TFVideoSwinB_K400_IN1K_P244_W877_32x224.h5) |
-| Swin-B | IN-22K | 32x4x3 | 82.7 | 95.5 | [SavedModel](https://github.com/innat/VideoSwin/releases/download/v1.1/TFVideoSwinB_K400_IN22K_P244_W877_32x224.zip)/[h5](https://github.com/innat/VideoSwin/releases/download/v1.0/TFVideoSwinB_K400_IN22K_P244_W877_32x224.h5) |
+| Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | Config |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Swin-T | IN-1K | 32x4x3 | 78.8 | 93.6 | [keras]()/[h5]() | [swin-t](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py) |
+| Swin-S | IN-1K | 32x4x3 | 80.6 | 94.5 | [keras]()/[h5]() | [swin-s](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py) |
+| Swin-B | IN-1K | 32x4x3 | 80.6 | 94.6 | [keras]()/[h5]() | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics400_1k.py) |
+| Swin-B | IN-22K | 32x4x3 | 82.7 | 95.5 | [keras]()/[h5]() | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py) |
 
 ### Kinetics 600
 
-| Backbone | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints |
-| :---: | :---: | :---: | :---: | :---: | :---: |
-| Swin-B | IN-22K | 32x4x3 | 84.0 | 96.5 | [SavedModel](https://github.com/innat/VideoSwin/releases/download/v1.1/TFVideoSwinB_K600_IN22K_P244_W877_32x224.zip)/[h5](https://github.com/innat/VideoSwin/releases/download/v1.0/TFVideoSwinB_K600_IN22K_P244_W877_32x224.h5) |
+| Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | Config |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Swin-B | IN-22K | 32x4x3 | 84.0 | 96.5 | [keras]()/[h5]() | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py) |
 
 ### Something-Something V2
 
-| Backbone | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints |
-| :---: | :---: | :---: | :---: | :---: | :---: |
-| Swin-B | Kinetics 400 | 32x1x3 | 69.6 | 92.7 | [SavedModel](https://github.com/innat/VideoSwin/releases/download/v1.1/TFVideoSwinB_SSV2_K400_P244_W1677_32x224.zip)/[h5](https://github.com/innat/VideoSwin/releases/download/v1.0/TFVideoSwinB_SSV2_K400_P244_W1677_32x224.h5) |
+| Model | Pretrain | #Frame | Top-1 | Top-5 | Checkpoints | Config |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Swin-B | Kinetics 400 | 32x1x3 | 69.6 | 92.7 | [keras]()/[h5]() | [swin-b](https://github.com/SwinTransformer/Video-Swin-Transformer/blob/master/configs/recognition/swin/swin_base_patch244_window1677_sthv2.py) |
 
 ## Weight Comparison
 
diff --git a/README.md b/README.md
index f91f4f4..c7792cd 100644
--- a/README.md
+++ b/README.md
@@ -4,22 +4,14 @@
 
 [![Palestine](https://img.shields.io/badge/Free-Palestine-white?labelColor=green)](https://twitter.com/search?q=%23FreePalestine&src=typed_query)
 
-[![arXiv](https://img.shields.io/badge/arXiv-2106.13230-darkred)](https://arxiv.org/abs/2106.13230) [![keras-2.12.](https://img.shields.io/badge/keras-2.12-darkred)]([?](https://img.shields.io/badge/keras-2.12-darkred)) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Q7A700MEI10UomikqjQJANWyFZktJCT-?usp=sharing) [![HugginFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow.svg)](https://huggingface.co/spaces/innat/VideoSwin) [![HugginFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Hub-yellow.svg)](https://huggingface.co/innat/videoswin)
+[![arXiv](https://img.shields.io/badge/arXiv-2106.13230-darkred)](https://arxiv.org/abs/2106.13230) [![keras-3](https://img.shields.io/badge/keras-3-darkred)]([?](https://img.shields.io/badge/keras-2.12-darkred)) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Q7A700MEI10UomikqjQJANWyFZktJCT-?usp=sharing) [![HuggingFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-yellow.svg)](https://huggingface.co/spaces/innat/VideoSwin) [![HuggingFace badge](https://img.shields.io/badge/🤗%20Hugging%20Face-Hub-yellow.svg)](https://huggingface.co/innat/videoswin)
 
 VideoSwin is a pure transformer based video modeling algorithm, attained top accuracy on the major video recognition benchmarks. In this model, the author advocates an inductive bias of locality in video transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the [**Swin Transformer**](https://arxiv.org/abs/2103.14030) designed for the image domain, while continuing to leverage the power of pre-trained image models.
 
-This is a unofficial `Keras` implementation of [Video Swin transformers](https://arxiv.org/abs/2106.13230). The official `PyTorch` implementation is [here](https://github.com/SwinTransformer/Video-Swin-Transformer) based on [mmaction2](https://github.com/open-mmlab/mmaction2).
+This is an unofficial `Keras 3` implementation of [Video Swin transformers](https://arxiv.org/abs/2106.13230). The official `PyTorch` implementation is [here](https://github.com/SwinTransformer/Video-Swin-Transformer), based on [mmaction2](https://github.com/open-mmlab/mmaction2). The official PyTorch weights have been converted to be `Keras 3` compatible, and the implementation supports running the model on multiple backends, i.e. TensorFlow, PyTorch, and JAX.
 
-## News
-
-- **[24-10-2023]**: [Kinetics-400](https://www.deepmind.com/open-source/kinetics) test data set can be found on kaggle, [link](https://www.kaggle.com/datasets/ipythonx/k4testset/data?select=videos_val).
-- **[14-10-2023]**: VideoSwin integrated into [Huggingface Space](https://huggingface.co/spaces/innat/VideoSwin).
-- **[12-10-2023]**: GPU(s), TPU-VM for fine-tune training are supported, [colab](https://github.com/innat/VideoSwin/blob/main/notebooks/videoswin_video_classification.ipynb).
-- **[09-10-2023]**: TensorFlow [SavedModel](https://www.tensorflow.org/guide/saved_model) (formet) checkpoints, [link](https://github.com/innat/VideoSwin/releases/tag/v1.1).
-- **[08-10-2023]**: VideoSwin checkpoints [SSV2](https://developer.qualcomm.com/software/ai-datasets/something-something) and [Kinetics-600](https://www.deepmind.com/open-source/kinetics) becomes available, [link](https://github.com/innat/VideoSwin/releases/tag/v1.0).
-- **[07-10-2023]**: VideoSwin checkpoints on [Kinetics-400](https://www.deepmind.com/open-source/kinetics) becomes available, [link](https://github.com/innat/VideoSwin/releases/tag/v1.0).
-- **[06-10-2023]**: Code of VideoSwin in Keras becomes available.
 
 # Install
 
@@ -29,9 +21,10 @@
 cd VideoSwin
 pip install -e .
 ```
 
-# Usage
+# Checkpoints
+
+The **VideoSwin** checkpoints are available in both `.weights.h5` and `.keras` formats. The available variants are `tiny`, `small`, and `base`. Check the [model zoo](https://github.com/innat/VideoSwin/blob/main/MODEL_ZOO.md) page for details.
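+
+Because the implementation is backend agnostic, the same checkpoint can be used from TensorFlow, PyTorch, or JAX. A minimal smoke test on the JAX backend (random input, no pretrained weights; the constructor arguments mirror the inference example below):
+
+```python
+import os
+os.environ["KERAS_BACKEND"] = "jax"  # set before importing keras / videoswin
+
+import numpy as np
+from videoswin import VideoSwinT
+
+model = VideoSwinT(num_classes=400, include_rescaling=False, activation=None)
+preds = model(np.random.rand(1, 32, 224, 224, 3).astype("float32"))
+print(preds.shape)  # (1, 400)
+```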
 
-The **VideoSwin** checkpoints are available in both `SavedModel`, `H5`, and `.weights.H5` formats. The variants of this models are `tiny`, `small`, and `base`. Check this [release](https://github.com/innat/VideoSwin/releases/tag/v1.0) and [model zoo](https://github.com/innat/VideoSwin/blob/main/MODEL_ZOO.md) page to know details of it. Following are some hightlights.
 
 **Inference**
 
@@ -43,9 +36,13 @@ from videoswin import VideoSwinT
 >>> os.environ["KERAS_BACKEND"] = "torch"
 >>> from videoswin import VideoSwinT
 
->>> model = VideoSwinT(num_classes=400)
+>>> model = VideoSwinT(
+    num_classes=400,
+    include_rescaling=False,
+    activation=None
+)
 >>> _ = model(torch.ones((1, 32, 224, 224, 3)))
->>> model.load_weights('VideoSwinT_K400_IN1K_P244_W877_32x224.weights.h5')
+>>> model.load_weights('model.weights.h5')
 
 >>> container = read_video('sample.mp4')
 >>> frames = frame_sampling(container, num_frames=32)
@@ -91,28 +88,6 @@ model.fit(...)
 model.predict(...)
 ```
 
-**Attention Maps**
-
-By passing `return_attns=True` in the forward pass, we can get the attention scores from each basic block of the model as well. For example,
-
-```python
-from videomae import VideoSwinT
-
->>> model = VideoSwinT(num_classes=400)
->>> model.load_weights('TFVideoSwinT_K400_IN1K_P244_W877_32x224.h5')
->>> container = read_video('sample.mp4')
->>> frames = frame_sampling(container, num_frames=32)
->>> y, attns_scores = model(frames, return_attns=True)
-
-for k, v in attns_scores.items():
-    print(k, v.shape) # num_heads, depth, seq_len, seq_len
-TFBasicLayer1_att (128, 3, 392, 392)
-TFBasicLayer2_att (32, 6, 392, 392)
-TFBasicLayer3_att (8, 12, 392, 392)
-TFBasicLayer4_att (2, 24, 392, 392)
-```
-
-
 ## Model Zoo
 
 The 3D swin-video checkpoints are listed in [`MODEL_ZOO.md`](MODEL_ZOO.md). Following are some hightlights.