Official PyTorch implementation of "Scalable Neural Video Representations with Learnable Positional Features" (NeurIPS 2022) by Subin Kim*1, Sihyun Yu*1, Jaeho Lee2, and Jinwoo Shin1.
1KAIST, 2POSTECH
TL;DR: We propose a novel neural representation for videos that offers the best of both worlds: it achieves high-quality encoding and compute-/parameter-efficiency simultaneously.
Required packages are listed in `environment.yaml`.
Also, you should install the following packages:
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
pip install git+https://github.com/subin-kim-cv/tiny-cuda-nn/#subdirectory=bindings/torch
- This fork of tiny-cuda-nn differs slightly from the original tiny-cuda-nn implementation.
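To verify the installation, a quick import check can be run (a minimal sketch; `tinycudann` is the module name exposed by the tiny-cuda-nn PyTorch bindings):

```python
# Quick sanity check that PyTorch sees the GPU and the tiny-cuda-nn bindings import.
import torch
import tinycudann as tcnn

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("tiny-cuda-nn bindings imported:", tcnn.__name__)
```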
Download the UVG-HD dataset from the following link:
Then, extract RGB sequences from the original YUV videos of UVG-HD using ffmpeg. Here, `INPUT` is the input file name and `OUTPUT` is the directory in which to save the decompressed RGB frames.
ffmpeg -f rawvideo -vcodec rawvideo -s 1920x1080 -r 120 -pix_fmt yuv420p -i INPUT.yuv OUTPUT/f%05d.png
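To extract frames for all UVG-HD sequences in one go, the command above can be wrapped in a small loop; a sketch assuming the `.yuv` files are named after the videos listed in the Results section and stored in a single directory (all paths are hypothetical):

```python
# Hypothetical batch extraction; directory and file names are assumptions.
import subprocess
from pathlib import Path

yuv_dir = Path("~/data/UVG").expanduser()      # directory with the raw .yuv files
out_root = Path("~/data").expanduser()         # where the per-video frame folders go
videos = ["Beauty", "Bosphorus", "Honeybee", "Jockey", "ReadySetGo", "ShakeNDry", "Yachtride"]

for name in videos:
    out_dir = out_root / name
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-f", "rawvideo", "-vcodec", "rawvideo", "-s", "1920x1080",
        "-r", "120", "-pix_fmt", "yuv420p",
        "-i", str(yuv_dir / f"{name}.yuv"), str(out_dir / "f%05d.png"),
    ], check=True)
```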
Run the following script with a single GPU.
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/train_video.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --dataset <DATASET> --num_frames <NUM_FRAMES> --config ./config/config_nvp_s.json
- Option `--logging_root` denotes the path in which to save the experiment logs.
- Option `--experiment_name` denotes the subdirectory under `--logging_root` in which the log files (results, checkpoints, configuration, etc.) are saved.
- Option `--dataset` denotes the path of the RGB sequences (e.g., `~/data/Jockey`).
- Option `--num_frames` denotes the number of frames to reconstruct (300 for the ShakeNDry video and 600 for the other videos in UVG-HD).
- To reconstruct videos with 300 frames, change the value of `t_resolution` in the configuration file to 300 (a minimal sketch follows this list).
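A minimal sketch for changing `t_resolution` programmatically; the exact nesting of the key inside `config_nvp_s.json` is an assumption here, so the helper simply rewrites every `t_resolution` entry it finds:

```python
# Set every "t_resolution" entry in the config to 300 by walking the whole JSON tree.
import json

def set_t_resolution(path, value=300):
    with open(path) as f:
        cfg = json.load(f)

    def walk(node):
        if isinstance(node, dict):
            for key, val in node.items():
                if key == "t_resolution":
                    node[key] = value
                else:
                    walk(val)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(cfg)
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)

set_t_resolution("./config/config_nvp_s.json")
```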
Evaluation without compression of parameters (i.e., quantization only).
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/eval.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --dataset <DATASET> --num_frames <NUM_FRAMES> --config ./logs_nvp/<EXPERIMENT_NAME>/config_nvp_s.json
- Option `--save` denotes whether to save the reconstructed frames (see the PSNR sketch after this list).
- One can specify the option `--s_interp` for video super-resolution results; it denotes the super-resolution scale (e.g., 8).
- One can specify the option `--t_interp` for video frame-interpolation results; it denotes the temporal interpolation scale (e.g., 8).
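When `--save` is enabled, the reconstructed frames can be compared against the ground-truth frames directly; a minimal PSNR sketch, assuming both are stored as same-sized PNGs with matching file names (all paths below are hypothetical):

```python
# Minimal PSNR computation over saved frames; directory names are hypothetical.
import numpy as np
from PIL import Image
from pathlib import Path

def psnr(a, b, max_val=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

gt_dir = Path("~/data/Jockey").expanduser()        # ground-truth RGB frames
rec_dir = Path("./logs_nvp/jockey_nvp_s/recon")    # saved reconstructions (hypothetical path)

scores = [
    psnr(np.asarray(Image.open(f)), np.asarray(Image.open(rec_dir / f.name)))
    for f in sorted(gt_dir.glob("f*.png"))
]
print(f"mean PSNR: {np.mean(scores):.2f} dB")
```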
Evaluation with compression of parameters using well-known image and video codecs.
- Save the quantized parameters.
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/compression.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --config ./logs_nvp/<EXPERIMENT_NAME>/config_nvp_s.json
- Compress the saved sparse positional image-/video-like features using codecs.
  - Execute `compression.ipynb`.
  - Please change `logging_root` and `experiment_name` in `compression.ipynb` appropriately.
  - One can change `qscale`, `crf`, and `framerate`, which control the compression ratio of the sparse positional features (an illustrative ffmpeg sketch follows these steps). `qscale` ranges from 1 to 31, where larger values mean worse quality (2~5 recommended). `crf` ranges from 0 to 51, where larger values mean worse quality (20~25 recommended). `framerate` is the frame rate used for the video codec (25 or 40 recommended).
- Evaluation with the compressed parameters.
CUDA_VISIBLE_DEVICES=0 python experiment_scripts/eval_compression.py --logging_root ./logs_nvp --experiment_name <EXPERIMENT_NAME> --dataset <DATASET> --num_frames <NUM_FRAMES> --config ./logs_nvp/<EXPERIMENT_NAME>/config_nvp_s.json --qscale 2 3 3 --crf 21 --framerate 25
  - Option `--save` denotes whether to save the reconstructed frames.
  - Please specify the options `--qscale`, `--crf`, and `--framerate` with the same values used in `compression.ipynb`.
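For intuition about the `qscale`, `crf`, and `framerate` settings above, roughly equivalent ffmpeg invocations look as follows (an illustrative sketch only, not the exact commands used in `compression.ipynb`; all file names are hypothetical):

```python
# Illustrative only: JPEG compression of an image-like feature map (qscale)
# and H.264 compression of a video-like feature sequence (crf, framerate).
import subprocess

# Image-like sparse positional feature -> JPEG; -q:v is the 1-31 qscale (lower = better quality).
subprocess.run(["ffmpeg", "-y", "-i", "feature_xy.png",
                "-q:v", "2", "feature_xy.jpg"], check=True)

# Video-like feature sequence -> H.264; CRF 0-51 (lower = better quality) at a fixed framerate.
subprocess.run(["ffmpeg", "-y", "-framerate", "25", "-i", "feature_t_%05d.png",
                "-c:v", "libx264", "-crf", "21", "feature_t.mp4"], check=True)
```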
Reconstructed video results of NVP on UVG-HD, as well as on other 4K/long/temporally-dynamic videos, are available on the following project page.
Our model achieves the following performance on UVG-HD with a single NVIDIA V100 32GB GPU:
| Encoding Time | BPP | PSNR (↑) | FLIP (↓) | LPIPS (↓) |
|---|---|---|---|---|
| ~5 minutes | 0.901 | 34.57 | 0.075 | 0.190 |
| ~10 minutes | 0.901 | 35.79 | 0.065 | 0.160 |
| ~1 hour | 0.901 | 37.61 | 0.052 | 0.145 |
| ~8 hours | 0.210 | 36.46 | 0.067 | 0.135 |
- The reported values are averaged over the Beauty, Bosphorus, Honeybee, Jockey, ReadySetGo, ShakeNDry, and Yachtride videos in UVG-HD and are measured using the LPIPS and FLIP repositories.
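BPP in the table above follows the usual definition, total compressed bits divided by the total number of pixels in the video; a minimal sketch, assuming the compressed representation is a handful of files on disk (file names are hypothetical):

```python
# Bits-per-pixel = total bits of the compressed representation / (W * H * num_frames).
from pathlib import Path

def bpp(files, width=1920, height=1080, num_frames=600):
    total_bits = sum(Path(f).stat().st_size * 8 for f in files)
    return total_bits / (width * height * num_frames)

# Hypothetical compressed artifacts for one 600-frame UVG-HD video.
print(bpp(["feature_xy.jpg", "feature_t.mp4", "mlp_params.bin"]))
```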
One can download the pretrained checkpoints from the following link.
@inproceedings{
kim2022scalable,
title={Scalable Neural Video Representations with Learnable Positional Features},
author={Kim, Subin and Yu, Sihyun and Lee, Jaeho and Shin, Jinwoo},
booktitle={Advances in Neural Information Processing Systems},
year={2022},
}
We used code from the following repositories: SIREN, Modulation, and tiny-cuda-nn.