betarixm/video-super-resolution


Video Super Resolution

CSED451@POSTECH

Introduction

This is a reimplementation of "Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation." It was rewritten based on TensorFlow 2.5, and a simple web application was added. Launch the web application container using the command below.

docker-compose up

Background

With 4K and now 8K, high-quality, high-resolution display technology has been advancing rapidly, and demand for high-quality video has grown explosively along with it. Producing new high-resolution video is one option, but there is also growing demand to watch older low-resolution content, such as films and videos, on high-resolution devices with fine, dense pixels. To meet this demand, video upscaling technology, which raises the resolution of a video, has been developing in step. Traditionally, interpolation algorithms were used to estimate unknown values from neighboring pixel values. However, interpolation only resizes the image and does not improve its quality, so deep-learning-based upscaling methods are now used. We therefore chose video upscaling as our project topic.

Goal

The goal of this project is to build a program that upscales a low-quality video into a high-quality video whose resolution is increased by an integer factor. Many approaches to upscaling have been proposed, from simple interpolation algorithms to deep learning; this project implements deep-learning-based upscaling. We adopted the method of the CVPR paper "Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation" and modified some parameters to help training speed. Beyond implementing the network, we also aimed to build an application that lets people try a service based on it. Because we wanted to offer it as a web page, currently the most common form of application, the plan was to build a simple web application with Flask, one of Python's web frameworks, once the network was implemented and trained. Users upload any video they want to upscale to the website and download the result after upscaling.

Design

Problem Analysis

Super resolution is a branch of graphics that converts low-resolution media into high-resolution media. Because the task itself is challenging, many people are working on it, and it is also considered an interesting topic because it can restore media that was recorded at low resolution or whose quality has been degraded. Previously, super resolution was achieved by filling in pixels with mathematical methods such as bicubic interpolation, but as deep learning has spread across many fields, super resolution has also moved toward deep learning. Super resolution is mainly applied to images and video. A video is a stack of consecutive images, played back at a fixed number of images per second. Super resolution on video is therefore harder than on images, but because the video-based media industry is growing so quickly and the number of use cases has increased greatly, it is expected to be more widely useful than raising image resolution.

Research Directions

Problem Definition

The goal of video super resolution is to obtain a high resolution (HR) frame ${\hat{Y_t}}$ from given low resolution (LR) frames ${X_t}$. With a video upscaling (VU) network $G$ and network parameters $\theta$, the VU problem is defined as follows. $$ \hat{Y_t} = G_{\theta}(X_{t-N:t+N}) $$ Here, $N$ is the temporal radius. The input tensor shape of $G$ is $T \times H \times W \times C$, where $T = 2N + 1$ and $H$ and $W$ are the height and width of the LR frames; $C$ is the number of color channels. The output tensor shape is $1 \times rH \times rW \times C$, where $r$ is the upscaling factor.
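As a quick check of the shape bookkeeping, the following minimal sketch computes the input and output tensor shapes from the definitions above. The function name `vsr_shapes` is hypothetical, and the example values ($N = 3$, a 32x32 LR frame, $r = 4$) are chosen only to match the 7-frame input and the 32x32 to 128x128 patches mentioned elsewhere in this README.

```python
def vsr_shapes(n, h, w, c, r):
    """Return (input_shape, output_shape) for temporal radius n and scale r."""
    t = 2 * n + 1  # number of input frames, T = 2N + 1
    return (t, h, w, c), (1, r * h, r * w, c)


input_shape, output_shape = vsr_shapes(n=3, h=32, w=32, c=3, r=4)
print(input_shape, output_shape)
```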

Network Architecture

To generate $\hat{Y_t}$, the network produces dynamic upsampling filters $F_t$ and a residual $R_t$ from ${X_{t-N:t+N}}$. The input center frame $X_t$ is first upscaled with $F_t$, and $R_t$ is then added to it to produce the final $\hat{Y_t}$.
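This composition can be written compactly. Writing $\tilde{Y_t}$ for the frame obtained by filtering $X_t$ with $F_t$, the output is the sum of the filtered frame and the residual, with both terms having the HR size $rH \times rW$:

$$ \hat{Y_t} = \tilde{Y_t} + R_t $$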

Dynamic Upsampling Filters

The trained network takes ${X_{t-N:t+N}}$ as input and outputs a set $F_t$ of filters of a fixed size ($F_t$ consists of $r^2HW$ filters), which is used to generate the filtered HR frame $\tilde{Y_t}$. Each HR pixel value is obtained by applying local filtering to the LR pixels of the input frame $X_t$ with the filter $F_t^{y,x,v,u}$.

$\tilde{Y_t}(yr+v, xr + u) =\sum_{j=-2}^2\sum_{i=-2}^2F_t^{y,x,v,u}(j+2, i+2)X_t(y+j, x+i)$

Here, $x$ and $y$ are coordinates on the LR grid, and $v$ and $u$ are coordinates within each $r \times r$ output block ($0 \le v, u \le r - 1$).
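The local filtering formula can be sketched directly in NumPy. This is an illustrative, unoptimized version for a single channel, not the project's custom layer: the function name `dynamic_upsample`, the filter layout `(H, W, r, r, 5, 5)`, and the zero padding at the borders are all assumptions made for the example.

```python
import numpy as np


def dynamic_upsample(x, filters, r):
    """Apply per-pixel 5x5 filters to one LR channel.

    x:       (H, W) LR center frame.
    filters: (H, W, r, r, 5, 5); filters[y, x, v, u] plays the role of F_t^{y,x,v,u}.
    returns: (r*H, r*W) filtered HR frame.
    """
    h, w = x.shape
    padded = np.pad(x, 2, mode="constant")  # zero padding is an assumption
    out = np.zeros((r * h, r * w), dtype=np.float64)
    for y in range(h):
        for xx in range(w):
            # 5x5 neighborhood X_t(y+j, x+i) for j, i in [-2, 2]
            patch = padded[y:y + 5, xx:xx + 5]
            # Weighted sum over the window for each of the r*r output positions
            block = np.einsum("vuji,ji->vu", filters[y, xx], patch)
            out[y * r:(y + 1) * r, xx * r:(xx + 1) * r] = block
    return out


# Filters that select only the center pixel reduce to nearest-neighbor upscaling.
x = np.arange(6, dtype=np.float64).reshape(2, 3)
f = np.zeros((2, 3, 2, 2, 5, 5))
f[..., 2, 2] = 1.0
y_tilde = dynamic_upsample(x, f, r=2)
```

Because every output pixel is a weighted sum of at most 25 input pixels, the filtered frame cannot contain detail that is absent from the neighborhood.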

Residual Learning

The result of applying only the dynamic upsampling filters lacks sharpness, because it is merely a weighted sum of input pixels. To address this, a residual image is estimated to add high-frequency detail. Combining the dynamic upsampling process with the residual process achieves both spatial sharpness and temporal consistency in the HR frame.

Temporal Augmentation


Implementation

Dataset

Crawling

To collect videos free of copyright issues, we gathered videos from "Pixabay" that are available for commercial use without attribution, choosing those with dynamic motion. We crawled the site with Python, using keywords such as "Walking", "Animal", "Exercise", and "Transport", and kept only videos that contain dynamic motion and are 8 to 15 seconds long.

Convert to Frame

Using OpenCV, we extracted roughly 320,000 frame images from a total of 670 videos. For the ground truth we used a 128x128 crop from the center of each frame, where motion is assumed to be most visible, and for the LR input we used 32x32 frame images.
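The patch preparation can be sketched with NumPy alone. The README does not say how the 32x32 LR frames were derived from the 128x128 crops, so the 4x4 block averaging below is only a stand-in, and the helper names `center_crop` and `downscale_by_averaging` are hypothetical. The random array stands in for a decoded video frame.

```python
import numpy as np


def center_crop(frame, size=128):
    """Cut a size x size patch from the middle of an (H, W, C) frame."""
    h, w = frame.shape[:2]
    if h < size or w < size:
        raise ValueError("frame is smaller than the crop size")
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]


def downscale_by_averaging(patch, r=4):
    """Average each r x r block (assumed downsampling, not the project's)."""
    h, w, c = patch.shape
    return patch.reshape(h // r, r, w // r, r, c).mean(axis=(1, 3))


frame = np.random.default_rng(0).random((360, 640, 3))  # stand-in for a video frame
hr = center_crop(frame)           # ground truth patch
lr = downscale_by_averaging(hr)   # LR network input
```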

Network Architecture

Network

We built the network with Keras, the high-level API of TensorFlow 2. The main building blocks, the convolutional and batch normalization layers, are the layers provided by Keras, while the local filtering layer is implemented as a custom layer.

Training

We trained the network on the dataset we built, using the shared cluster server of the Department of Computer Science and Engineering with 100 epochs, a batch size of 16, and 2 GPUs. Because no meaningful improvement occurred beyond 100 epochs, we capped training at 100 epochs and adjusted the delta of the Huber loss and the warmup parameter of the cosine decay in small steps, aiming to reduce the validation loss.

Web Application

We built a web application so that real users can try the network at production level. Because the goal was a simple page that integrates smoothly with the existing code, we built the application with Flask, a lightweight Python web framework, and containerized it with Docker so that anyone can deploy it easily.

Upload

When a user opens the web page, they first see a form for uploading a file. When the user uploads a file, the server stores it in that user's session space, so that users cannot view each other's files. Once the file is stored, the user can see it in the file list on the page and request a conversion by clicking the icon next to it.
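One way to keep uploads separated per session is to give every session its own directory. The sketch below uses only the standard library and is not the project's Flask code: the helper names, the 32-character hexadecimal session id, and the temporary base directory are all assumptions made for illustration.

```python
import re
import tempfile
import uuid
from pathlib import Path


def new_session_id():
    """Create a random, hard-to-guess session identifier."""
    return uuid.uuid4().hex


def session_dir(base, session_id):
    """Return (and create) the private directory for one session."""
    if not re.fullmatch(r"[0-9a-f]{32}", session_id):
        raise ValueError("unexpected session id")
    path = Path(base) / session_id
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_upload(base, session_id, filename, data):
    """Store uploaded bytes in the session directory, dropping any path parts."""
    safe_name = Path(filename).name  # "../clip.mp4" becomes "clip.mp4"
    if safe_name in ("", ".", ".."):
        raise ValueError("invalid file name")
    target = session_dir(base, session_id) / safe_name
    target.write_bytes(data)
    return target


def list_uploads(base, session_id):
    """List only the files that belong to the given session."""
    return sorted(p.name for p in session_dir(base, session_id).iterdir())


base = tempfile.mkdtemp()
alice, bob = new_session_id(), new_session_id()
save_upload(base, alice, "../clip.mp4", b"fake video bytes")
```

Since the file list is always built from the session's own directory, a second session sees an empty list even though both share the same base directory.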

Convert

When the user requests a conversion, the request is handled asynchronously: a uWSGI thread runs the video preprocessing and upscaling in the background, while the user is told right away that processing is under way. In the background, the uploaded video is first preprocessed. Because the network implemented in this project takes 128x128 images as input, the video is split into frames, and each frame is further split into 128x128 tiles. Each tile is passed through the network, the resulting tiles are merged back into a frame, and the frames are merged back into a video. When the result is ready, an icon on the web page indicates this to the user, who can click it to download the result.
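The split-and-merge step can be sketched as follows. This is a simplified single-frame illustration rather than the project's pipeline: `fake_upscale` is a nearest-neighbor placeholder for the trained network, the zero padding of frames whose size is not a multiple of 128 is an assumption, and the temporal window of neighboring frames is ignored.

```python
import numpy as np


def split_tiles(frame, tile=128):
    """Zero-pad an (H, W, C) frame to a multiple of `tile` and cut it into tiles."""
    h, w, c = frame.shape
    pad_h, pad_w = -h % tile, -w % tile
    padded = np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)))
    rows, cols = padded.shape[0] // tile, padded.shape[1] // tile
    tiles = padded.reshape(rows, tile, cols, tile, c).swapaxes(1, 2)
    return tiles, (h, w)  # tiles has shape (rows, cols, tile, tile, C)


def merge_tiles(tiles, size, r=1):
    """Stitch tiles back together and crop away the padding (scaled by r)."""
    rows, cols, th, tw, c = tiles.shape
    full = tiles.swapaxes(1, 2).reshape(rows * th, cols * tw, c)
    return full[:size[0] * r, :size[1] * r]


def fake_upscale(tile, r=4):
    """Placeholder for the trained network: nearest-neighbor upscaling."""
    return tile.repeat(r, axis=0).repeat(r, axis=1)


frame = np.random.default_rng(0).random((240, 320, 3))
tiles, size = split_tiles(frame)
upscaled = np.array([[fake_upscale(t) for t in row] for row in tiles])
result = merge_tiles(upscaled, size, r=4)
```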

Result

[Table 1: Input (256 x 256) frame beside Output (1024 x 1024) frame]

The first table shows a frame of a 256x256 video upscaled to 1024x1024, that is, 4x per side or 16x the pixel count. Cropping part of a frame from the input and output videos and comparing them shows a visible quality improvement in the output. When training finished, the final loss on the validation set was 0.00435, with the following hyperparameters.

  • Cosine Decay
    • Initial Learning Rate: 0.005
    • First Decay Steps: 70
  • Huber loss
    • $\delta$: 200.0
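To make these hyperparameters concrete, here is a plain-Python sketch of a cosine decay schedule with restarts and of the Huber loss for a single error value. It is an illustration, not the training code: the restart multiplier `t_mul=2.0` is an assumed value, and whether a "step" means an epoch or an iteration is not stated in this README.

```python
import math


def cosine_decay_restarts(step, initial_lr=0.005, first_decay_steps=70, t_mul=2.0):
    """SGDR-style learning rate: cosine decay that restarts after each period."""
    period, start = first_decay_steps, 0
    while step >= start + period:  # find the period that contains `step`
        start += period
        period = int(period * t_mul)  # each period is t_mul times longer (assumed)
    progress = (step - start) / period
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def huber(error, delta=200.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error * error
    return delta * (a - 0.5 * delta)


lr_start = cosine_decay_restarts(0)
lr_mid = cosine_decay_restarts(35)
lr_restart = cosine_decay_restarts(70)
```

If frame values are normalized to a small range such as [0, 1], every error is far below $\delta = 200$, so the loss stays in its quadratic branch and behaves like half the squared error.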

Discussion

Motion Area identification

When building its dataset, the paper did not use video frames as they are but cropped "areas where motion is sufficiently visible" and used only those parts. We wanted to build our dataset the same way, but checking more than 300 videos one by one to decide the crop position was not feasible. Instead, we assumed that object motion is generally most active in the middle of the frame and cropped the center. As a result, some videos did not contain enough motion in the crop, and their data did not help training. The paper obtained sufficient results with 300 videos, but this project needed more data, so we crawled 300 additional videos to build the final dataset. With more thought on how to identify such motion areas, better performance should be achievable even with a smaller dataset.

Speed

It takes about 6 minutes to upscale a 10-second 240p video by a factor of 4. Asking general viewers to run this process themselves is not appealing. With current technology, therefore, the reasonable business plan appears to be for content providers to upscale their own videos and then deliver them to the public. We discussed the following method to speed up video super resolution.

The paper relies on what it calls "data augmentation" to perform video super resolution: 7 frames are fed in to exploit temporal consistency, so upscaling a 100-frame video requires computation over about 700 frames. To reduce the total number of frames computed, keyframes are chosen among the 100 frames, and only those use the full 7-frame input. The remaining frames use fewer input frames, and a residual computed from the interpolation of the keyframes is added to make up for the temporal information they lack. We expect this to reduce computation while limiting the resulting loss in quality.
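The potential saving can be estimated with simple arithmetic. In this sketch the keyframe interval of 10 and the reduced window of 3 frames are hypothetical values picked for illustration, and the cost of interpolating keyframes and computing the extra residual is ignored.

```python
def frames_processed(num_frames, window=7, key_interval=None, reduced_window=3):
    """Count input frames fed to the network over a whole video."""
    if key_interval is None:
        # Baseline: every output frame uses the full temporal window.
        return num_frames * window
    keyframes = len(range(0, num_frames, key_interval))
    others = num_frames - keyframes
    return keyframes * window + others * reduced_window


baseline = frames_processed(100)
with_keyframes = frames_processed(100, key_interval=10)
print(baseline, with_keyframes)
```

Under these assumed values the count drops from 700 to 340 frames, roughly half the baseline.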

Category Specific Training

Videos come in many kinds, such as animation, natural scenery, and architecture, and each kind shows different textures and motion. We would like to check whether performance improves when videos are classified by category and the network is trained separately on each category's dataset. If the improvement is meaningful, it should be possible to offer better-quality video super resolution matched to the kind of video the user wants to enhance.

References

  • Jo, Y., Oh, S. W., Kang, J., & Kim, S. J. (2018). Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3224-3232).
  • Jo, Y., Oh, S. W., Kang, J., & Kim, S. J. (2019). VSR-DUF. https://github.com/yhjo09/VSR-DUF
    • Some code from the source is included in this project.