
Adds video reading / saving functionalities #1039

Merged: 15 commits into pytorch:master on Jul 2, 2019

Conversation

@fmassa (Member) commented Jun 21, 2019

This PR introduces functions for reading from and writing to video files, with support for audio streams.

It uses PyAV internally and can decode pretty much any video that FFmpeg can decode.

The functions currently expose a very simple API; we may extend it later.
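For illustration, a rough usage sketch of the two new functions (file names are hypothetical; the two-value read_video return shown here reflects the PR as initially submitted, and an extra stats value is discussed further down this thread):

```python
from torchvision.io import read_video, write_video

# Decode all video and audio frames from a file (hypothetical path).
vframes, aframes = read_video("input.mp4")

# Re-encode the decoded frames into a new file at 30 fps.
write_video("output.mp4", vframes, fps=30)
```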

cc @bjuncek @stephenyan1231 for review

@bjuncek (Contributor) left a comment

Looks good to me;
let's chat offline about whether we want to add the mentioned functionality to the initial release.

test/test_io.py (review thread resolved; outdated)
return aframes[:, s_idx:e_idx]


def read_video(filename, start_pts=0, end_pts=math.inf):
@bjuncek (Contributor) commented Jun 21, 2019

Probably not a priority, but would it be possible to add the ability to read with a stride?

So if your data is at 30fps but the models are trained on videos at 15fps (see https://github.com/facebookresearch/VMZ), having this would let the end user avoid re-encoding the data.

That said, I suppose you could do the same in the dataloader by striding the tensor, so maybe it's not crucial.

(note, I have this in the experimental repo)
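A minimal sketch of the dataloader-side striding alternative mentioned above (the path, frame rates, and return signature are assumptions, not part of this comment):

```python
from torchvision.io import read_video

# Decode a clip assumed to be encoded at 30 fps (hypothetical path).
vframes, aframes = read_video("clip_30fps.mp4")

# vframes is assumed to be a (T, H, W, C) uint8 tensor; keeping every second
# frame approximates 15 fps without re-encoding the underlying file.
vframes_15fps = vframes[::2]
```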

@fmassa (Member, Author) replied

Great question.

The fps of a given video is currently not exposed to the user.

I see two options:

  • provide the fps somewhere, either by returning it in this function or by a separate helper function
  • always enforce that the returned videos follow a particular fps. This might require some expensive post-processing / frame interpolation, but at least the temporal video data would be consistent.

Thoughts?

@bjuncek (Contributor) replied

I feel like the two options are not (necessarily) mutually exclusive?

  1. "provide the fps somewhere, either by returning it in this function or by a separate helper function"
    This would be incredibly useful as that has been the main reproducibility issue of the formerly mentioned repo -- relatively non-standard encoding. Maybe we could (following torchaudio example) have read_video() return (video, audio), stats where stats would be a list/dict/tensor with video fps and audio sampling rate? It should be a part of the stream codec as VideoCodecContext.framerate.

  2. "always enforce that the videos that are returned follow a particular fps. This might imply doing some expensive postprocessing / frame interpolation maybe?"
    This might be very slow could cause issues/artefacts within the video. We could leave that one out for future releases? Video frame sampling rate could be a relatively straightforward way to get a similar functionality that would allow some control without the need for expensive operations.
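As a rough sketch of option 1, the relevant metadata can be read straight from the PyAV streams (this is not the PR's API; the path and dict keys below are illustrative):

```python
import av  # PyAV, the decoding backend used by this PR

container = av.open("video.mp4")  # hypothetical path
video_fps = float(container.streams.video[0].average_rate)  # e.g. 30.0
audio_sr = container.streams.audio[0].rate                  # e.g. 44100
container.close()

# A stats dict of the kind discussed above (key names illustrative).
stats = {"video_fps": video_fps, "audio_fps": audio_sr}
print(stats)
```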

container.close()


def _read_from_stream(container, start_offset, end_offset, stream, stream_name):
@bjuncek (Contributor) commented Jun 21, 2019

Do we want to add an option to resample audio to a specific sample rate on the fly?

(note, I have this in the experimental repo)

@fmassa (Member, Author) replied

Great question again.

I'm inclined to always return the audio at a fixed frequency (say 44kHz), so that the results are always consistent.

Thoughts?

@bjuncek (Contributor) replied

"I'm inclined to always return the audio at a fixed frequency"

Definitely, but I think the exact value should be left for the user to decide.
I feel like a simple "if sr != user_defined_sr, call a resampler" check should be enough?
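A self-contained sketch of that check using torchaudio as the resampler (torchaudio is not part of this PR; all names and values below are hypothetical):

```python
import torch
import torchaudio

sr = 48000                    # sample rate reported for the decoded audio
user_defined_sr = 16000       # sample rate the user actually wants
aframes = torch.randn(2, sr)  # stand-in for one second of decoded stereo audio

# Resample only when the decoded rate differs from the requested one.
if sr != user_defined_sr:
    resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=user_defined_sr)
    aframes = resample(aframes)
```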

@fmassa (Member, Author) replied

I think a reading function should be as simple as possible, with additional transforms for resampling the audio / video if needed.

But you have great points about the stats that should be returned.

I'll modify the implementation to return a third value: a dict with the fps etc.
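For illustration, a sketch of how the extended return could look to a caller (the path and dict keys are assumptions at this point in the thread):

```python
import math
from torchvision.io import read_video

# Decode the whole clip; the third return value is the new stats dict.
vframes, aframes, info = read_video("clip.mp4", start_pts=0, end_pts=math.inf)

print(vframes.shape)  # (T, H, W, C) uint8 video frames
print(aframes.shape)  # (channels, samples) audio samples
print(info)           # e.g. {"video_fps": 30.0, "audio_fps": 44100}
```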

@fmassa mentioned this pull request Jun 24, 2019
@codecov-io commented Jun 26, 2019

Codecov Report

Merging #1039 into master will increase coverage by 0.13%.
The diff coverage is 71.27%.


@@            Coverage Diff             @@
##           master    #1039      +/-   ##
==========================================
+ Coverage    63.9%   64.04%   +0.13%     
==========================================
  Files          66       68       +2     
  Lines        5275     5373      +98     
  Branches      793      814      +21     
==========================================
+ Hits         3371     3441      +70     
- Misses       1673     1693      +20     
- Partials      231      239       +8
Impacted Files                          Coverage Δ
torchvision/__init__.py                 70.58% <100%> (+1.83%) ⬆️
torchvision/io/__init__.py              100% <100%> (ø)
torchvision/io/video.py                 70.32% <70.32%> (ø)
torchvision/transforms/transforms.py    82.12% <0%> (-0.61%) ⬇️
torchvision/models/resnet.py            88.27% <0%> (+0.45%) ⬆️
torchvision/datasets/cityscapes.py      20.58% <0%> (+0.58%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e42e80...38596b0.

@fmassa merged commit d293c4c into pytorch:master Jul 2, 2019
@fmassa (Member, Author) commented Jul 2, 2019

Approved by @stephenyan1231 and @bjuncek
