cleanup download utils #8273

Merged
2 commits merged on Feb 15, 2024

22 changes: 5 additions & 17 deletions torchvision/datasets/utils.py
@@ -12,7 +12,7 @@
import urllib.error
import urllib.request
import zipfile
-from typing import Any, Callable, Dict, IO, Iterable, Iterator, List, Optional, Tuple, TypeVar, Union
+from typing import Any, Callable, Dict, IO, Iterable, List, Optional, Tuple, TypeVar, Union
from urllib.parse import urlparse

import numpy as np
@@ -24,24 +24,12 @@
USER_AGENT = "pytorch/vision"


-def _save_response_content(
-    content: Iterator[bytes],
-    destination: Union[str, pathlib.Path],
-    length: Optional[int] = None,
-) -> None:
-    with open(destination, "wb") as fh, tqdm(total=length) as pbar:
-        for chunk in content:
-            # filter out keep-alive new chunks
-            if not chunk:
-                continue
Member:

The new changes are only equivalent to the old ones if this continue was meant to be a break. Was this meant to be a break? In other words, can there be an empty chunk and then a non-empty chunk?

Collaborator Author:

TL;DR: they are equivalent, since content can now only be one type of iterator, for which the branch will never trigger.

These keep-alive chunks are b"". Content iterators coming from requests could actually contain them. However, the content iterator coming from our _urlretrieve function cannot.

The way iter works when you give it two arguments is quite unintuitive. Basically, what iter(lambda: response.read(chunk_size), b"") does is call the lambda repeatedly until it returns the sentinel b"". When that happens, StopIteration is raised and the loop ends. Meaning:

for chunk in iter(lambda: response.read(chunk_size), b""):
    if not chunk:
        # this will never be reached
        ...

Collaborator Author:

To answer your question directly:

> In other words, can there be an empty chunk and then a non-empty chunk?

No.

items = [b"foo", b"", b"bar"]
idx = 0


def fn():
    global idx

    item = items[idx]
    idx += 1
    return item


for chunk in iter(fn, b""):
    print(chunk)
This prints only:

b'foo'

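To make the distinction concrete, here is a minimal, self-contained sketch (not taken from the PR; the items list is a made-up stand-in for a chunked response) contrasting the two iteration styles: iterating a sequence directly yields empty keep-alive chunks, while the two-argument iter() form treats the first empty chunk as the stop signal, so the old filter branch could never run for _urlretrieve's iterator.

# Made-up chunks; a requests-style content iterator could yield the empty
# keep-alive chunk b"" in the middle of the stream.
items = [b"foo", b"", b"bar"]

# Iterating the sequence directly yields every chunk, including b"";
# this is the case the old `continue` was there to skip.
print(list(items))  # [b'foo', b'', b'bar']

# iter(callable, sentinel) instead stops at the first b"", so an empty
# chunk never reaches the loop body.
it = iter(items)
print(list(iter(lambda: next(it), b"")))  # [b'foo']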

-            fh.write(chunk)
-            pbar.update(len(chunk))


def _urlretrieve(url: str, filename: Union[str, pathlib.Path], chunk_size: int = 1024 * 32) -> None:
    with urllib.request.urlopen(urllib.request.Request(url, headers={"User-Agent": USER_AGENT})) as response:
-        _save_response_content(iter(lambda: response.read(chunk_size), b""), filename, length=response.length)
+        with open(filename, "wb") as fh, tqdm(total=response.length) as pbar:
+            while chunk := response.read(chunk_size):
+                fh.write(chunk)
+                pbar.update(len(chunk))


def calculate_md5(fpath: Union[str, pathlib.Path], chunk_size: int = 1024 * 1024) -> str:
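As a quick sanity check on the new loop, here is a standalone sketch (io.BytesIO stands in for the urlopen response; the 100-byte payload and chunk size are arbitrary values for the example): read() returns b"" once the stream is exhausted, which is falsy, so the walrus loop stops without ever writing an empty chunk.

import io

# io.BytesIO plays the role of the HTTP response object; the payload and
# chunk size are arbitrary values chosen for the example.
response = io.BytesIO(b"x" * 100)
chunk_size = 32

chunks = []
while chunk := response.read(chunk_size):
    chunks.append(chunk)

# read() returned b"" after the last partial chunk, ending the loop.
print([len(c) for c in chunks])  # [32, 32, 32, 4]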