Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When iterating over Zipfiles, always use the Unix file separator to fix a Windows issue #638

Merged
17 changes: 11 additions & 6 deletions pex/third_party/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -100,9 +100,13 @@ def containing(cls, root):
prefix = ''
path = root
while path:
# We use '/' here because the zip file format spec specifies that paths must use
# forward slashes. See section 4.4.17 of
# https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT.
if zipfile.is_zipfile(path):
return cls(zipfile_path=path, prefix=prefix + os.sep if prefix else '')
prefix = os.path.join(prefix, os.path.basename(path))
return cls(zipfile_path=path, prefix='{}/'.format(prefix) if prefix else '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any guarantee prefix will never already have os.sep in it? That's an earnest question - I couldn't figure it out from a very quick search for _ZipIterator.

Either way, for consistency and safety, might be better to change this line to this idiom:

return cls(zipfile_path=path, prefix='' if not prefix else '{}{}'.format(prefix, os.sep).replace(os.sep, '/')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any guarantee prefix will never already have os.sep in it?

I think this can be guaranteed because prefix starts off as '' outside the while loop, and only contains parts constructed with os.path.basename. os.path.basename takes the last path element, so it is equivalent to arg.split(os.sep)[-1], which will mean the result won't contain the os.sep character.

That said, there is something that doesn't make sense to me about this while loop. This threw me off at first, but I believe the while loop is repeatedly walking up root directory by directory until it hits a zip file. E.g ignoring prefix for a minute, the loop boils down to:

while path:
  if zipfile.is_zipfile(path):
    return
  # if path = foo/bar/baz, then os.path.dirname(path) = foo/bar
  path = os.path.dirname(path)

Given that, I think the prefix is actually backwards. If we go into the loop with foo.zip/bar/baz, assuming foo.zip is the zip file, prefix will become baz/bar, because after the first iteration, prefix will be basename of the path (which is baz), and the second iteration joins that with basename of the path again (the original was prefix = os.path.join(prefix, os.path.basename(path)), so os.path.join('baz', os.path.basename('foo.zip/bar')) = 'baz/bar'). Am I missing something with my logic here?

This must be working because the tests pass, but I wonder if this is actually an uncaught bug? I haven't had a chance to walk through all the test cases to know if there is any test case for this, or if the caller of _ZipIterator somehow guarantees this loop only goes one iteration deep.

Also this loop is an infinite loop if root is ever an absolute path that does not contain a zip file, because os.path.dirname('/') == '/', so it will never be falsy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about the bug - and it skates by because items are exactly one level deep. The prefix = os.path.join(prefix, os.path.basename(path)) line should be prefix = os.path.join(os.path.basename(path), prefix).

As far as the infinite loop goes, the one call of containing ensures root is not a dir (but not that it's not a file). It probably makes sense to either guard the infinite loop case or else assert early that root is either a zipfile or else not a file and not a directory (foo.zip/bar/baz is not either of these).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great investigation @yorinasub17 and thanks @jsirois for confirming.

I think it'd be best to save this bug fix for a separate PR, as this one is focused on Windows support. If you're up for it, that'd be great for you to open this PR. No worries if you're busy - we can do it otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the confirmation. I made the fix for the prefix here, but have not addressed the infinite loop.

I won't be able to address it today though. I'll check back again when I have time and if it still hasn't been fixed yet, I'll open a PR.

path_base = os.path.basename(path)
prefix = '{}/{}'.format(path_base, prefix) if prefix else path_base
path = os.path.dirname(path)
raise ValueError('Could not find the zip file housing {}'.format(root))

Expand All @@ -116,10 +120,11 @@ def iter_root_packages(self, relpath):
yield package

def _filter_names(self, relpath, pattern, group):
pat = re.compile(r'^{prefix}{pattern}$'
.format(prefix=self.prefix + ((relpath + os.sep) if relpath else ''),
pattern=pattern))

# We use '/' here because the zip file format spec specifies that paths must use
# forward slashes. See section 4.4.17 of
# https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT.
relpath_pat = '' if not relpath else '{}/'.format(relpath.replace(os.sep, '/'))
pat = re.compile(r'^{}{}{}$'.format(self.prefix, relpath_pat, pattern))
with contextlib.closing(zipfile.ZipFile(self.zipfile_path)) as zf:
for name in zf.namelist():
match = pat.match(name)
Expand Down