Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.Path.open is slow when opening for writing in a ZIP with many entries #126565

Closed
janhicken opened this issue Nov 8, 2024 · 4 comments
Closed
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@janhicken
Copy link
Contributor

janhicken commented Nov 8, 2024

Bug report

Bug description:

I'm using the zipfile module to create a ZIP file with thousands of members. Each file is created by creating a corresponding zipfile.Path object first and then calling .open() on it.

The implementation of open() contains a check whether the file already exists when opening a file in read mode:

if not self.exists() and zip_mode == 'r':
    raise FileNotFoundError(self)

However, self.exists() is called even in write mode, because it is the and operator's first argument.
The call to self.exists() is quite slow however, because this requires the ZIP file's .namelist() to be computed, which in turn requires to compute all the implied directories. After all, this is the reason why the FastLookup optimization exists for ZIP files in read mode.

I found this issue after profiling my application and was surprised that more than half of its execution time was spent in computing implied ZIP directories.

I would propose to swap the and arguments so the check looks like this:

if zip_mode == "r" and not self.exists():
    raise FileNotFoundError(self)

This will cause the self.exists() check to only be run in read mode, where the check is fast because of FastLookup anyway.

I'm happy to provide a pull request with this change.

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux, macOS

Linked PRs

@janhicken janhicken added the type-bug An unexpected behavior, bug, or error label Nov 8, 2024
@picnixz picnixz added the stdlib Python modules in the Lib dir label Nov 8, 2024
@picnixz
Copy link
Member

picnixz commented Nov 8, 2024

I'd say such PR would be welcomed and you could also check if other tests need that kind of optimization. Let me confirm with a core dev: cc @jaraco

janhicken added a commit to janhicken/cpython that referenced this issue Nov 8, 2024
When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
janhicken added a commit to janhicken/cpython that referenced this issue Nov 8, 2024
When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
@mdboom mdboom added the performance Performance or resource usage label Nov 8, 2024
jaraco pushed a commit that referenced this issue Nov 10, 2024
When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 10, 2024
…onGH-126576)

When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
(cherry picked from commit 160758a)

Co-authored-by: Jan Hicken <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 10, 2024
…onGH-126576)

When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
(cherry picked from commit 160758a)

Co-authored-by: Jan Hicken <[email protected]>
@picnixz picnixz added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Nov 10, 2024
@jaraco
Copy link
Member

jaraco commented Nov 10, 2024

This fix is also available on older Pythons with zipp 3.21.0.

jaraco pushed a commit that referenced this issue Nov 10, 2024
…126576) (#126643)

gh-126565: Skip `zipfile.Path.exists` check in write mode (GH-126576)

When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
(cherry picked from commit 160758a)

Co-authored-by: Jan Hicken <[email protected]>
jaraco pushed a commit that referenced this issue Nov 10, 2024
…126576) (#126642)

gh-126565: Skip `zipfile.Path.exists` check in write mode (GH-126576)

When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
(cherry picked from commit 160758a)

Co-authored-by: Jan Hicken <[email protected]>
@picnixz
Copy link
Member

picnixz commented Nov 10, 2024

Closing since merged and backported. Thank you all.

@gpshead
Copy link
Member

gpshead commented Nov 10, 2024

thanks for finding and fixing this!

picnixz pushed a commit to picnixz/cpython that referenced this issue Dec 8, 2024
…on#126576)

When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…on#126576)

When `zipfile.Path.open` is called, the implementation will check
whether the path already exists in the ZIP file. However, this check is
only required when the ZIP file is in read mode. By swapping arguments
of the `and` operator, the short-circuiting will prevent the check from
being run in write mode.

This change will improve the performance of `open()`, because checking
whether a file exists is slow in write mode, especially when the archive
has many members.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

5 participants