zipfile.Path.open is slow when opening for writing in a ZIP with many entries #126565

janhicken · 2024-11-08T08:44:42Z

Bug report

Bug description:

I'm using the zipfile module to create a ZIP file with thousands of members. Each member is created by first constructing a corresponding zipfile.Path object and then calling .open() on it.

The implementation of open() contains a check whether the file already exists when opening a file in read mode:

if not self.exists() and zip_mode == 'r':
    raise FileNotFoundError(self)

However, self.exists() is called even in write mode, because it is the first operand of the and operator.
The call to self.exists() is quite slow in write mode, because it requires the ZIP file's .namelist() to be computed, which in turn requires computing all the implied directories. This is, after all, the reason the FastLookup optimization exists for ZIP files in read mode.
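
To illustrate why this is costly, here is a simplified sketch of what computing implied directories involves. It is not the CPython implementation, and the function name implied_dirs is hypothetical. The point is only that each call has to walk every member name, so running it once per written member makes the total work grow roughly quadratically with the number of members.

```python
import posixpath


def implied_dirs(names):
    """Return directory entries implied by `names` but not listed explicitly.

    Simplified illustration only; hypothetical helper, not CPython's code.
    """
    seen = set(names)
    result = []
    for name in names:
        # Walk up from the member's parent to the archive root.
        parent = posixpath.dirname(name.rstrip("/"))
        while parent:
            directory = parent + "/"
            if directory not in seen:
                seen.add(directory)
                result.append(directory)
            parent = posixpath.dirname(parent)
    return result
```

For example, the names "a/b/c.txt" and "a/d.txt" imply the directories "a/b/" and "a/", neither of which has to be stored in the archive explicitly.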

I found this issue after profiling my application and was surprised that more than half of its execution time was spent in computing implied ZIP directories.

I propose swapping the operands of and so that the check looks like this:

if zip_mode == "r" and not self.exists():
    raise FileNotFoundError(self)

This will cause the self.exists() check to only be run in read mode, where the check is fast because of FastLookup anyway.
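
The effect of the operand order can be shown with a small stand-in object. CountingPath, check_original, and check_swapped are hypothetical names used only for this illustration; they mimic the two conditions above and count how often exists() is evaluated.

```python
class CountingPath:
    """Hypothetical stand-in for zipfile.Path that counts exists() calls."""

    def __init__(self):
        self.exists_calls = 0

    def exists(self):
        self.exists_calls += 1
        return False


def check_original(path, zip_mode):
    # Original operand order: exists() is the left operand, so it is
    # evaluated in every mode.
    return not path.exists() and zip_mode == "r"


def check_swapped(path, zip_mode):
    # Proposed order: in write mode the left operand is False, so the
    # `and` short-circuits and exists() is never evaluated.
    return zip_mode == "r" and not path.exists()
```

Both functions return the same truth value for every input; only the number of exists() calls in non-read modes differs.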

I'm happy to provide a pull request with this change.

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux, macOS

Linked PRs

#126576
#126642
#126643

picnixz · 2024-11-08T09:25:25Z

I'd say such a PR would be welcome, and you could also check whether other tests need that kind of optimization. Let me confirm with a core dev: cc @jaraco

When `zipfile.Path.open` is called, the implementation will check whether the path already exists in the ZIP file. However, this check is only required when the ZIP file is in read mode. By swapping arguments of the `and` operator, the short-circuiting will prevent the check from being run in write mode. This change will improve the performance of `open()`, because checking whether a file exists is slow in write mode, especially when the archive has many members.

…onGH-126576) (cherry picked from commit 160758a) Co-authored-by: Jan Hicken <[email protected]>

jaraco · 2024-11-10T15:04:26Z

This fix is also available on older Pythons with zipp 3.21.0.

picnixz · 2024-11-10T15:25:06Z

Closing since merged and backported. Thank you all.

gpshead · 2024-11-10T19:01:27Z

thanks for finding and fixing this!

janhicken added the type-bug (An unexpected behavior, bug, or error) label Nov 8, 2024

picnixz added the stdlib (Python modules in the Lib dir) label Nov 8, 2024

picnixz added this to Zipfile issues Nov 8, 2024

bedevere-app bot mentioned this issue Nov 8, 2024

gh-126565: Skip zipfile.Path.exists check in write mode #126576

Merged

mdboom added the performance (Performance or resource usage) label Nov 8, 2024

This was referenced Nov 10, 2024

[3.13] gh-126565: Skip zipfile.Path.exists check in write mode (GH-126576) #126642

Merged

[3.12] gh-126565: Skip zipfile.Path.exists check in write mode (GH-126576) #126643

Merged

picnixz added the 3.12 (bugs and security fixes), 3.13 (bugs and security fixes), and 3.14 (new features, bugs and security fixes) labels Nov 10, 2024

picnixz closed this as completed Nov 10, 2024

github-project-automation bot moved this to Done in Zipfile issues Nov 10, 2024

github-actions bot mentioned this issue Dec 1, 2024

Monthly issue metrics report hugovk/test#88

Closed
