parallel append broken with ProcessSynchronizer #2077
The problem seems to be that the cached metadata is not updated after the shape is resized in another thread/process, leading to dropped rows. I found two workarounds:
def fixed_append(arr, data, axis=0):
    def fixed_append_nosync(data, axis=0):
        # Reload array metadata from the store so the cached shape reflects
        # any resize performed by another thread/process.
        arr._load_metadata_nosync()
        return arr._append_nosync(data, axis=axis)
    # _write_op acquires the synchronizer's lock on the array metadata key
    # before running the append.
    return arr._write_op(fixed_append_nosync, data, axis=axis)
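As a rough usage sketch (the path, synchronizer directory, and array shape below are hypothetical), the helper is called in place of `arr.append`:

```python
import numpy as np
import zarr

# Hypothetical persistent array opened for appending with a process lock.
arr = zarr.open_array(
    "data.zarr", mode="a", shape=(0, 3), chunks=(100, 3), dtype="f8",
    synchronizer=zarr.ProcessSynchronizer("data.sync"),
)

batch = np.zeros((10, 3))
fixed_append(arr, batch)  # instead of arr.append(batch)
```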
Perhaps the default value for `cache_metadata` should be changed. I believe this also resolves the related StackOverflow questions.
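A sketch of opening an array with the metadata cache disabled, assuming zarr v2's `cache_metadata` option (the paths below are hypothetical):

```python
import zarr

# With cache_metadata=False, the array re-reads its metadata from the store
# before data access and modification operations, at the cost of extra reads.
arr = zarr.open_array(
    "data.zarr",
    mode="a",
    cache_metadata=False,
    synchronizer=zarr.ProcessSynchronizer("data.sync"),
)
```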
Oddly enough, both workarounds fail when working with in-memory zarr arrays.
@jasonkena - thanks for the report. Your diagnosis seems correct, but I'm not sure what we want to do about it. It's quite expensive to always reload metadata to protect against metadata modifications by another writer. Finally, I should note that we haven't settled on whether or not to keep the synchronizer API around for the 3.0 release (it is not currently included).
Is there any data you can share to support the bolded claim?
@zoj613 - not any specific data, but if every chunk I/O op requires first checking whether the metadata has changed, you can imagine how this would be expensive. In my view, the bigger issue is actually around consistency. One of the design tradeoffs in Zarr is that by splitting the dataset into many objects/files, you can act concurrently on individual components. However, the cost of this is that updates among multiple writers have to be coordinated by the writers themselves. (You might be interested in reading Consistency Problems with Zarr in the Arraylake documentation.)
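A minimal sketch of the stale-metadata behaviour being discussed, assuming two independent zarr v2 handles onto the same on-disk array (the path is hypothetical):

```python
import numpy as np
import zarr

# Two handles onto the same on-disk array, standing in for two writers.
a = zarr.open_array("example.zarr", mode="w", shape=(0, 3), chunks=(10, 3), dtype="f8")
b = zarr.open_array("example.zarr", mode="a")

a.append(np.zeros((10, 3)))  # on-disk shape becomes (10, 3)
print(b.shape)               # b still reports its cached shape, (0, 3)

b.append(np.ones((10, 3)))   # b resizes and writes based on the stale shape,
                             # overwriting a's rows instead of extending the array
print(zarr.open_array("example.zarr", mode="r").shape)  # (10, 3), not (20, 3)
```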
Zarr version
v2.18.2
Numcodecs version
v0.13.0
Python Version
3.10.11
Operating System
Linux
Installation
pip
Description
Appending to zarr arrays is not safe, even with ProcessSynchronizer.
Steps to reproduce
Code:
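A minimal sketch of this kind of parallel append with a `ProcessSynchronizer` (the paths, shapes, and worker counts here are hypothetical, and the row loss may take a few runs to surface):

```python
import numpy as np
import zarr
from concurrent.futures import ProcessPoolExecutor

PATH, SYNC_PATH = "repro.zarr", "repro.sync"  # hypothetical paths

def append_rows(i):
    # Each task opens the array with a ProcessSynchronizer and appends 10 rows.
    arr = zarr.open_array(
        PATH, mode="a", synchronizer=zarr.ProcessSynchronizer(SYNC_PATH)
    )
    arr.append(np.full((10, 3), i, dtype="f8"))

if __name__ == "__main__":
    zarr.open_array(PATH, mode="w", shape=(0, 3), chunks=(10, 3), dtype="f8")
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(append_rows, range(20)))
    # Expect 20 * 10 = 200 rows; anything less means appends were dropped.
    print(zarr.open_array(PATH, mode="r").shape)
```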
Output:
Additional output
No response