Ensure parents are created when creating a node #2262
This updates our Array and Group creation methods to ensure that parents implicitly defined through a nested path are also created. To accomplish this semi-safely and efficiently, we require a new setdefault method on the Store class.
@@ -98,6 +98,9 @@ async def set(self, value: Buffer, byte_range: ByteRangeRequest | None = None) -
     async def delete(self) -> None:
         del self.shard_dict[self.chunk_coords]
+
+    async def setdefault(self, default: Buffer) -> None:
+        self.shard_dict.setdefault(self.chunk_coords, default)
I don't think this is actually tested anywhere. I'm not 100% sure, but I think all the Group / Array Metadata creation methods will be using `StorePath` as their `ByteSetter`.
@@ -603,9 +609,24 @@ async def getitem(
         )
         return await self._get_selection(indexer, prototype=prototype)

-    async def _save_metadata(self, metadata: ArrayMetadata) -> None:
+    async def _save_metadata(self, metadata: ArrayMetadata, ensure_parents: bool = False) -> None:
The new keyword ensures that updates to existing nodes don't require all the `setdefault` operations that check that parents exist. Any time we create a brand new node we should call `_save_metadata` with `ensure_parents=True`.
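To make the intent of `ensure_parents` concrete, here is a minimal sketch rather than the code in this PR. `DictStore`, `parent_paths`, and `save_metadata` are hypothetical names, the store is a plain in-memory dict, and the `zarr.json` keys and group payload are placeholders for illustration.

```python
import asyncio

# Placeholder payload for an implicitly created parent group (assumption).
GROUP_METADATA = b'{"node_type": "group"}'


class DictStore:
    """Hypothetical in-memory store; not the zarr Store class."""

    def __init__(self) -> None:
        self.data: dict[str, bytes] = {}

    async def set(self, key: str, value: bytes) -> None:
        self.data[key] = value

    async def setdefault(self, key: str, default: bytes) -> None:
        # Writes only when the key is absent, so existing parents are untouched.
        self.data.setdefault(key, default)


def parent_paths(path: str) -> list[str]:
    """Return every ancestor of a nested path, with the root ("") first."""
    parts = [p for p in path.split("/") if p]
    return ["/".join(parts[:i]) for i in range(len(parts))]


async def save_metadata(
    store: DictStore, path: str, metadata: bytes, ensure_parents: bool = False
) -> None:
    # The node's own metadata is always written (and may overwrite).
    writes = [store.set(f"{path}/zarr.json", metadata)]
    if ensure_parents:
        # Parents are only created if missing, and all writes run concurrently.
        for parent in parent_paths(path):
            key = f"{parent}/zarr.json" if parent else "zarr.json"
            writes.append(store.setdefault(key, GROUP_METADATA))
    await asyncio.gather(*writes)
```

Updating an existing node would call `save_metadata` with the default `ensure_parents=False` and so issue a single write.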
@@ -68,7 +69,13 @@ def _put(
             f.write(value.as_numpy_array().tobytes())
         return None
     else:
-        return path.write_bytes(value.as_numpy_array().tobytes())
+        view = memoryview(value.as_numpy_array().tobytes())
`pathlib.Path.write_bytes` doesn't provide control over the mode. So this is just that method inlined, with `mode="xb"` if we want exclusive create.
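A small standalone sketch of that idea, not the code in this diff: `write_bytes` and its `exclusive` flag are hypothetical names. Opening a file with mode `"xb"` raises `FileExistsError` if the path already exists, which is what gives "create only if absent" semantics on a local filesystem.

```python
import tempfile
from pathlib import Path


def write_bytes(path: Path, data: bytes, exclusive: bool = False) -> int:
    """Like Path.write_bytes, but optionally refuse to overwrite an existing file."""
    mode = "xb" if exclusive else "wb"
    view = memoryview(data)
    with path.open(mode=mode) as f:
        return f.write(view)


def demo() -> bytes:
    """Write twice with exclusive=True; the second attempt must not overwrite."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "zarr.json"
        write_bytes(target, b"first", exclusive=True)
        try:
            write_bytes(target, b"second", exclusive=True)
        except FileExistsError:
            pass  # the existing file is left untouched
        return target.read_bytes()
```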
Thanks @TomAugspurger -- I'll plan to review this ASAP.
Noting that S3 recently added support for conditional writes (and other object stores already have this), so it is conceivable that s3fs/fsspec could add support for atomic writes in the near future. cc @martindurant
Opened fsspec/filesystem_spec#1693 for that. We can pick that API up if / when it's available, but for now we'll live with the potential for concurrent writers to mess each other up.
src/zarr/store/logging.py
Outdated
@@ -138,6 +139,10 @@ async def set(self, key: str, value: Buffer) -> None:
         with self.log():
             return await self._store.set(key=key, value=value)

+    async def setdefault(self, key: str, default: Buffer) -> None:
+        with self.log():
+            return await self._store.set(key=key, value=default)
-            return await self._store.set(key=key, value=default)
+            return await self._store.setdefault(key=key, value=default)
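A short illustration of why the delegation target matters, using hypothetical stand-in classes (`MemoryStore`, `BuggyWrapper`, `FixedWrapper`) rather than the real logging store: a wrapper that forwards `setdefault` to `set` silently overwrites existing values.

```python
import asyncio


class MemoryStore:
    """Hypothetical dict-backed store used only for this illustration."""

    def __init__(self) -> None:
        self.data: dict[str, bytes] = {}

    async def set(self, key: str, value: bytes) -> None:
        self.data[key] = value

    async def setdefault(self, key: str, default: bytes) -> None:
        self.data.setdefault(key, default)


class BuggyWrapper:
    def __init__(self, store: MemoryStore) -> None:
        self._store = store

    async def setdefault(self, key: str, default: bytes) -> None:
        # Forwards to set(), so an existing value is overwritten.
        await self._store.set(key, default)


class FixedWrapper:
    def __init__(self, store: MemoryStore) -> None:
        self._store = store

    async def setdefault(self, key: str, default: bytes) -> None:
        # Forwards to setdefault(), so an existing value is preserved.
        await self._store.setdefault(key, default)


async def value_after_setdefault(wrapper_cls: type) -> bytes:
    """Seed a key, call the wrapper's setdefault, and return the stored value."""
    store = MemoryStore()
    await store.set("zarr.json", b"original")
    await wrapper_cls(store).setdefault("zarr.json", b"default")
    return store.data["zarr.json"]
```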
I'm happy with this as is (modulo a few very minor suggestions). I'll defer to @d-v-b for a final review though.
Should this be in
Mmm that does sound a bit cleaner. The reason for putting it in I could imagine a refactor that builds up a list of operations and hands that off to a writer. We do something kind of like that in
Since this has essentially the same behavior as setdefault (aside from the return value), that did feel natural to me. As a bonus, you now know about setdefault :) Some arguments either way:
I don't have a strong preference.
Some alternative names to
RE name, how about One question about the ABC is whether we want to put a default implementation in? Something like:

    async def set_if_not_exists(self, key, value) -> None:
        if not (await self.contains(key)):
            await self.set(key, value)

This is obviously not atomic but is as good as we can do for some stores.
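A runnable sketch of how such a default could sit on an abstract base class, with a store that overrides it. All class names here (`Store`, `MemoryStore`, `AtomicMemoryStore`) are hypothetical and simplified to raw bytes; `contains` follows the snippet above and is not necessarily the real method name.

```python
import asyncio
from abc import ABC, abstractmethod


class Store(ABC):
    """Hypothetical minimal store interface, for illustration only."""

    @abstractmethod
    async def contains(self, key: str) -> bool: ...

    @abstractmethod
    async def set(self, key: str, value: bytes) -> None: ...

    async def set_if_not_exists(self, key: str, value: bytes) -> None:
        # Non-atomic fallback: another writer can create the key between the
        # check and the write. Stores with an atomic primitive should override.
        if not (await self.contains(key)):
            await self.set(key, value)


class MemoryStore(Store):
    """Relies on the inherited, non-atomic default."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    async def contains(self, key: str) -> bool:
        return key in self._data

    async def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class AtomicMemoryStore(MemoryStore):
    """Overrides the default with a single check-and-set step."""

    async def set_if_not_exists(self, key: str, value: bytes) -> None:
        # No await separates the check from the write, so no other task
        # on the same event loop can interleave here.
        self._data.setdefault(key, value)
```

Both variants give the same result for a single writer; they differ only in whether a concurrent writer can slip in between the existence check and the write.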
A default implementation, even if it isn't atomic, is probably OK. This isn't going to be our only source of concurrency issues between Nodes & Stores. I'll leave a comment that implementations with that ability might want to override this. There's one other consideration: whether we should require a return type to indicate whether the value was set (a new key created equal to
This should be good to go, maybe aside from @d-v-b's point about I do agree that it's a bit strange to shove this in there, but IMO the benefit of creating all the parents concurrently makes it worth it. In the future I think we could refactor this a bit to have a kind of bulk
I broadly agree with this analysis. For now it's great that we have this merged as-is, but longer-term I might explore moving the "create all the parents as needed" logic one level up, to the creation routines.
This updates our Array and Group creation methods to ensure that parents implicitly defined through a nested path are also created. The main goals are to
To accomplish this semi-safely and efficiently, I've required a new `setdefault` method on the Store class. The idea is to use an atomic "set if this doesn't exist" method. But given that we already have concurrency issues around these Metadata classes, maybe we don't worry about this and do the "if exists" and "set" operations as two separate steps (we have to do this anyway for RemoteStore).
Closes #2228
TODO: