LRU cache for decoded chunks #306
Conversation
Looks good to me, tests especially. Only comment is regarding possible simplification of _chunk_getitem().
A few documentation suggestions. Also need to add the new chunk cache class to the API docs, and ensure it is imported into the root zarr namespace.
Possible workaround for one of the test failures.
Regarding the msgpack failure, maybe that test needs to be skipped for the array tests with chunk cache; the behaviour with msgpack is broken, and is in fact correct when using a chunk cache. Regarding the test failure about mismatching dtypes, I don't know what's causing that. It may be worth revising the implementation of _chunk_getitem() as suggested above and seeing if it persists. It seems worth pursuing support for the write cache implementation; there should be a way to deal with these test issues.
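A minimal sketch of how such a skip could look in a chunk-cache test subclass; the class and test names here are placeholders, not the actual names in the test suite:

```python
import unittest

import pytest


class TestArray(unittest.TestCase):
    """Placeholder standing in for the existing array test base class."""


class TestArrayWithChunkCache(TestArray):

    def test_object_arrays_vlen_bytes_msgpack(self):
        # hypothetical name for the msgpack-based object-array test
        pytest.skip("msgpack round-trip behaviour differs (and is arguably "
                    "correct) when a chunk cache is used")
```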
Thanks for the review. Very helpful.

For fixing TestArrayWithChunkCache.test_dtypes, I have made a special case in the

The reason for the TestArrayWithChunkCache.test_object_arrays_vlen_array failure is slightly subtle. In that test we intend to apply this filter: VLenArray('<u4'). Now as you can see in the snippet below (taken from here), the dtype of the variable-length arrays will not change until encode-decode happens:

```python
>>> import numcodecs
>>> import numpy as np
>>> x = np.array([[1, 3, 5], [4], [7, 9]], dtype='object')
>>> codec = numcodecs.VLenArray('<i4')
>>> codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([4], dtype=int32),
       array([7, 9], dtype=int32)], dtype=object)
```

When caching during _chunk_setitem, the VLenArray('<u4') filter is not applied and encode-decode does not happen, hence the test failure. The only workaround I can think of is to apply encode then decode to the chunk when write caching and the filter is VLenArray. See my comment above in the _chunk_setitem_nosync function.

Will do the relevant changes in docs and class naming shortly.
Thanks @shikharsg. The test_object_arrays_vlen_array situation is tricky. It exposes a potential difficulty in general with object arrays and caching on write. I can see a few possible options at the moment...

Option 1a. Don't support caching on write at all.
Option 1b. Don't support caching on write for object arrays.
Option 2. Relax the test.
Option 3. Round-trip the chunk through encode/decode before caching.

FWIW I would be fine with option 1a or 1b if you want to get the read caching in now and leave write caching for later. The primary use cases Zarr targets are write-once-read-many, so read caching is the priority. I think I'd also be OK with option 2.

However, there is another issue that I just realised would need to be addressed: during write, saving the chunk to the cache occurs too early, and there is a possibility that the chunk could get cached but a failure could subsequently occur either during encoding or storing. This could arise, for example, if an object array is passed that cannot be encoded (e.g., as a vlen array). This could be addressed by moving the call to set the chunk in the cache to be the very last statement inside the method, i.e., the end of Array._chunk_setitem_nosync() becomes:

```python
# encode chunk
cdata = self._encode_chunk(chunk)

# store
self.chunk_store[ckey] = cdata

# cache the chunk
if self._chunk_cache is not None:
    self._chunk_cache[ckey] = np.copy(chunk)
```

Option 3 also seems OK for object arrays. There is a way this could be done that would handle a wider range of scenarios for object arrays, if the end of Array._chunk_setitem_nosync() becomes:

```python
# encode chunk
cdata = self._encode_chunk(chunk)

# store
self.chunk_store[ckey] = cdata

# cache the chunk
if self._chunk_cache is not None:
    if self._dtype == object:
        # ensure cached chunk has been round-tripped through encode/decode
        chunk = self._decode_chunk(cdata)
    self._chunk_cache[ckey] = np.copy(chunk)
```

Note that this still would mean that using a chunk cache would slow down writing of object arrays, because it would incur an extra decode as chunks are written.
I have moved the write cache to the very last statement inside the method. I have also implemented caching for 0-d arrays.

About the test_object_arrays_vlen_array problem, option 3 (but just for dtype=object arrays) seems best to me, which I have also implemented as you suggested. We might want to document the slowdown for write-cache object arrays somewhere. Which part of the docs should I put this in?
Thanks @shikharsg.
For documenting the potential slowdown using chunk_cache with object arrays, that could go in a Notes section in the Array class docstring.
The only thing left, I think, is to look at simplifying the implementation of _chunk_getitem(); otherwise this looks pretty mature to me.
One other question is whether we should document these new features as experimental, which would give users some warning that this is something new and would also give us some latitude to change the API via a minor release if needed.
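A possible shape for that Notes entry, as a sketch in numpydoc style (the wording is an assumption, not the final text):

```python
class Array:
    """Instantiate an array from an initialized store.

    Parameters
    ----------
    ...

    Notes
    -----
    When a chunk cache is used together with an object-dtype array, writes are
    slower: each written chunk is round-tripped through encode/decode before it
    is cached, so that the cached chunk matches what a read from the store
    would return.
    """
```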
After getting a few bugs in LRUChunkCache and seeing that it didn't have enough tests, I figured that some of the tests from the base class StoreTests could be used for LRUChunkCache, but not all of them. So I separated the methods of StoreTests into two classes, MutableMappingStoreTests and StoreTests, the latter inheriting from the former. StoreTests, in addition to the methods of MutableMappingStoreTests, has test_hierarchy and the test_init* methods. Also, LRUChunkCache inherits MutableMappingStoreTests. Doing this also let me easily fix a few bugs in LRUChunkCache. I hope this is not too big of a change; if you would like to shift any methods from StoreTests to MutableMappingStoreTests or vice versa, or even revert this change entirely, do let me know. A sketch of the split is shown below.

I would like to work on the _chunk_getitem() simplification for a little more time, if that's okay. Will get back on this soon.

I would certainly go with documenting this feature as experimental, which would allow us to make API changes if needed. Do let me know where I could document this.
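A rough sketch of that test-class split; the class names and inheritance follow the comment above, while the method bodies and the LRUChunkCache constructor are illustrative assumptions:

```python
class MutableMappingStoreTests:
    """Tests exercising only MutableMapping behaviour, shared by the store
    classes and by LRUChunkCache."""

    def create_store(self):
        # overridden by each concrete test class
        raise NotImplementedError

    def test_get_set_del_contains(self):
        store = self.create_store()
        store['foo'] = b'bar'
        assert 'foo' in store
        assert store['foo'] == b'bar'
        del store['foo']
        assert 'foo' not in store


class StoreTests(MutableMappingStoreTests):
    """Adds the store-specific tests (test_hierarchy and the test_init*
    methods) that do not apply to a chunk cache."""

    def test_hierarchy(self):
        ...

    def test_init_array(self):
        ...


class TestLRUChunkCache(MutableMappingStoreTests):

    def create_store(self):
        from zarr import LRUChunkCache  # name used in this PR
        return LRUChunkCache(max_size=2**27)  # constructor signature assumed
```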
Thanks for pushing forward on this. I'm out of radio contact for two weeks now, but I think this is moving in a good direction, please feel free to continue as you see fit. Will be great if we can try to keep code as simple and easy to understand as possible, with good separation of concerns and minimal code duplication.
Maybe we should be looking to pull out common code from the two LRU classes into a common base class, e.g. LRUMappingCache. Might also help with some refactoring of tests.
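A rough sketch of what such a shared base might look like; the name LRUMappingCache comes from the suggestion above, while the method names and size accounting here are assumptions about what the two caches share:

```python
from collections import OrderedDict
from threading import RLock


class LRUMappingCache:
    """Shared LRU machinery: keeps values up to ``max_size`` bytes and evicts
    the least recently used entries first."""

    def __init__(self, max_size=None):
        self._max_size = max_size
        self._current_size = 0
        self._values_cache = OrderedDict()
        self._mutex = RLock()

    def _value_size(self, value):
        # stores cache encoded bytes, the chunk cache would hold numpy arrays;
        # both expose a size in bytes one way or another (simplified here)
        return getattr(value, 'nbytes', None) or len(value)

    def _cache_value(self, key, value):
        with self._mutex:
            size = self._value_size(value)
            if self._max_size is not None:
                while self._values_cache and self._current_size + size > self._max_size:
                    _, evicted = self._values_cache.popitem(last=False)
                    self._current_size -= self._value_size(evicted)
            self._values_cache[key] = value
            self._current_size += size

    def __getitem__(self, key):
        with self._mutex:
            value = self._values_cache[key]
            self._values_cache.move_to_end(key)  # mark as most recently used
            return value
```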
Thanks @joshmoore. Turned out to be an easy fix actually.
What's the next step here?
In terms of functionality, it's all working. But there were some concerns regarding refactoring the code. So this PR needs a detailed review. Otherwise it's good to go in.
Quick question: does this layer respect get/setitems for concurrent access with fsspec? I ask because fsspec, of course, has its own idea of caching. That is file-oriented rather than chunk-wise, but since ReferenceFileSystem, chunks of other remote URLs can be regarded as standalone files too. It would be nice to have fsspec's caching layer also respect LRU (i.e., number of entries, or size of cache). That is orthogonal and shouldn't delay this PR any further.
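For reference, a sketch of using fsspec's own file-oriented caching layer mentioned here, which is independent of the chunk cache proposed in this PR; the URL and cache directory are placeholders:

```python
import fsspec
import zarr

# "filecache" caches whole remote files on local disk; with ReferenceFileSystem
# or similarly addressed data, each chunk is effectively its own file.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="https",
    cache_storage="/tmp/fsspec-cache",  # placeholder local cache directory
)
store = fs.get_mapper("https://example.com/data.zarr")  # placeholder URL
z = zarr.open(store, mode="r")
```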
Any chance this can be reviewed? After quite some profiling I realized that LRUStoreCache does not offer the performance benefits I expected, because of this decoding issue. If I can do anything to move this PR forward, let me know.
Few minor wording suggestions on a quick read. There are also now a few conflicts with the mainline.
At a very high level, my biggest question would be whether or not this could have been a concern of the LRUStoreCache IF it had been possible to more reliably detect which key represents a chunk.
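To illustrate what "detect which key represents a chunk" might involve, here is a hypothetical heuristic based on zarr v2 key naming conventions; it is not something the library provides:

```python
import re

# Chunk keys in zarr v2 are dot-separated integer coordinates under the array
# path, while metadata keys end in .zarray/.zgroup/.zattrs/.zmetadata.
_CHUNK_KEY_RE = re.compile(r'(^|/)\d+(\.\d+)*$')


def looks_like_chunk_key(key: str) -> bool:
    if key.endswith(('.zarray', '.zgroup', '.zattrs', '.zmetadata')):
        return False
    return _CHUNK_KEY_RE.search(key) is not None


assert looks_like_chunk_key('foo/bar/0.0.3')
assert looks_like_chunk_key('0')
assert not looks_like_chunk_key('foo/bar/.zarray')
```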
can provide, as there may be opportunities to optimise further either within
Zarr or within the mapping interface to the storage.
The above :class:`zarr.storage.LRUStoreCache` wraps any Zarr storage class, and stores
encoded chunks. So every time cache is accessed, the chunk has to be decoded. For cases
Suggested change:
encoded chunks. Every time the cache is accessed, the chunk must be decoded. For cases
Mapping to store decoded chunks for caching. Can be used in repeated
chunk access scenarios when decoding of data is computationally
expensive.
NOTE: When using the write cache feature with object arrays(i.e.
Suggested change:
NOTE: When using the write cache feature with object arrays (i.e.
I think the issue is mainly that the ... My idea basically is to move all functionality of the compressor out of the ... In the ...

Edit: This would then probably require distinguishing chunk keys from other data.
I had another look at this PR. There is one corner case not covered. If ...
Merged v2.9.3 into this branch.
@ericpre: from hyperspy/hyperspy#2798 (comment), do I understand correctly that you've tested this and it's working for you?
Yes, I have tested this branch and I can confirm the speed improvement. If you would like more details, I would need to check again, but I won't have time to do so for a week or so.
I would be very interested in this functionality. Anything I can do that would help finish this up?
Probably fixing merge conflicts is the first thing. Agree this is something we'd like to see included.
I'm also quite interested in this feature, but it seems the PR has gone stale. Are there any remaining open issues to be addressed? Having gone through the thread, it looks ready for merging.
Hi @claesenm. The primary issue will be the merge conflicts:
If you or anyone else could try to create a merge commit, that would certainly help. (Thanks!)
For those following this issue, I think the most important thing right now is to look at #1583 and help us think about how this functionality should be supported in zarr-python's 3.0 API. There's a lot of surface area in this PR; I want to make sure we cover the functionality going forward if it's important.
Closing this out as stale. We'll keep the feature request open.
Issue: #278
The `ChunkCache` implemented here is an almost identical copy of the `LRUStoreCache` class, but with the keys cache functionality removed (same for its tests). I was hesitant to implement it as a super class of `LRUStoreCache`, as the two are supposed to be used in different places. While `LRUStoreCache` is like any other storage class of zarr and can be wrapped around any storage class of zarr, `ChunkCache` is specifically to be used for storage of decoded chunks, and is to be passed to an `Array`.

I have implemented both read and write caching, i.e. in addition to caching when one reads from the array, if one writes to a certain chunk, instead of invalidating that chunk in the `ChunkCache` (which would happen if one wanted only a read cache), the chunk being written is cached in the `ChunkCache` object. There is a problem with this approach, as the below 3 tests are failing. When not using `ChunkCache` (as implemented above), if we first write data to a zarr array and then read it, the data always goes through the encode phase and then the decode phase. If `ChunkCache` is used, with write cache as implemented here, it is possible that the data does not go through the encode-decode phase when we write and subsequently read it. Now for the following three tests to pass, it is imperative that the data go through the encode-decode phase when we write and then subsequently read from the array:

Tracebacks can be seen in Travis CI.

I wonder if there is a way to get around this while keeping the write cache implemented, which makes me wonder whether I should implement the write cache at all. The above 3 tests will pass if we implement only a read cache.

Finally, the logic of the chunk cache as implemented in the `_chunk_getitem` function makes that function much harder to read than it already was. I could refactor for better readability if desired.

TODO:
- … (`tox -e py36` or `pytest -v --doctest-modules zarr`)
- … (`tox -e py27` or `pytest -v zarr`)
- … (`tox -e py36` or `flake8 --max-line-length=100 zarr`)
- … (`tox -e py36` or `python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst`)
- … (`tox -e docs`)
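A hypothetical usage sketch of the feature discussed in this PR; the LRUChunkCache name, its constructor, and the chunk_cache keyword follow the conversation above but are not a confirmed final API:

```python
import numpy as np
import zarr

store = {}  # any MutableMapping works as a zarr store
cache = zarr.LRUChunkCache(max_size=2**28)  # assumed to be exported at the root namespace

z = zarr.open(store, mode='w', shape=(4000, 4000), chunks=(1000, 1000),
              dtype='f8', chunk_cache=cache)
z[:] = np.random.random((4000, 4000))

# Repeated reads of the same region should now skip decoding after the
# first access, because decoded chunks are served from the cache.
a = z[:1000, :1000]
b = z[:1000, :1000]
```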