LRU cache for decoded chunks #306

Closed
wants to merge 56 commits

Conversation

@shikharsg (Member) commented Oct 11, 2018

Issue: #278

The ChunkCache implemented here is an almost identical copy of the LRUStoreCache class, but with the keys-cache functionality removed (the same applies to its tests).

I was hesitant to implement it as a superclass of LRUStoreCache, as the two are meant to be used in different places. While LRUStoreCache behaves like any other Zarr storage class and can be wrapped around any other store, ChunkCache is specifically for storing decoded chunks and is passed to an Array.
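
To make the intended usage concrete, here is a rough sketch based on this PR; the LRUChunkCache name and the chunk_cache keyword are introduced by this branch and may still change:

    import zarr
    from zarr.storage import DirectoryStore, LRUChunkCache  # LRUChunkCache is added by this PR

    store = DirectoryStore('example.zarr')
    cache = LRUChunkCache(max_size=2**28)  # hold up to ~256 MiB of decoded chunks
    z = zarr.open(store=store, mode='a', shape=(10000, 10000), chunks=(1000, 1000),
                  dtype='f8', chunk_cache=cache)

    z[:1000, :1000] = 42.0  # with the write cache, the written chunk is also cached
    _ = z[:1000, :1000]     # repeated reads of the same chunk skip the decode step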

I have implemented a read-write cache: in addition to caching chunks when reading from the array, writing to a chunk stores that chunk in the ChunkCache rather than invalidating it (which is what a read-only cache would do). There is a problem with this approach, as the three tests below are failing. Without a ChunkCache, data written to a Zarr array always goes through the encode phase on write and the decode phase on a subsequent read. With the write cache implemented here, it is possible that a write followed by a read skips the encode-decode round trip entirely. For the following three tests to pass, it is imperative that the data go through encode-decode when written and then read back:

Tracebacks can be seen in Travis CI

I wonder if there is a way to get around this while keeping the write cache, which makes me question whether I should implement a write cache at all. The above three tests pass if we implement only a read cache.

Finally, the chunk-cache logic as implemented in the _chunk_getitem function makes that function even harder to read than it already was. I could refactor it for better readability if desired.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or
    pytest -v --doctest-modules zarr)
  • Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or
    pytest -v zarr)
  • PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@shikharsg changed the title from "Chunk cache" to "LRU cache for decoded chunks" on Oct 11, 2018
@alimanfoo (Member) left a review comment

Looks good to me, the tests especially. My only comment is regarding a possible simplification of _chunk_getitem().

@alimanfoo (Member) left a review comment

A few documentation suggestions. Also need to add the new chunk cache class to the API docs, and ensure it is imported into the root zarr namespace.

@alimanfoo (Member) left a review comment

Possible workaround for one of the test failures.

@alimanfoo (Member)

Regarding the msgpack failure, maybe that test needs to be skipped for the array tests with chunk cache: the msgpack round-trip behaviour is broken, and the behaviour when using a chunk cache is in fact the correct one.

Regarding the test failure about mismatching dtypes, I don't know what's causing that. Maybe worth revising the implementation of _chunk_getitem() as suggested above and seeing if it persists.

It seems worth pursuing support for the write cache; there should be a way to deal with these test issues.
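
For context on the msgpack point above: numcodecs' MsgPack codec does not round-trip all object types exactly (tuples come back as lists, for example), so a chunk cache that hands back the original in-memory objects can legitimately differ from what a store round trip returns. A small illustration of the assumed codec behaviour (exact output may vary by version):

    import numcodecs
    import numpy as np

    codec = numcodecs.MsgPack()
    x = np.empty(2, dtype=object)
    x[0] = (1, 2)
    x[1] = 'foo'
    y = codec.decode(codec.encode(x))
    # y[0] is the list [1, 2] rather than the tuple (1, 2): the round trip
    # through msgpack changes the type, whereas a chunk cache would return the
    # original tuple, which is why the existing test expectation differs.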

@shikharsg (Member, Author)

Thanks for the review. Very helpful.

For fixing TestArrayWithChunkCache.test_dtypes, I have made a special case in the buffer_size function to detect 'Mm' (datetime64/timedelta64) dtypes, as you suggested.
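
For illustration, the special case might look something like this (a sketch only; the actual buffer_size in zarr.storage may differ in detail):

    import numpy as np

    def buffer_size(v):
        if isinstance(v, np.ndarray) and v.dtype.kind in 'Mm':
            # datetime64/timedelta64 arrays cannot be exported via the buffer
            # protocol, so memoryview(v) below would raise; use nbytes instead
            return v.nbytes
        return memoryview(v).nbytes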

The reason for the TestArrayWithChunkCache.test_object_arrays_vlen_array failure is slightly subtle. In that test we intend to apply this filter: VLenArray('<u4'). As you can see in the snippet below (taken from here), the dtype of the variable-length arrays will not change until an encode-decode round trip happens:

>>> import numcodecs
>>> import numpy as np
>>> x = np.array([[1, 3, 5], [4], [7, 9]], dtype='object')
>>> codec = numcodecs.VLenArray('<i4')
>>> codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([4], dtype=int32),
       array([7, 9], dtype=int32)], dtype=object)

When caching during _chunk_setitem, the VLenArray('<u4') filter is not applied and no encode-decode happens, hence the test failure. The only workaround I can think of is to apply encode and then decode to the chunk when write-caching and the filter is VLenArray. See my comment above in the _chunk_setitem_nosync function.

Will do the relevant changes in docs and class naming shortly.

@alimanfoo (Member)

Thanks @shikharsg. The test_object_arrays_vlen_array situation is tricky. It exposes a potential difficulty in general with object arrays and caching on write. I can see a few possible options at the moment...

Option 1a. Don't support caching on write at all.

Option 1b. Don't support caching on write for object arrays.

Option 2. Relax the test.

Option 3. Round-trip the chunk through encode/decode before caching.

FWIW I would be fine with option 1a or 1b if you want to get the read caching in now and leave write caching for later. The primary use cases Zarr targets are write-once-read-many so read caching is the priority.

I think I'd also be OK with option 2. However, there is another issue that I just realised would need to be addressed: during write, saving the chunk to the cache occurs too early, so there is a possibility that the chunk gets cached but a failure subsequently occurs during encoding or storing. This could arise, for example, if an object array is passed that cannot be encoded (e.g., as a vlen array). It could be addressed by moving the call that sets the chunk in the cache to be the very last statement inside the method, i.e., the end of Array._chunk_setitem_nosync() becomes:

        # encode chunk
        cdata = self._encode_chunk(chunk)

        # store
        self.chunk_store[ckey] = cdata

        # cache the chunk
        if self._chunk_cache is not None:
            self._chunk_cache[ckey] = np.copy(chunk)

Option 3 also seems OK for object arrays. There is a way this could be done that would handle a wider range of scenarios for object arrays, if the end of Array._chunk_setitem_nosync() becomes:

        # encode chunk
        cdata = self._encode_chunk(chunk)

        # store
        self.chunk_store[ckey] = cdata

        # cache the chunk
        if self._chunk_cache is not None:
            if self._dtype == object:
                # ensure cached chunk has been round-tripped through encode/decode
                chunk = self._decode_chunk(cdata)
            self._chunk_cache[ckey] = np.copy(chunk)

Note that this still would mean that using a chunk cache would slow down writing of object arrays, because it would incur an extra decode as chunks are written.

@shikharsg (Member, Author)

I have moved the write-cache call to the very last statement inside the method. I have also implemented caching for 0-d arrays.

About the test_object_arrays_vlen_array problem, option 3 (but only for dtype=object arrays) seems best to me, and that is what I have implemented as you suggested. We might want to document the slowdown of write-caching object arrays somewhere. Which part of the docs should I put this in?

@alimanfoo (Member) left a review comment

Thanks @shikharsg.

For documenting the potential slowdown using chunk_cache with object arrays, that could go in a Notes section in the Array class docstring.

The only thing left, I think, is to look at simplifying the implementation of _chunk_getitem(); otherwise this looks pretty mature to me.

One other question is whether we should document these new features as experimental, which would give users some warning that this is something new and would also give us some latitude to change the API via a minor release if needed.

@shikharsg (Member, Author)

After hitting a few bugs in LRUChunkCache and seeing that it didn't have enough tests, I figured that some, but not all, of the tests from the base class StoreTests could be reused for LRUChunkCache. So I split the methods of StoreTests into two classes, MutableMappingStoreTests and StoreTests, the latter inheriting from the former. StoreTests, in addition to the methods of MutableMappingStoreTests, has test_hierarchy and the test_init* methods, and the LRUChunkCache tests inherit from MutableMappingStoreTests. Doing this also let me easily fix a few bugs in LRUChunkCache. I hope this is not too big a change; if you would like to move any methods between StoreTests and MutableMappingStoreTests, or revert this change entirely, do let me know. A rough sketch of the resulting hierarchy is below.
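
Something like the following (hypothetical and heavily abbreviated; the real classes in zarr/tests/test_storage.py contain many more tests):

    import unittest

    class MutableMappingStoreTests(object):
        """Tests that only rely on MutableMapping behaviour."""

        def create_store(self):
            raise NotImplementedError

        def test_get_set_del_contains(self):
            store = self.create_store()
            store['foo'] = b'bar'
            assert 'foo' in store
            assert store['foo'] == b'bar'
            del store['foo']
            assert 'foo' not in store

    class StoreTests(MutableMappingStoreTests):
        """Adds test_hierarchy and the test_init* methods, which only make
        sense for real Zarr stores."""

    class TestLRUChunkCache(MutableMappingStoreTests, unittest.TestCase):
        def create_store(self):
            from zarr.storage import LRUChunkCache  # class added in this PR
            return LRUChunkCache(max_size=None)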

I would like to work on the _chunk_getitem() simplification for a little more time, if that's okay. Will get back on this soon.

I would certainly go with documenting this feature as experimental, which would allow us to make API changes if needed. Do let me know where I could document this.

@alimanfoo (Member) commented Oct 20, 2018 via email

@alimanfoo (Member)

Maybe we should be looking to pull out common code from the two LRU classes into a common base class e.g. LRUMappingCache. Might also help with some refactoring of tests.
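
To make that concrete, a minimal sketch of what such a shared base might look like (the LRUMappingCache name and these method names are hypothetical, and the real LRUStoreCache bookkeeping differs):

    from collections import OrderedDict
    from threading import Lock

    class LRUMappingCache(object):
        """Shared LRU machinery that both cache classes could inherit."""

        def __init__(self, max_size=None):
            self._max_size = max_size     # max total size in bytes; None = unbounded
            self._current_size = 0
            self._values = OrderedDict()  # key -> (value, size), oldest first
            self._mutex = Lock()

        def _cache_value(self, key, value, size):
            with self._mutex:
                if key in self._values:
                    _, old_size = self._values.pop(key)
                    self._current_size -= old_size
                if self._max_size is not None and size > self._max_size:
                    return  # the value can never fit, so don't cache it
                # evict least-recently-used entries until the new value fits
                while (self._max_size is not None
                       and self._current_size + size > self._max_size):
                    _, (_, evicted_size) = self._values.popitem(last=False)
                    self._current_size -= evicted_size
                self._values[key] = (value, size)
                self._current_size += size

        def _get_cached_value(self, key):
            with self._mutex:
                value, _ = self._values[key]   # raises KeyError on a cache miss
                self._values.move_to_end(key)  # mark as most recently used
                return value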

@shikharsg (Member, Author)

Thanks @joshmoore. Turned out to be an easy fix actually.

@joshmoore (Member)

What's the next step here?

@shikharsg (Member, Author)

In terms of functionality, it's all working, but there were some concerns regarding refactoring of the code, so this PR needs a detailed review. Otherwise it's good to go in.

@martindurant (Member)

Quick question: does this layer respect get/setitems for concurrent access with fsspec?

I ask because fsspec, of course, has its own idea of caching. That is file-oriented rather than chunk-wise, but with ReferenceFileSystem, chunks at other remote URLs can be regarded as standalone files too. It would be nice to have fsspec's caching layer also respect LRU semantics (i.e., a maximum number of entries, or a maximum cache size). That is orthogonal, though, and shouldn't delay this PR any further.

@niowniow commented Jun 28, 2021

Any chance this can be reviewed? After quite some profiling, I realized that LRUStoreCache does not offer the performance benefits I expected, because of this decoding issue.

If I can do anything to move this PR forward, let me know.

@joshmoore (Member) left a review comment

A few minor wording suggestions from a quick read. There are also now a few conflicts with the mainline.

At a very high level, my biggest question would be whether or not this could have been a concern of the LRUStoreCache, if it had been possible to more reliably detect which keys represent chunks.

can provide, as there may be opportunities to optimise further either within
Zarr or within the mapping interface to the storage.
The above :class:`zarr.storage.LRUStoreCache` wraps any Zarr storage class, and stores
encoded chunks. So every time cache is accessed, the chunk has to be decoded. For cases
Suggested change
encoded chunks. So every time cache is accessed, the chunk has to be decoded. For cases
encoded chunks. Every time the cache is accessed, the chunk must be decoded. For cases

Mapping to store decoded chunks for caching. Can be used in repeated
chunk access scenarios when decoding of data is computationally
expensive.
NOTE: When using the write cache feature with object arrays(i.e.
Suggested change
NOTE: When using the write cache feature with object arrays(i.e.
NOTE: When using the write cache feature with object arrays (i.e.


@niowniow commented Jul 5, 2021

I think the issue is mainly that the Array class is responsible for decompressing. Is there a reason the de/compressor is implemented this way? I haven't checked through the code in detail yet. Please bear with me if the following is nonsense:

My idea, basically, is to move all functionality of the compressor out of the Array class. As an alternative, one could implement a compressor wrapper that wraps a store, similar to how LRUStoreCache wraps one; then we could write, e.g., store = LRUStoreCache(CompressorWrapper(DirectoryStore(...))) and the LRUChunkCache wouldn't be necessary. Or is there another benefit to using the LRUChunkCache?

In the Array class we could remove __compressor and instead wrap the store with CompressorWrapper if there is no CompressorWrapper in the store chain. I guess the default would be to wrap the store unless the user explicitly disables it because they have wrapped it themselves. All parameters given to Array can just be propagated through to its CompressorWrapper instance

Edit: This would then probably require distinguishing chunk keys from other data. (A rough sketch of the wrapper idea follows.)
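
A hypothetical sketch of the CompressorWrapper idea (none of this exists in Zarr; the class name and constructor are made up for illustration):

    from collections.abc import MutableMapping

    class CompressorWrapper(MutableMapping):
        """Wraps a store so values are encoded on write and decoded on read,
        so a cache wrapped around it (e.g. LRUStoreCache) holds decoded data."""

        def __init__(self, store, compressor):
            self._store = store            # any MutableMapping-style Zarr store
            self._compressor = compressor  # a numcodecs codec, e.g. Blosc()

        def __getitem__(self, key):
            return self._compressor.decode(self._store[key])

        def __setitem__(self, key, value):
            self._store[key] = self._compressor.encode(value)

        def __delitem__(self, key):
            del self._store[key]

        def __iter__(self):
            return iter(self._store)

        def __len__(self):
            return len(self._store)

As the edit above notes, a real version would also need to pass metadata keys such as '.zarray' through untouched, i.e. it would have to distinguish chunk keys from other data.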

@niowniow

I had another look at this PR. There is one corner case not covered: if write_direct (here) is true, i.e. the write covers one whole chunk exactly, that chunk will not be cached. I don't know if that was intended.

@joshmoore (Member)

Merged v2.9.3 into this branch.

@joshmoore (Member)

@ericpre: from hyperspy/hyperspy#2798 (comment), do I understand correctly that you've tested this and it's working for you?

@ericpre (Contributor) commented Sep 23, 2021

> @ericpre: from hyperspy/hyperspy#2798 (comment), do I understand correctly that you've tested this and it's working for you?

Yes, I have tested this branch and I can confirm the speed improvement. If you would like more details, I would need to check again, but I won't have time to do that for a week or so.

@croth1 commented Oct 20, 2022

I would be very interested in this functionality. Anything I can do that would help finish this up?

@jakirkham (Member)

Probably fixing merge conflicts is the first thing. Agree this is something we'd like to see included.

@claesenm

I'm also quite interested in this feature, but it seems the PR has gone stale. Are there any remaining open issues to be addressed? Having gone through the thread, it looks ready for merging.

@joshmoore (Member)

Hi @claesenm. The primary issue will be the merge conflicts:

zarr/__init__.py
zarr/core.py
zarr/creation.py
zarr/hierarchy.py
zarr/storage.py
zarr/tests/test_storage.py

If you or anyone else could try to create a merge commit, that would certainly help. (Thanks!)

@jhamman (Member) commented Dec 7, 2023

For those following this issue, I think the most important thing right now is to look at #1583 and help us think about how this functionality should be supported in zarr-python's 3.0 API. There's a lot of surface area in this PR, and I want to make sure we cover the functionality going forward if it's important.

@jhamman jhamman added the stale label Dec 7, 2023
@jhamman (Member) commented Feb 13, 2024

Closing this out as stale. We'll keep the feature request open.

@jhamman jhamman closed this Feb 13, 2024