LRU cache for decoded chunks #306

Closed
wants to merge 56 commits

Conversation

@shikharsg (Member) commented Oct 11, 2018

Issue: #278

The ChunkCache implemented here is an almost identical copy of the LRUStoreCache class, but with the keys-cache functionality removed (the same applies to its tests).

I was hesitant to implement it as a superclass of LRUStoreCache, as the two are meant to be used in different places. While LRUStoreCache behaves like any other Zarr storage class and can be wrapped around any other store, ChunkCache is specifically for storing decoded chunks and is passed to an Array.
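
To make the intended usage concrete, here is a rough sketch based on this PR; the LRUChunkCache name and the chunk_cache keyword are introduced by this branch and may still change:

    import zarr
    from zarr.storage import DirectoryStore, LRUChunkCache  # LRUChunkCache is added by this PR

    store = DirectoryStore('example.zarr')
    cache = LRUChunkCache(max_size=2**28)  # hold up to ~256 MiB of decoded chunks
    z = zarr.open(store=store, mode='a', shape=(10000, 10000), chunks=(1000, 1000),
                  dtype='f8', chunk_cache=cache)

    z[:1000, :1000] = 42.0  # with the write cache, the written chunk is also cached
    _ = z[:1000, :1000]     # repeated reads of the same chunk skip the decode step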

I have implemented a read-write cache: in addition to caching chunks when reading from the array, writing to a chunk stores that chunk in the ChunkCache rather than invalidating it (which is what a read-only cache would do). There is a problem with this approach, as the three tests below are failing. Without a ChunkCache, data written to a Zarr array always goes through the encode phase on write and the decode phase on a subsequent read. With the write cache implemented here, it is possible that a write followed by a read skips the encode-decode round trip entirely. For the following three tests to pass, it is imperative that the data go through encode-decode when written and then read back:

Tracebacks can be seen in Travis CI

I wonder if there is a way to get around this while keeping the write cache, which makes me question whether I should implement a write cache at all. The above three tests pass if we implement only a read cache.

Finally, the chunk-cache logic as implemented in the _chunk_getitem function makes that function even harder to read than it already was. I could refactor it for better readability if desired.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or
    pytest -v --doctest-modules zarr)
  • Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or
    pytest -v zarr)
  • PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@shikharsg changed the title from "Chunk cache" to "LRU cache for decoded chunks" on Oct 11, 2018
@alimanfoo (Member) left a review comment

Looks good to me, the tests especially. My only comment is regarding a possible simplification of _chunk_getitem().

@alimanfoo (Member) left a review comment

A few documentation suggestions. Also need to add the new chunk cache class to the API docs, and ensure it is imported into the root zarr namespace.

@alimanfoo (Member) left a review comment

Possible workaround for one of the test failures.

@alimanfoo (Member)

Regarding the msgpack failure, maybe that test needs to be skipped for the array tests with chunk cache: the msgpack round-trip behaviour is broken, and the behaviour when using a chunk cache is in fact the correct one.

Regarding the test failure about mismatching dtypes, I don't know what's causing that. Maybe worth revising the implementation of _chunk_getitem() as suggested above and seeing if it persists.

It seems worth pursuing support for the write cache; there should be a way to deal with these test issues.
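
For context on the msgpack point above: numcodecs' MsgPack codec does not round-trip all object types exactly (tuples come back as lists, for example), so a chunk cache that hands back the original in-memory objects can legitimately differ from what a store round trip returns. A small illustration of the assumed codec behaviour (exact output may vary by version):

    import numcodecs
    import numpy as np

    codec = numcodecs.MsgPack()
    x = np.empty(2, dtype=object)
    x[0] = (1, 2)
    x[1] = 'foo'
    y = codec.decode(codec.encode(x))
    # y[0] is the list [1, 2] rather than the tuple (1, 2): the round trip
    # through msgpack changes the type, whereas a chunk cache would return the
    # original tuple, which is why the existing test expectation differs.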

@shikharsg (Member, Author)

Thanks for the review. Very helpful.

For fixing TestArrayWithChunkCache.test_dtypes, I have made a special case in the buffer_size function to detect 'Mm' (datetime64/timedelta64) dtypes, as you suggested.
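
For illustration, the special case might look something like this (a sketch only; the actual buffer_size in zarr.storage may differ in detail):

    import numpy as np

    def buffer_size(v):
        if isinstance(v, np.ndarray) and v.dtype.kind in 'Mm':
            # datetime64/timedelta64 arrays cannot be exported via the buffer
            # protocol, so memoryview(v) below would raise; use nbytes instead
            return v.nbytes
        return memoryview(v).nbytes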

The reason for the TestArrayWithChunkCache.test_object_arrays_vlen_array failure is slightly subtle. In that test we intend to apply this filter: VLenArray('<u4'). As you can see in the snippet below (taken from here), the dtype of the variable-length arrays will not change until an encode-decode round trip happens:

>>> import numcodecs
>>> import numpy as np
>>> x = np.array([[1, 3, 5], [4], [7, 9]], dtype='object')
>>> codec = numcodecs.VLenArray('<i4')
>>> codec.decode(codec.encode(x))
array([array([1, 3, 5], dtype=int32), array([4], dtype=int32),
       array([7, 9], dtype=int32)], dtype=object)

When caching during _chunk_setitem, the VLenArray('<u4') filter is not applied and no encode-decode happens, hence the test failure. The only workaround I can think of is to apply encode and then decode to the chunk when write-caching and the filter is VLenArray. See my comment above in the _chunk_setitem_nosync function.

Will do the relevant changes in docs and class naming shortly.

@alimanfoo (Member)

Thanks @shikharsg. The test_object_arrays_vlen_array situation is tricky. It exposes a potential difficulty in general with object arrays and caching on write. I can see a few possible options at the moment...

Option 1a. Don't support caching on write at all.

Option 1b. Don't support caching on write for object arrays.

Option 2. Relax the test.

Option 3. Round-trip the chunk through encode/decode before caching.

FWIW I would be fine with option 1a or 1b if you want to get the read caching in now and leave write caching for later. The primary use cases Zarr targets are write-once-read-many so read caching is the priority.

I think I'd also be OK with option 2. However, there is another issue that I just realised would need to be addressed: during write, saving the chunk to the cache occurs too early, so there is a possibility that the chunk gets cached but a failure subsequently occurs during encoding or storing. This could arise, for example, if an object array is passed that cannot be encoded (e.g., as a vlen array). It could be addressed by moving the call that sets the chunk in the cache to be the very last statement inside the method, i.e., the end of Array._chunk_setitem_nosync() becomes:

        # encode chunk
        cdata = self._encode_chunk(chunk)

        # store
        self.chunk_store[ckey] = cdata

        # cache the chunk
        if self._chunk_cache is not None:
            self._chunk_cache[ckey] = np.copy(chunk)

Option 3 also seems OK for object arrays. There is a way this could be done that would handle a wider range of scenarios for object arrays, if the end of Array._chunk_setitem_nosync() becomes:

        # encode chunk
        cdata = self._encode_chunk(chunk)

        # store
        self.chunk_store[ckey] = cdata

        # cache the chunk
        if self._chunk_cache is not None:
            if self._dtype == object:
                # ensure cached chunk has been round-tripped through encode/decode
                chunk = self._decode_chunk(cdata)
            self._chunk_cache[ckey] = np.copy(chunk)

Note that this still would mean that using a chunk cache would slow down writing of object arrays, because it would incur an extra decode as chunks are written.

@shikharsg (Member, Author)

I have moved the write-cache call to the very last statement inside the method. I have also implemented caching for 0-d arrays.

About the test_object_arrays_vlen_array problem, option 3 (but only for dtype=object arrays) seems best to me, and that is what I have implemented as you suggested. We might want to document the slowdown of write-caching object arrays somewhere. Which part of the docs should I put this in?

@alimanfoo (Member) left a review comment

Thanks @shikharsg.

For documenting the potential slowdown using chunk_cache with object arrays, that could go in a Notes section in the Array class docstring.

The only thing left, I think, is to look at simplifying the implementation of _chunk_getitem(); otherwise this looks pretty mature to me.

One other question is whether we should document these new features as experimental, which would give users some warning that this is something new and would also give us some latitude to change the API via a minor release if needed.

@shikharsg (Member, Author)

After hitting a few bugs in LRUChunkCache and seeing that it didn't have enough tests, I figured that some, but not all, of the tests from the base class StoreTests could be reused for LRUChunkCache. So I split the methods of StoreTests into two classes, MutableMappingStoreTests and StoreTests, the latter inheriting from the former. StoreTests, in addition to the methods of MutableMappingStoreTests, has test_hierarchy and the test_init* methods, and the LRUChunkCache tests inherit from MutableMappingStoreTests. Doing this also let me easily fix a few bugs in LRUChunkCache. I hope this is not too big a change; if you would like to move any methods between StoreTests and MutableMappingStoreTests, or revert this change entirely, do let me know. A rough sketch of the resulting hierarchy is below.
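
Something like the following (hypothetical and heavily abbreviated; the real classes in zarr/tests/test_storage.py contain many more tests):

    import unittest

    class MutableMappingStoreTests(object):
        """Tests that only rely on MutableMapping behaviour."""

        def create_store(self):
            raise NotImplementedError

        def test_get_set_del_contains(self):
            store = self.create_store()
            store['foo'] = b'bar'
            assert 'foo' in store
            assert store['foo'] == b'bar'
            del store['foo']
            assert 'foo' not in store

    class StoreTests(MutableMappingStoreTests):
        """Adds test_hierarchy and the test_init* methods, which only make
        sense for real Zarr stores."""

    class TestLRUChunkCache(MutableMappingStoreTests, unittest.TestCase):
        def create_store(self):
            from zarr.storage import LRUChunkCache  # class added in this PR
            return LRUChunkCache(max_size=None)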

I would like to work on the _chunk_getitem() simplification for a little more time, if that's okay. Will get back on this soon.

I would certainly go with documenting this feature as experimental, which would allow us to make API changes if needed. Do let me know where I could document this.

@alimanfoo (Member) commented Oct 20, 2018 via email

@alimanfoo (Member)

Maybe we should be looking to pull out common code from the two LRU classes into a common base class e.g. LRUMappingCache. Might also help with some refactoring of tests.
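
To make that concrete, a minimal sketch of what such a shared base might look like (the LRUMappingCache name and these method names are hypothetical, and the real LRUStoreCache bookkeeping differs):

    from collections import OrderedDict
    from threading import Lock

    class LRUMappingCache(object):
        """Shared LRU machinery that both cache classes could inherit."""

        def __init__(self, max_size=None):
            self._max_size = max_size     # max total size in bytes; None = unbounded
            self._current_size = 0
            self._values = OrderedDict()  # key -> (value, size), oldest first
            self._mutex = Lock()

        def _cache_value(self, key, value, size):
            with self._mutex:
                if key in self._values:
                    _, old_size = self._values.pop(key)
                    self._current_size -= old_size
                if self._max_size is not None and size > self._max_size:
                    return  # the value can never fit, so don't cache it
                # evict least-recently-used entries until the new value fits
                while (self._max_size is not None
                       and self._current_size + size > self._max_size):
                    _, (_, evicted_size) = self._values.popitem(last=False)
                    self._current_size -= evicted_size
                self._values[key] = (value, size)
                self._current_size += size

        def _get_cached_value(self, key):
            with self._mutex:
                value, _ = self._values[key]   # raises KeyError on a cache miss
                self._values.move_to_end(key)  # mark as most recently used
                return value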

@shikharsg (Member, Author)

Thanks @joshmoore. Turned out to be an easy fix actually.

@joshmoore (Member)

What's the next step here?

@shikharsg (Member, Author)

In terms of functionality, it's all working, but there were some concerns regarding refactoring of the code, so this PR needs a detailed review. Otherwise it's good to go in.

@martindurant (Member)

Quick question: does this layer respect get/setitems for concurrent access with fsspec?

I ask because fsspec, of course, has its own idea of caching. That is file-oriented rather than chunk-wise, but with ReferenceFileSystem, chunks at other remote URLs can be regarded as standalone files too. It would be nice to have fsspec's caching layer also respect LRU semantics (i.e., a maximum number of entries, or a maximum cache size). That is orthogonal, though, and shouldn't delay this PR any further.

@niowniow commented Jun 28, 2021

Any chance this can be reviewed? After quite some profiling, I realized that LRUStoreCache does not offer the performance benefits I expected, because of this decoding issue.

If I can do anything to move this PR forward, let me know.

@joshmoore (Member) left a review comment

A few minor wording suggestions from a quick read. There are also now a few conflicts with the mainline.

At a very high level, my biggest question would be whether or not this could have been a concern of the LRUStoreCache, if it had been possible to more reliably detect which keys represent chunks.

can provide, as there may be opportunities to optimise further either within
Zarr or within the mapping interface to the storage.
The above :class:`zarr.storage.LRUStoreCache` wraps any Zarr storage class, and stores
encoded chunks. So every time cache is accessed, the chunk has to be decoded. For cases
Suggested change
encoded chunks. So every time cache is accessed, the chunk has to be decoded. For cases
encoded chunks. Every time the cache is accessed, the chunk must be decoded. For cases

Mapping to store decoded chunks for caching. Can be used in repeated
chunk access scenarios when decoding of data is computationally
expensive.
NOTE: When using the write cache feature with object arrays(i.e.
Suggested change
NOTE: When using the write cache feature with object arrays(i.e.
NOTE: When using the write cache feature with object arrays (i.e.


@niowniow commented Jul 5, 2021

I think the issue is mainly that the Array class is responsible for decompressing. Is there a reason the de/compressor is implemented this way? I haven't checked through the code in detail yet. Please bear with me if the following is nonsense:

My idea, basically, is to move all functionality of the compressor out of the Array class. As an alternative, one could implement a compressor wrapper that wraps a store, similar to how LRUStoreCache wraps one; then we could write, e.g., store = LRUStoreCache(CompressorWrapper(DirectoryStore(...))) and the LRUChunkCache wouldn't be necessary. Or is there another benefit to using the LRUChunkCache?

In the Array class we could remove __compressor and instead wrap the store with CompressorWrapper if there is no CompressorWrapper in the store chain. I guess the default would be to wrap the store unless the user explicitly disables it because they have wrapped it themselves. All parameters given to Array can just be propagated through to its CompressorWrapper instance

Edit: This would then probably require distinguishing chunk keys from other data. (A rough sketch of the wrapper idea follows.)
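
A hypothetical sketch of the CompressorWrapper idea (none of this exists in Zarr; the class name and constructor are made up for illustration):

    from collections.abc import MutableMapping

    class CompressorWrapper(MutableMapping):
        """Wraps a store so values are encoded on write and decoded on read,
        so a cache wrapped around it (e.g. LRUStoreCache) holds decoded data."""

        def __init__(self, store, compressor):
            self._store = store            # any MutableMapping-style Zarr store
            self._compressor = compressor  # a numcodecs codec, e.g. Blosc()

        def __getitem__(self, key):
            return self._compressor.decode(self._store[key])

        def __setitem__(self, key, value):
            self._store[key] = self._compressor.encode(value)

        def __delitem__(self, key):
            del self._store[key]

        def __iter__(self):
            return iter(self._store)

        def __len__(self):
            return len(self._store)

As the edit above notes, a real version would also need to pass metadata keys such as '.zarray' through untouched, i.e. it would have to distinguish chunk keys from other data.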

@niowniow

I had another look at this PR. There is one corner case not covered: if write_direct (here) is true, i.e. the write covers one whole chunk exactly, that chunk will not be cached. I don't know if that was intended.

@joshmoore (Member)

Merged v2.9.3 into this branch.

@joshmoore (Member)

@ericpre: from hyperspy/hyperspy#2798 (comment), do I understand correctly that you've tested this and it's working for you?

@ericpre (Contributor) commented Sep 23, 2021

> @ericpre: from hyperspy/hyperspy#2798 (comment), do I understand correctly that you've tested this and it's working for you?

Yes, I have tested this branch and I can confirm the speed improvement. If you would like more details, I would need to check again, but I won't have time to do that for a week or so.

@croth1 commented Oct 20, 2022

I would be very interested in this functionality. Anything I can do that would help finish this up?

@jakirkham (Member)

Probably fixing merge conflicts is the first thing. Agree this is something we'd like to see included.

@claesenm

I'm also quite interested in this feature, but it seems the PR has gone stale. Are there any remaining open issues to be addressed? Having gone through the thread, it looks ready for merging.

@joshmoore (Member)

Hi @claesenm. The primary issue will be the merge conflicts:

zarr/__init__.py
zarr/core.py
zarr/creation.py
zarr/hierarchy.py
zarr/storage.py
zarr/tests/test_storage.py

If you or anyone else could try to create a merge commit, that would certainly help. (Thanks!)

@jhamman (Member) commented Dec 7, 2023

For those following this issue, I think the most important thing right now is to look at #1583 and help us think about how this functionality should be supported in zarr-python's 3.0 API. There's a lot of surface area in this PR, and I want to make sure we cover the functionality going forward if it's important.

@jhamman jhamman added the stale label Dec 7, 2023
@jhamman (Member) commented Feb 13, 2024

Closing this out as stale. We'll keep the feature request open.

@jhamman jhamman closed this Feb 13, 2024