In-memory representation of chunks: array instead of a dict? #33
There are basically three pieces of data you need for a chunk reference. This is roughly what our data structure looks like in Arraylake:

```python
class ReferenceData:
    uri: str
    offset: int
    length: int
```

Storing these in separate Zarr arrays would offer major advantages in terms of compression. For the int data, we could use the delta codec, plus a lossless compressor like Zstd, to massively crush down the data. For the uris, VlenUTF8 (plus lossless compression) would work great. It's likely that we could store millions of references this way using < 1MB of storage.
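To illustrate why delta-plus-lossless compression works so well here, a minimal sketch with stdlib zlib standing in for Zstd (the chunk size and regular layout are made-up illustration values, not anything from Arraylake):

```python
import zlib
import numpy as np

# Hypothetical manifest: 1 million chunk offsets from a regular layout.
n = 1_000_000
offsets = np.arange(n, dtype=np.uint64) * 4096

# Delta-encode: regular strides become a constant run, which any
# lossless compressor crushes down to almost nothing.
deltas = np.diff(offsets, prepend=offsets[:1])
compressed = zlib.compress(deltas.tobytes(), level=9)

print(offsets.nbytes)                           # 8000000 bytes raw
print(len(compressed) < offsets.nbytes // 100)  # True
```

Real offsets won't be perfectly regular, but as long as chunk lengths cluster around a typical value, the deltas stay highly compressible.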
This seems like a cool idea but possibly orthogonal to the design issue I'm talking about above? Sounds like you're suggesting a particular on-disk storage format for the chunk references. In this issue I'm talking only about the in-memory representation of the chunked n-dimensional array + manifest. We can write that to disk in a number of ways (as kerchunk json, as kerchunk parquet, or as your triple-zarr-array suggestion here, etc.).

EDIT: That's a great point about compression though - the URL and length fields are likely to exhibit very little variation over the whole array, and so compress down to almost nothing.
Yeah I see what you mean. In my defense, the title of this issue does begin with the word "Store"! 😆 For the issue you are talking about (in-memory representation of references), I think having an array indexed by chunk positions makes a lot of sense. The main downside would be in the case of very sparsely populated manifests (relative to the full array), in which case this would involve a lot of unused memory compared to the dict. I suppose you could opt to use an in-memory sparse array for that case.
Getting super meta here... What if we used an Xarray Dataset for the in-memory references, with three different variables (`path`, `offset`, `length`)? There is something very satisfying about the idea of using Xarray itself to manage the manifests.
Interesting... I'm not seeing how that would really work though. We need to use different xarray Variables at the top level for different netCDF variables / zarr arrays, so we would need 3 variables (path, offset, length) per variable (lat, lon, temp, etc.). Or are you suggesting using xarray twice, at two levels of abstraction? Once to hold the chunk grid in each ManifestArray, and once to hold all the ManifestArrays. That would be kind of wild.

Another idea along the same lines would be to store the manifest entries in a structured numpy array, i.e. using a structured dtype. But that would require xarray to be able to wrap structured numpy arrays, which I bet it can't right now.

EDIT: Actually xarray wouldn't have to directly wrap the structured array; it would wrap a ManifestArray that wraps a structured array... That could actually work...

Also, the concat_manifests isn't really that complicated in my opinion. The concatenation and broadcasting of the chunk manifests is pretty easy to describe just as manipulation of the chunk keys; that's one of the things I like about this design. It works well as an abstraction - it's only a problem if we think it's really inefficient or can't represent some case we need.
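A minimal sketch of what that structured-dtype manifest could look like. The field names and the fixed 100-char path width are assumptions for illustration (and note numpy can't put a variable-length string dtype inside a structured array, so a fixed width is forced here):

```python
import numpy as np

# One manifest entry per chunk, as a structured dtype.
entry = np.dtype([("path", "U100"), ("offset", np.uint64), ("length", np.uint64)])

# A 2x2 chunk grid is just a 2x2 structured array.
manifest = np.zeros((2, 2), dtype=entry)
manifest[0, 0] = ("s3://bucket/file0.nc", 0, 4096)
manifest[0, 1] = ("s3://bucket/file0.nc", 4096, 4096)

# Concatenating two manifests along an axis is plain numpy concatenation,
# no per-key dict manipulation required.
combined = np.concatenate([manifest, manifest], axis=1)
print(combined.shape)              # (2, 4)
print(combined["offset"].dtype)    # uint64
```

Field access like `combined["offset"]` gives back an ordinary integer array, which is what makes the "concat as array manipulation" framing attractive.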
This structured array idea could work nicely... You have a ManifestArray which wraps the structured array. The structured array has 3 fields, for path, offset and length. All concatenation / broadcasting of the ManifestArray is then implemented via operations on the underlying structured array. Or you could go even more meta and have a ManifestArray wrapping an xarray Dataset holding those fields.
One downside of all these in-memory array ideas compared to the dictionary chunk manifest we currently have is that I don't know how the array would support missing chunks. The structured array wouldn't have anywhere you could put a NaN either.

EDIT: I guess a path containing only an empty string could be understood to represent a missing chunk?
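The empty-string convention is cheap to check, since it falls out of a vectorised comparison on the path field. A sketch (dtype and field names assumed, as above):

```python
import numpy as np

entry = np.dtype([("path", "U100"), ("offset", np.uint64), ("length", np.uint64)])

# np.zeros initialises every path to "", i.e. every chunk starts missing.
manifest = np.zeros((2, 2), dtype=entry)
manifest[0, 0] = ("s3://bucket/file0.nc", 0, 4096)

# Boolean mask of missing chunks via a vectorised comparison.
missing = manifest["path"] == ""
print(missing.sum())  # 3 of the 4 chunks are missing
```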
The current dictionary encoding is basically a poor man's sparse array. 😆
Or we could have a separate bitmask array. The nice thing about manifest-as-dataset is that you can attach all kinds of chunk-level statistics - for example, chunk min, max, sum, count, etc.
Haha yeah it kinda is 😅
Okay so there are potentially multiple solutions to the missing chunk issue. @jhamman pointed out today that the main use case of this library is arrays where you do have every chunk (because netCDF files don't just omit chunks), so I don't think we are really talking about very sparse arrays anyway. The main reason to be able to represent NaNs is for padding with them (#22).
You could do that in a structured array too, just by having extra fields. Not sure what the use case of those chunk-level statistics is in the context of this library though. I'm tempted to make a PR to try out the structured array idea, as I feel like that's the most memory-optimized data structure we can use to represent the manifest (without writing something ourselves in Rust, #23).
See #104 (comment) for a simple experiment showing that using 3 (dense) numpy arrays we should be able to represent a manifest that points to 1 million chunks using only ~24MB in memory. Also note that apparently it isn't possible to put numpy 2.0's variable-length string dtype into a numpy structured array (see Jeremy's comment zarr-developers/zarr-specs#287 (comment)), which means I need to change how I had started implementing #39 to use 3 separate arrays (for path, offset, and length) instead.
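The integer part of that ~24MB estimate is easy to sanity-check: two dense `uint64` arrays for a million chunks cost exactly 16MB, leaving the remainder for the paths (whose cost depends on the string dtype chosen). A quick back-of-envelope check:

```python
import numpy as np

n = 1_000_000
offsets = np.empty(n, dtype=np.uint64)
lengths = np.empty(n, dtype=np.uint64)

# 8 bytes per entry per array -> 16 MB for the two integer arrays.
print((offsets.nbytes + lengths.nbytes) / 1e6)  # 16.0
```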
FWIW I think the 3 array approach will be more performant: https://numpy.org/doc/stable/user/basics.rec.html#introduction
Currently chunks are stored as a mapping of chunk keys to chunk entries, i.e. a manifest dict. The `ManifestArray` is just a thin wrapper over this; it's not really an array at all internally. The main purpose of it is to do lazy concatenation, which is implemented via manipulating manifest dicts.

However, as was pointed out by @dcherian in fsspec/kerchunk#377 (comment), a lazily concatenated array is essentially just a chunked array. We could imagine an alternative design where `ManifestArray` works more like a dask array, which holds a grid of chunks, each of which is itself an array (i.e. numpy arrays).

It might make sense to re-implement `ManifestArray` to organise chunks the same way dask does, perhaps storing some sort of `ChunkReferenceArray` object in place of numpy arrays. This would then handle concatenation at the chunk level, allow for indexing (as long as the indexers aligned with chunk boundaries), and possibly be implemented by vendoring code from inside dask.

In this design the `ChunkManifest` would be something that could be built from the `ManifestArray` when needed, not the fundamental object. This change could be done without changing the existing API of `ManifestArray`.
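The dask-like design can be sketched with an object-dtype grid whose cells are chunk references rather than numpy blocks. `ChunkRef` and `concat_grids` are illustrative names, not part of any real API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ChunkRef:
    """One cell of the chunk grid: a pointer into a file, not the data."""
    path: str
    offset: int
    length: int


def concat_grids(a: np.ndarray, b: np.ndarray, axis: int = 0) -> np.ndarray:
    # Lazy concatenation: only the grids of references are joined;
    # no chunk data is read or moved.
    return np.concatenate([a, b], axis=axis)


# Build a 2x2 grid of references, dask-style.
grid = np.empty((2, 2), dtype=object)
for i in range(2):
    for j in range(2):
        grid[i, j] = ChunkRef("file.nc", (i * 2 + j) * 4096, 4096)

big = concat_grids(grid, grid, axis=1)
print(big.shape)  # (2, 4)
```

Indexing that aligns with chunk boundaries then reduces to slicing the grid, which is the property the issue is after.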