Cache chunks in RAM #9
Simplify caching! And restrict it to a single layer! Caching should exist only in a thin layer. See if we can satisfy requests (even partially) from the cache. If so, there are several options:
Cache miss: if the user provided a buffer, then write into that buffer, and make a note to copy that data into LSIO's cache after the read. If the user didn't provide a buffer, then allocate one now, and make a note that LSIO owns this buffer, so the buffer can be used as LSIO's cache and a slice can be returned to the user. Need to think about what happens when the planning stage reads more than the user requested: how do we cache the unrequested data?
Consider using a
Think about using a rope data structure for storing the cache. @clbarnes first introduced me to "rope data structures" in his comment here: apache/arrow-rs#4612 (comment). Each File would have its own "rope". As we load chunks of that file into RAM, those chunks would be represented by leaf nodes in the rope. A "cache hit" would be when we can fully satisfy a user's request from the rope. Ropes are nice because insertion is cheap (see the Wikipedia page for a comparison of the performance characteristics of a rope vs an array), because they naturally represent the ordering of the chunks, and because they make it clear how to implement rebalancing / merging operations. But I need to think more about the performance of a rope vs a hash map. Perhaps I should start with a hash map (because implementations exist in the standard library), and move to a rope if the hash map doesn't work out?
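To make the rope idea a bit more concrete, here is a minimal sketch (invented names, not LSIO code) of a rope whose leaves are cached chunks of one file:

```rust
/// Hypothetical rope for one file's cached chunks. Leaves hold contiguous
/// byte ranges read from disk; internal nodes keep the chunks ordered and
/// make it cheap to insert a newly-read chunk between existing ones.
enum CacheRope {
    Leaf {
        /// Offset of this chunk within the file, in bytes.
        file_offset: u64,
        /// The cached bytes starting at `file_offset`.
        data: Vec<u8>,
    },
    Node {
        /// End (exclusive) of the file range covered by the left subtree,
        /// so a lookup for a given offset knows which side to descend.
        left_end: u64,
        left: Box<CacheRope>,
        right: Box<CacheRope>,
    },
}
```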
I guess the important thing is that the outer cache struct keeps the same interface, then the rest is just an implementation detail. Would this Map exist for each File, and be keyed on the byte range requested? In that case you'd need to iterate through the hashmap looking for ranges every time you wanted to read. A BTreeMap might be more efficient in that case, although the start index obviously isn't the only information you want to keep track of.
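For example, a BTreeMap keyed on each chunk's start offset supports the "which chunk covers this range?" lookup without scanning every entry. A minimal sketch (invented names, not LSIO code):

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Hypothetical per-file cache keyed on the start offset of each cached chunk.
struct FileCache {
    /// Start offset within the file -> cached bytes starting at that offset.
    chunks: BTreeMap<u64, Vec<u8>>,
}

impl FileCache {
    /// Return the cached bytes covering `wanted` if a single chunk fully contains it.
    fn get(&self, wanted: Range<u64>) -> Option<&[u8]> {
        // The only chunk that can contain `wanted.start` is the one with the
        // greatest start offset <= wanted.start, which `range(..=start)` finds
        // without iterating over the whole map.
        let (&start, chunk) = self.chunks.range(..=wanted.start).next_back()?;
        let end = start + chunk.len() as u64;
        if wanted.end <= end {
            Some(&chunk[(wanted.start - start) as usize..(wanted.end - start) as usize])
        } else {
            None // Partial (or no) hit: fall back to, or combine with, a disk read.
        }
    }
}
```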
All good questions! (as always!) I must admit I haven't put a huge amount of thought into the implementation of the cache yet. I'm deliberately postponing designing the cache until I've gotten something working (without a cache) because my brain keeps exploding when trying to design something with so many unknowns 🙂!
Why cache data in LSIO?
I'd like LSIO to do its own caching (and to always read using O_DIRECT), for several reasons:
Unstructured notes
The cache lookup probably needs to be the first step of the planning stage. We only want to load exactly what we need from the cache.
The planning stage can't do all the work, because the planning stage doesn't know about any transform functions. Instead, the planning stage needs to express "load these chunks from cache. Load these other chunks from disk". (We don't want both the planning stage and the loading stage to have to perform cache lookups, because that's expensive).
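A sketch of what the planning stage's output might look like, so that the cache lookup happens exactly once (illustrative names only, not LSIO's actual types):

```rust
use std::ops::Range;

/// Hypothetical output of the planning stage: each requested byte range is
/// resolved to either the cache or a disk read, so the loading stage never
/// has to repeat the (expensive) cache lookup.
enum PlannedRead {
    /// Already cached: the loading stage just returns a slice of the cached chunk.
    FromCache { file_range: Range<u64> },
    /// Not cached: the loading stage must issue an O_DIRECT read.
    FromDisk { file_range: Range<u64> },
}
```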
Keep track of the time that each Vector was accessed, and the slice(s) accessed.
Maybe make a separate Rust crate for caching.
Just cache the raw data from IO. (Let's avoid caching the processed data: the processed data takes up more RAM and requires the caching mechanism to know which transforms have been applied to each chunk.)
Let users choose to cache all the data read from disk (e.g. when merging nearby reads, but mark the unwanted chunks as having been read "zero times"); or to cache just the chunks that the user requested; or to cache nothing!
Let the user configure read-ahead or read-behind caching (e.g. if the user requests 4/5ths of a file, allocate a RAM buffer large enough for the entire file, but first read only what the user requested, returning a slice into the RAM buffer; then, after giving the user what they requested, read the rest of the file in the background).
The user can specify the maximum amount of RAM to use for caching. Cached chunks will be evicted based on the time they were last accessed.
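A minimal sketch of what these user-facing knobs from the notes above might look like (the names are invented for illustration and are not LSIO's actual API):

```rust
/// Hypothetical cache configuration.
struct CacheConfig {
    /// Cache everything read from disk (including bytes read only because
    /// nearby reads were merged), or only the ranges the user asked for.
    cache_unrequested_bytes: bool,
    /// Read the rest of a mostly-requested file in the background.
    read_ahead: bool,
    /// Upper bound on RAM used for cached chunks, in bytes. When exceeded,
    /// the least-recently-accessed chunks are evicted first.
    max_cache_bytes: usize,
}
```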
Allow the user to observe:
Consider this situation: in the first call, the user loads all of a 1 GB file. From then on, the user only reads the last 1 kB of that file. We shouldn't keep the entire file in RAM. Instead, we should move the last 1 kB to a new Vector, and throw away the 1 GB buffer. That's why we need to record the slice(s) requested for each Vector.
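A sketch of the per-buffer bookkeeping this would need (illustrative names only): tracking the last access time and the accessed sub-ranges lets the cache shrink a large buffer down to the small slice that is actually being reused.

```rust
use std::ops::Range;
use std::time::Instant;

/// Hypothetical bookkeeping for one cached buffer ("Vector").
struct CachedBuffer {
    /// The cached bytes.
    data: Vec<u8>,
    /// When this buffer was last used to satisfy a request (drives eviction).
    last_accessed: Instant,
    /// Byte ranges within `data` that have been returned to the user; if only
    /// a small sub-range keeps being accessed, copy it into a new, smaller
    /// buffer and free this one.
    accessed_ranges: Vec<Range<usize>>,
}
```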
Data chunks returned to the user will always be immutable. This allows LSIO's caching mechanism to be faster at returning cache hits than the operating system's page cache, because LSIO doesn't have to memcopy anything. In contrast, the OS has to memcopy from the page cache into the process's address space.
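One way to get that zero-copy behaviour, sketched with invented names: hand the user a reference-counted view into the immutable cached buffer, so a cache hit only bumps a reference count.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical zero-copy handle into a cached, immutable buffer.
/// Cloning bumps a reference count; no bytes are memcopied on a cache hit.
#[derive(Clone)]
struct CachedSlice {
    buffer: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl CachedSlice {
    /// Borrow the requested bytes directly out of the shared cache buffer.
    fn as_bytes(&self) -> &[u8] {
        &self.buffer[self.range.clone()]
    }
}
```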
For now, let's just make sure the design can handle caching. But don't implement caching for the MVP. Record ideas in this GitHub issue, and link to the issue from the design doc.