
Cache chunks in RAM #9

Open · JackKelly opened this issue Dec 5, 2023 · 5 comments
Labels
performance (Improvements to runtime performance)

Comments

@JackKelly (Owner) commented Dec 5, 2023

Why cache data in LSIO?

I'd like LSIO to do its own caching (and to always read using O_DIRECT), for several reasons:

  • LSIO's cache can be "aware" of the graph of computation. So it can "smartly" cache just the data that's going to be used for a subsequent computation.
  • LSIO's cache can avoid memory copies: If we rely on the operating system's page cache then even a cache hit will result in a memory copy from the OS page cache into the process' buffer. If, instead, the cache is in user-land (in LSIO) then we stand a chance of being able to use the cached data without any mem copies!

Unstructured notes

Cache lookup probably needs to be the first step of the planning stage. We only want to load exactly what we need from cache.

The planning stage can't do all the work, because the planning stage doesn't know about any transform functions. Instead, the planning stage needs to express "load these chunks from cache. Load these other chunks from disk". (We don't want both the planning stage and the loading stage to have to perform cache lookups, because that's expensive).
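
For example, the plan could tag each chunk with its source, so the loading stage never repeats the cache lookup. A minimal Rust sketch with hypothetical types (not LSIO's actual API):

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical output of the planning stage for one requested chunk.
enum PlannedRead {
    /// The bytes are already in LSIO's cache; hand out a reference to them.
    FromCache { cached: Arc<[u8]>, slice: Range<usize> },
    /// The bytes must be read from storage (via O_DIRECT) into a new buffer.
    FromDisk { byte_range: Range<u64> },
}
```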

Keep track of the time that each Vector was accessed, and the slice(s) accessed.

Maybe make a separate Rust crate for caching.

Just cache the raw data from IO. (Let's avoid caching the processed data: the processed data takes up more RAM and requires the caching mechanism to know which transforms have been applied to each chunk.)

Let users choose to cache all the data read from disk (e.g. data pulled in when merging nearby reads, with the unwanted chunks marked as having been read "zero times"); or cache just the chunks that the user requested; or cache nothing!
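
A sketch of what that knob might look like (names are hypothetical):

```rust
/// Hypothetical user-facing caching policy.
enum CachePolicy {
    /// Cache everything read from storage, including bytes pulled in only
    /// because nearby reads were merged (marked as read "zero times").
    AllBytesRead,
    /// Cache only the chunks the user explicitly requested.
    RequestedChunksOnly,
    /// Disable caching entirely.
    Nothing,
}
```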

Let the user configure read-ahead or read-behind caching. (e.g. if the user requests 4/5ths of a file, allocate a RAM buffer large enough for the entire file, but first read only what the user requested (returning a slice into that buffer), and then, after giving the user what they requested, read the rest of the file in the background.)
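
One possible shape for that configuration, sketched with hypothetical names; the 4/5ths example above corresponds to a threshold of 0.8:

```rust
use std::ops::Range;

/// Hypothetical read-ahead configuration.
enum ReadAhead {
    /// Only ever read the byte ranges the user asked for.
    None,
    /// If a request covers at least `threshold` (a fraction) of a file,
    /// allocate a buffer for the whole file, satisfy the request first, then
    /// read the rest of the file in the background.
    WholeFile { threshold: f64 },
}

fn should_read_whole_file(cfg: &ReadAhead, requested: &Range<u64>, file_len: u64) -> bool {
    match cfg {
        ReadAhead::None => false,
        ReadAhead::WholeFile { threshold } => {
            (requested.end - requested.start) as f64 / file_len as f64 >= *threshold
        }
    }
}
```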

The user can specify the maximum amount of RAM to use for caching. Elements will be evicted based on the time they were last accessed (least-recently-accessed first).
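
A minimal sketch of that eviction rule, assuming each cached chunk records when it was last accessed (all names hypothetical):

```rust
use std::time::Instant;

/// Hypothetical cached chunk: the raw bytes plus when they were last used.
struct CachedChunk {
    bytes: Vec<u8>,
    last_accessed: Instant,
}

/// Evict least-recently-accessed chunks until the cache fits the RAM budget.
fn evict(chunks: &mut Vec<CachedChunk>, max_bytes: usize) {
    chunks.sort_by_key(|c| c.last_accessed); // oldest access first
    let mut total: usize = chunks.iter().map(|c| c.bytes.len()).sum();
    while total > max_bytes && !chunks.is_empty() {
        let evicted = chunks.remove(0);
        total -= evicted.bytes.len();
    }
}
```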

Allow the user to observe (see the sketch after this list):

  • the cache hit ratio
  • the total amount of RAM used for cache
  • the total amount of RAM used for data that has never been read (e.g. read-ahead that has never been read).
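
These could be exposed as a small stats struct, assuming the cache updates the counters on every lookup (names hypothetical):

```rust
/// Hypothetical counters the user could observe.
#[derive(Default)]
struct CacheStats {
    hits: u64,
    misses: u64,
    /// Total bytes currently held in the cache.
    bytes_cached: u64,
    /// Bytes held in the cache that have never been read
    /// (e.g. read-ahead that was never used).
    bytes_never_read: u64,
}

impl CacheStats {
    fn hit_ratio(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }
}
```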

Consider this situation: In the first call, the user loads all of a 1 GB file. From then on, the user only reads the last 1 kB of that file. We shouldn't keep the entire file in RAM. Instead, we should move the last 1 kB to a new Vector, and throw away the 1 GB buffer. That's why we need to record the slice(s) requested for each Vector.
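
The shrinking step itself is cheap; a sketch, assuming the cache knows which slice is still "hot":

```rust
use std::ops::Range;

/// Replace a large, mostly-cold buffer with a copy of just the slice that is
/// still being requested, so the big allocation can be freed.
fn shrink_to_hot_slice(buffer: Vec<u8>, hot: Range<usize>) -> Vec<u8> {
    // e.g. copy the last 1 kB into a fresh, right-sized Vec; the original
    // 1 GB allocation is dropped when `buffer` goes out of scope.
    buffer[hot].to_vec()
}
```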

Data chunks returned to the user will always be immutable. This allows LSIO's caching mechanism to be faster at returning cache hits than the operating system's page cache, because LSIO doesn't have to memcopy anything. In contrast, the OS has to memcopy from the page cache into the process' address space.
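
One way to get that behaviour in Rust is to store cached chunks as immutable, reference-counted buffers, so a cache hit is just a pointer bump. A sketch (not the actual LSIO types):

```rust
use std::sync::Arc;

/// Hypothetical immutable cached chunk.
struct ImmutableChunk {
    bytes: Arc<[u8]>,
}

impl ImmutableChunk {
    /// Return the cached bytes to the caller without copying them.
    fn get(&self) -> Arc<[u8]> {
        Arc::clone(&self.bytes)
    }
}
```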

For now, let's just make sure the design can handle caching. But don't implement caching for the MVP. Record ideas in this github issue, and link to the issue from the design doc.

@JackKelly (Owner, Author) commented Dec 9, 2023

Simplify caching! And restrict it to a single layer!

Caching should exist only in a thin layer after the planning, and before reading. UPDATE: actually, shouldn't caching exist between the user and the planning? I think we can ignore the few situations where thinking about the cache before the planning is useful (but I should check!).

See if we can satisfy requests (even partially) from cache. If so, there are several options (see the sketch after this list):

  1. If the user provided a buffer, then memcpy from cache into the buffer (and record the read) (possibly do the memcpy concurrently using Rayon's join)
  2. If the user didn't provide a buffer, and the cache fully satisfies the read, then give a slice reference to the cache as the buffer
  3. If the user didn't provide a buffer, and the cache only partially satisfies the read (or multiple cache elements are required to satisfy the read, or only part of it), then allocate an entire new buffer, and concurrently memcpy the part(s) of the cache which are useful. Then submit a read (from IO) into the uninitialised part of the vector. And make a note to copy the newly read data into LSIO's cache. (There could also be a "defrag" method on the cache.)
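
A rough sketch of options 2 and 3 (the no-user-buffer cases), with hypothetical types and the simplifying assumption that any partially cached bytes cover the start of the requested range:

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical result of asking the cache about one read request.
enum CacheLookup {
    /// Option 2: the cache fully satisfies the read.
    Full { cached: Arc<[u8]> },
    /// Option 3: only part of the request is cached.
    Partial { cached: Arc<[u8]>, missing: Range<u64> },
    Miss,
}

/// Sketch of satisfying a request when the user did NOT provide a buffer.
fn satisfy(lookup: CacheLookup, requested_len: usize) -> Vec<u8> {
    match lookup {
        // In the real design this would return a slice into the cache rather
        // than copying; `to_vec` is just to keep the sketch's types simple.
        CacheLookup::Full { cached } => cached.to_vec(),
        CacheLookup::Partial { cached, missing: _ } => {
            let mut buf = vec![0u8; requested_len];
            // Memcpy the useful part of the cache into the new buffer...
            buf[..cached.len()].copy_from_slice(&cached);
            // ...then (not shown) submit an IO read for the `missing` byte
            // range into the rest of `buf`, and note that the newly read
            // data should be copied into LSIO's cache afterwards.
            buf
        }
        // Cache miss: allocate now and (not shown) read everything from IO.
        CacheLookup::Miss => vec![0u8; requested_len],
    }
}
```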

Cache miss:

If the user provided a buffer, then write into that buffer, and make a note to copy that data into LSIO's cache after the read.

If the user didn't provide a buffer, then allocate one now, and make a note that LSIO owns this buffer, and so the buffer can be used as LSIO's cache and a slice can be returned to the user.
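
The bookkeeping for those two cache-miss cases might just be a note attached to the in-flight read (hypothetical names):

```rust
/// Hypothetical note attached to an in-flight read on a cache miss.
enum ReadTarget {
    /// The user supplied the buffer: write into it, then copy the data into
    /// LSIO's cache after the read completes.
    UserBuffer { copy_into_cache_after_read: bool },
    /// LSIO allocated the buffer: it can become the cache entry directly, and
    /// the user receives a slice into it.
    LsioOwned,
}
```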

Need to think about what happens when the planning stage reads more than the user requested. How to cache the unrequested data?

@JackKelly (Owner, Author) commented:

Consider using a Map<PathBuf, File> where each File struct would store the cached filesize and the cached data.

Then, maybe assign one File to each FileChunks, so that we can safely plan FileChunks in parallel, because each File will only be owned by one FileChunks object.
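
A sketch of that layout, with hypothetical field names:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical per-file cache state.
struct File {
    /// Cached result of a previous filesize lookup.
    cached_filesize: Option<u64>,
    /// Cached chunks of this file, keyed by their start byte offset.
    cached_chunks: HashMap<u64, Vec<u8>>,
}

/// One `File` per path; each `File` could then be owned by a single
/// `FileChunks` object so that planning can proceed in parallel per file.
type FileCache = HashMap<PathBuf, File>;
```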

@JackKelly (Owner, Author) commented Dec 15, 2023

Think about using a rope data structure for storing the cache. @clbarnes first introduced me to "rope data structures" in his comment here: apache/arrow-rs#4612 (comment)

Each File would have its own "rope". As we load chunks of that file into RAM, those chunks would be represented by leaf nodes in the rope. A "cache hit" would be when we can fully satisfy a user's request from the rope.

Ropes are nice because insertion is cheap (see the Wikipedia page for a comparison of the performance characteristics of a rope vs an array). Ropes also naturally represent the ordering of the chunks, and make it clear how to implement rebalancing / merging operations.

But I need to think more about the performance of a rope vs a hash map. Perhaps I should start with a hash map (because implementations exist in the standard library), and move to a rope if the hash map doesn't work out?

Also see: https://stackoverflow.com/questions/1863440/is-there-any-scenario-where-the-rope-data-structure-is-more-efficient-than-a-str

@clbarnes commented:

I guess the important thing is that the outer cache struct keeps the same interface, then the rest is just an implementation detail. Would this Map exist for each File, and be keyed on the byte range requested? In that case you'd need to iterate through the hashmap looking for ranges every time you wanted to read. A BTreeMap might be more efficient in that case, although the start index obviously isn't the only information you want to keep track of.
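
A minimal sketch of the BTreeMap version of this idea, keyed on the start byte of each cached chunk (hypothetical names; it ignores chunks that only partially overlap the request):

```rust
use std::collections::BTreeMap;

/// Hypothetical per-file cache keyed on the start byte of each cached chunk.
struct FileCache {
    chunks: BTreeMap<u64, Vec<u8>>,
}

impl FileCache {
    /// Find the cached chunk (if any) that contains `offset`.
    fn chunk_containing(&self, offset: u64) -> Option<(u64, &[u8])> {
        // The candidate is the chunk with the greatest start <= offset.
        let (&start, bytes) = self.chunks.range(..=offset).next_back()?;
        if offset < start + bytes.len() as u64 {
            Some((start, bytes.as_slice()))
        } else {
            None
        }
    }
}
```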

@JackKelly (Owner, Author) commented Dec 15, 2023

All good questions! (as always!)

I must admit I haven't put a huge amount of thought into the implementation of the cache yet. I'm deliberately postponing designing the cache until I've gotten something working (without a cache) because my brain keeps exploding when trying to design something with so many unknowns 🙂!

JackKelly added the "performance" label on Feb 10, 2024
JackKelly moved this to Todo in light-speed-io on Feb 15, 2024
JackKelly moved this from Todo to Backlog in light-speed-io on Feb 15, 2024