sort: Reduce memory overhead #2134
I'm not sure we have to abandon all precomputed values just yet. I was just testing this, and the SmallVec in the Line struct is a big hog. We can reduce the Line struct to 48 bytes just by changing to a regular Vec. Why don't we both test that first and see where we are after?
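As a rough illustration of the trade-off (a sketch with hypothetical stand-in types, not the actual uutils structs): inline storage reserves space in every line whether it's used or not, while a plain `Vec` is always three words, with the data on the heap:

```rust
use std::mem::size_of;
use std::ops::Range;

// Hypothetical stand-in for a per-line selection, just to compare sizes.
#[allow(dead_code)]
struct Selection {
    range: Range<usize>, // 16 bytes on 64-bit targets
}

// SmallVec-like inline storage: space for two selections is reserved
// in every line, used or not.
#[allow(dead_code)]
struct InlineLine {
    line: String,                       // 24 bytes
    selections: [Option<Selection>; 2], // inline, paid per line
}

// Plain Vec: always ptr + cap + len (24 bytes); data lives on the heap.
#[allow(dead_code)]
struct VecLine {
    line: String,               // 24 bytes
    selections: Vec<Selection>, // 24 bytes
}

fn main() {
    println!("InlineLine: {} bytes", size_of::<InlineLine>()); // larger
    println!("VecLine: {} bytes", size_of::<VecLine>());
}
```

Note that the `Vec` version only shrinks the struct itself; the selection data still has to live somewhere, which is exactly the caveat raised in the next comment.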
Yeah, I also tried this briefly. I will look at it again, but I think this regressed speed. It will not reduce overall memory usage because it just moves the memory to a different place.
So the memory overhead is similar? Huh. Yeah, then I guess I'm with you on precomputed values seeming like the thing we can lose if we have to.
Maybe you could store a
This could be an option, but would probably introduce more complexity. FieldSelectors are a global thing, but we'd need indices for each line. I'm also unsure how well this would work with extsort (sorting by using auxiliary disk memory to sort large files), but we can surely explore multiple approaches. This is more or less what I meant with "later we can try to put some of this back in a way that doesn't affect other use-cases". Maybe we can build indices for the first n lines of huge files or something like that to reduce the peak memory usage.
@miDeb since you've tinkered with the Line struct, do we have to take an owned value for each line string? I'm looking at this naively, but are there instances where we could take a `&str`? Could we make this a `Cow` and save a bunch of string allocs (or maybe avoid string allocs altogether)?
No, I think we need to allocate those strings. We get them from the BufReader and have to take ownership from it because it will not keep its buffer content around. We could avoid allocations in some cases in the transform function, as far as I can tell. For -b that's surely the case: we could just trim before comparing strings.
Sorry, I meant allocating a string buffer. So, alloc a String::with_capacity(4096) [the size of the BufReader buffer, or the --buffer-size limit, or whatever] into a Buffer_Chunks struct, then have the Line struct take references? I think this is what GNU sort does, looking at its memory behavior in Valgrind.
Thanks for clarifying, that makes sense! That way we can avoid sorting Strings (24 bytes afaik) and can use &str (16 bytes).
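A minimal sketch of that idea (names and buffer contents are illustrative, not the actual implementation): read a chunk into one owned buffer, then sort `&str` slices into it rather than per-line `String`s:

```rust
// Sketch: one owned buffer per chunk, lines as borrowed slices into it.
// No per-line String allocation; each line is just a (ptr, len) pair.
fn main() {
    // Stand-in for a buffer filled from a BufReader (e.g. 4096 bytes).
    let chunk: String = "banana\napple\ncherry\n".to_string();

    // Borrowed views into `chunk`; sorting moves only 16-byte slices.
    let mut lines: Vec<&str> = chunk.lines().collect();
    lines.sort_unstable();

    assert_eq!(std::mem::size_of::<&str>(), 16); // vs 24 for String (64-bit)
    println!("{:?}", lines); // ["apple", "banana", "cherry"]
}
```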
I reread your initial comment and realized I was suggesting something like what you were already talking about! Sorry for missing the point. Re: chunked sorting, just so someone remembers -- we should benchmark stable and unstable again -- because I remember reading in the library docs that the default unstable sort perhaps wouldn't be suited to doing a final pass over the chunks.
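For reference, the difference the stability guarantee makes (a self-contained sketch, not sort's actual code):

```rust
// Stable vs. unstable sorting with equal keys. A stable sort preserves the
// input order of ties, which a later merge pass over sorted chunks can rely
// on; sort_unstable makes no such guarantee (though it is often faster).
fn main() {
    let input = vec![(1, 'a'), (0, 'b'), (1, 'c')];

    let mut stable = input.clone();
    stable.sort_by_key(|&(k, _)| k); // ties keep input order: 'a' before 'c'
    assert_eq!(stable, vec![(0, 'b'), (1, 'a'), (1, 'c')]);

    let mut unstable = input.clone();
    unstable.sort_unstable_by_key(|&(k, _)| k);
    // Only the key order is guaranteed here; 'a' and 'c' may swap.
    let keys: Vec<_> = unstable.iter().map(|&(k, _)| k).collect();
    assert_eq!(keys, vec![0, 1, 1]);
}
```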
Alright, I did some experiments with removing precomputation for numeric comparisons. I don't think my initial hypothesis holds up: this results in a severe performance regression, and I don't know how valuable the improvement in memory usage is in practice. I'm seeing a 30% speed regression for -n and a 70% regression for -g (haven't implemented -h yet, but it would be somewhere in between). Other workloads aren't really affected in their performance, although memory usage goes down by about 15% for all workloads. The size of the Line struct is reduced from 96 to 72 bytes (a reduction of 25%). Maybe I'll find a change that has a clearer benefit.
Could you store the cache in a separate struct/hashmap for each function call, instead of per line?
Re: Field/Key, what about storing the field/key info in a separate struct (or structs) which merely corresponds to an instance of a line struct? So maybe Key=Line, Value=OtherStuff? Or perhaps a struct GlobalLine, which is only used if needed, and we rework the functions to take type T?
FYI, on BSD/macOS, --debug has some interesting output which will show you the items/values that are actually being compared. Right now, as I understand the Line struct, we are storing every value. When I look at BSD --debug, wouldn't a smaller cache make more sense? It seems like a 10-50 item cache might make performance dip by 5-10% but save bunches of memory.
With the changes I'm implementing in #2144 and some more changes in #2165, the size of the Line struct is reduced from 12 words (96 bytes) to 7 words (56 bytes). That's as far as I could get; not sure how much room for improvement is left. Here's how the struct looks now:

```rust
pub struct Line {
    line: Box<str>,
    first_selection: Selection,
    other_selections: Box<[Selection]>,
}

struct Selection {
    range: SelectionRange, // type alias for Range<usize>
    num_cache: Option<Box<NumCache>>,
}
```

The memory layout of
I'll also have to evaluate how much of an improvement this is in practice; #2165 alone looks like a 10% memory usage reduction.
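The 56-byte figure can be sanity-checked with stand-in types mirroring the layout above (the contents of `NumCache` here are a placeholder; only its pointer size matters):

```rust
use std::mem::size_of;
use std::ops::Range;

#[allow(dead_code)]
struct NumCache(f64); // placeholder contents; it sits behind a Box anyway

#[allow(dead_code)]
struct Selection {
    range: Range<usize>,              // 2 words (start, end)
    num_cache: Option<Box<NumCache>>, // 1 word: Box's null-pointer niche
}

#[allow(dead_code)]
pub struct Line {
    line: Box<str>,                     // 2 words (ptr + len)
    first_selection: Selection,         // 3 words
    other_selections: Box<[Selection]>, // 2 words (ptr + len)
}

fn main() {
    assert_eq!(size_of::<Selection>(), 24);
    assert_eq!(size_of::<Line>(), 56); // 7 words on a 64-bit target
}
```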
I tried dividing up the Line struct myself into Line: String (24 bytes) and Selections: Vec (24 bytes) structs, and my benchmarking puts our memory usage about even with GNU, and we're still about 1.5x as fast on the slower Linux machine, and 5-6x as fast as GNU on the faster macOS laptop (if still slowish). Appreciate your work on this. Neither of us likes a performance regression, but I'm still wondering if it isn't better to keep the memory usage low at the expense of out-and-out performance, because larger inputs are where this may matter (?). I'll push to my repo so you can take a look, if you care to. I still think there is an upside on the performance front to just caching the last 10-20 parse ops. May I ask -- why are you still holding onto the first and second selection within the Line struct proper? Is there a performance benefit?
I'm holding onto the first selection because I'm assuming that we'll usually only need to compare the first selection to find a difference between lines (and we always need to compare at least one!). Having the first selection directly on the struct is, I think, much faster than having it behind a pointer, but I'll check the exact numbers here. The other thing is a
Similarly, I really appreciate your suggestions; I'll have a look at your repo (though maybe not until next weekend).
I'd like to see a `/usr/bin/time -v` (BSD/macOS has different arguments) for several tools, to keep track of memory regressions. Many Rust buffered readers use unbounded memory.
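For example (the input file and path here are placeholders; `-v` is the GNU time flag on Linux):

```shell
# Generate a throwaway input, then record peak memory (max RSS) per tool.
seq 1 100000 | sort -R > /tmp/sort-input.txt

# GNU time (Linux): -v reports "Maximum resident set size" in KiB.
/usr/bin/time -v sort -n /tmp/sort-input.txt > /dev/null

# BSD/macOS time uses -l instead and reports max RSS in bytes:
# /usr/bin/time -l sort -n /tmp/sort-input.txt > /dev/null
```

Note that `/usr/bin/time` is the external binary, not the shell builtin `time`, which does not report memory.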
My latest, highly experimental attempt at improving sort performance can be found here: https://github.com/miDeb/coreutils/tree/ouroboros. Performance should be comparable to GNU sort for everything I tested, due to a few new (experimental) approaches. I hope I'll be able to split my changes into a number of incremental and merge-able PRs, but for now it's just an exploration into what could possibly be achieved. The biggest experimental change is likely the use of self-referential structs with a dependency on https://crates.io/crates/ouroboros, I guess.
I think I'm pretty satisfied with the current memory usage, although there could be some improvements for numeric sorting or sort keys (maybe using a more data-oriented approach). |
When I wrote #1996, I implemented this by pre-computing some values beforehand to speed up comparisons later. There are other cases where we rely on pre-computed values. While this can result in a speed improvement, it has the big disadvantage of increasing our overall memory usage. The way it is currently implemented also means that we pay a constant tax even for lines/files that don't need precomputed values. Because of this, I think we should not rely on precomputed values for our comparisons. We can later experiment with adding them back for cases where they bring significant improvements without impacting other use-cases.
Places where pre-computed values are used:

- numeric comparisons (`-h` and `-n`): The current algorithm expects pre-computed values to be cached. To achieve similar performance without caching, it needs to be rewritten. Expected performance impact: a small performance regression, but improvements for other use-cases and for already sorted files.
- general numeric comparisons: Parsed `f64` values are cached. Reverting this should be relatively straightforward and should only be a minor performance regression. See sort: Reduce memory overhead #2134 (comment); this may not be as useful as I initially thought.

Generally, the goal is to reduce the `Line` struct, which is currently 96 bytes big. This is due to the precomputed values for numeric string comparison and the indices for field support we have to reserve space for. The goal is to reduce this to just a `&str`, i.e. 16 bytes.

cc @electricboogie does this roughly sound ok to you? If so, I'll start implementing this in the next days.
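To make the trade-off concrete, here is a minimal sketch (not the actual uutils code) of what a cache-free general numeric comparison looks like: the `f64` is re-parsed on every comparison instead of being stored per line:

```rust
use std::cmp::Ordering;

// Cache-free comparison: parse on every call. This saves the per-line
// cache slot but re-does the parse O(n log n) times during a sort.
// (Error handling is simplified; real sort has rules for invalid numbers.)
fn general_numeric_cmp(a: &str, b: &str) -> Ordering {
    let fa = a.trim().parse::<f64>().unwrap_or(f64::NEG_INFINITY);
    let fb = b.trim().parse::<f64>().unwrap_or(f64::NEG_INFINITY);
    fa.partial_cmp(&fb).unwrap_or(Ordering::Equal)
}

fn main() {
    let mut lines = vec!["3.14", "-2e3", "10"];
    lines.sort_by(|a, b| general_numeric_cmp(a, b));
    assert_eq!(lines, vec!["-2e3", "3.14", "10"]);
}
```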