sort: Reduce memory overhead #2134
I'm not sure we have to abandon all precomputed values just yet. I was just testing this, and the SmallVec in the Line struct is a big hog. We can reduce the Line struct to 48 bytes just by changing to a regular Vec. Why don't we both test that first and see where we are after?
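As a rough illustration of the trade-off (a sketch with hypothetical stand-in types, not the actual uutils structs): inline storage reserves space in every line whether it's used or not, while a plain `Vec` is always three words, with the data on the heap:

```rust
use std::mem::size_of;
use std::ops::Range;

// Hypothetical stand-in for a per-line selection, just to compare sizes.
#[allow(dead_code)]
struct Selection {
    range: Range<usize>, // 16 bytes on 64-bit targets
}

// SmallVec-like inline storage: space for two selections is reserved
// in every line, used or not.
#[allow(dead_code)]
struct InlineLine {
    line: String,                       // 24 bytes
    selections: [Option<Selection>; 2], // inline, paid per line
}

// Plain Vec: always ptr + cap + len (24 bytes); data lives on the heap.
#[allow(dead_code)]
struct VecLine {
    line: String,               // 24 bytes
    selections: Vec<Selection>, // 24 bytes
}

fn main() {
    println!("InlineLine: {} bytes", size_of::<InlineLine>()); // larger
    println!("VecLine: {} bytes", size_of::<VecLine>());
}
```

Note that the `Vec` version only shrinks the struct itself; the selection data still has to live somewhere, which is exactly the caveat raised in the next comment.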
Yeah, I also tried this briefly. I will look at it again, but I think this regressed speed. It will not reduce overall memory usage because it just moves the memory to a different place.
So the memory overhead is similar? Huh. Yeah, then I guess I'm with you on precomputed values seeming like the thing we can lose if we have to.
Maybe you could store a
This could be an option, but would probably introduce more complexity. FieldSelectors are a global thing, but we'd need indices for each line. I'm also unsure how well this would work with extsort (sorting by using auxiliary disk memory to sort large files), but we can surely explore multiple approaches. This is more or less what I meant with "later we can try to put some of this back in a way that doesn't affect other use-cases". Maybe we can build indices for the first n lines of huge files or something like that to reduce the peak memory usage.
@miDeb since you've tinkered with the Line struct, do we have to take an owned value for each line string? I'm looking at this naively, but are there instances where we could take a `&str`? Could we make this a `Cow` and save a bunch of string allocs (or maybe avoid string allocs altogether)?
No, I think we need to allocate those strings. We get them from the BufReader and have to take ownership from it because it will not keep its buffer content around. We could avoid allocations in some cases in the transform function, as far as I can tell. For -b that's surely the case: we could just trim before comparing strings.
Sorry, I meant allocating a string buffer. So, alloc a String::with_capacity(4096) [the size of the BufReader buffer, or the --buffer-size limit, or whatever] into a Buffer_Chunks struct, then have the Line struct take references? I think this is what GNU sort does, looking at its memory behavior in Valgrind.
Thanks for clarifying, that makes sense! That way we can avoid sorting Strings (24 bytes afaik) and can use &str (16 bytes).
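A minimal sketch of that idea (names and buffer contents are illustrative, not the actual implementation): read a chunk into one owned buffer, then sort `&str` slices into it rather than per-line `String`s:

```rust
// Sketch: one owned buffer per chunk, lines as borrowed slices into it.
// No per-line String allocation; each line is just a (ptr, len) pair.
fn main() {
    // Stand-in for a buffer filled from a BufReader (e.g. 4096 bytes).
    let chunk: String = "banana\napple\ncherry\n".to_string();

    // Borrowed views into `chunk`; sorting moves only 16-byte slices.
    let mut lines: Vec<&str> = chunk.lines().collect();
    lines.sort_unstable();

    assert_eq!(std::mem::size_of::<&str>(), 16); // vs 24 for String (64-bit)
    println!("{:?}", lines); // ["apple", "banana", "cherry"]
}
```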
I reread your initial comment and realized I was suggesting something like what you were already talking about! Sorry for missing the point. Re: chunked sorting, just so someone remembers -- we should benchmark stable and unstable again -- because I remember reading in the library docs that the default unstable sort perhaps wouldn't be suited to doing a final pass over the chunks.
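For reference, the difference the stability guarantee makes (a self-contained sketch, not sort's actual code):

```rust
// Stable vs. unstable sorting with equal keys. A stable sort preserves the
// input order of ties, which a later merge pass over sorted chunks can rely
// on; sort_unstable makes no such guarantee (though it is often faster).
fn main() {
    let input = vec![(1, 'a'), (0, 'b'), (1, 'c')];

    let mut stable = input.clone();
    stable.sort_by_key(|&(k, _)| k); // ties keep input order: 'a' before 'c'
    assert_eq!(stable, vec![(0, 'b'), (1, 'a'), (1, 'c')]);

    let mut unstable = input.clone();
    unstable.sort_unstable_by_key(|&(k, _)| k);
    // Only the key order is guaranteed here; 'a' and 'c' may swap.
    let keys: Vec<_> = unstable.iter().map(|&(k, _)| k).collect();
    assert_eq!(keys, vec![0, 1, 1]);
}
```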
Alright, I did some experiments with removing precomputation for numeric comparisons. I don't think my initial hypothesis holds up: this results in a severe performance regression, and I don't know how valuable the improvement in memory usage is in practice. I'm seeing a 30% speed regression for -n and a 70% regression for -g (haven't implemented -h yet, but it would be somewhere in between). Other workloads aren't really affected in their performance, although memory usage goes down by about 15% for all workloads. The size of the Line struct is reduced from 96 to 72 bytes (a reduction of 25%). Maybe I'll find a change that has a clearer benefit.
Could you store the cache in a separate struct/hashmap for each function call, instead of per line?
Re: Field/Key, what about storing the field/key info in a separate struct (or structs) which merely corresponds to an instance of a line struct? So maybe Key=Line, Value=OtherStuff? Or perhaps a struct GlobalLine, which is only used if needed, and we rework the functions to take type T?
FYI, on BSD/macOS, --debug has some interesting output which will show you the items/values that are actually being compared. Right now, as I understand the Line struct, we are storing every value. When I look at BSD --debug, wouldn't a smaller cache make more sense? It seems like a 10-50 item cache might make performance dip by 5-10% but save bunches of memory.
With the changes I'm implementing in #2144 and some more changes in #2165, the size of the Line struct is reduced from 12 words (96 bytes) to 7 words (56 bytes). That's as far as I could get; not sure how much room for improvement is left. Here's how the struct looks now:

```rust
pub struct Line {
    line: Box<str>,
    first_selection: Selection,
    other_selections: Box<[Selection]>,
}

struct Selection {
    range: SelectionRange, // type alias for Range<usize>
    num_cache: Option<Box<NumCache>>,
}
```

The memory layout of
I'll also have to evaluate how much of an improvement this is in practice; #2165 alone looks like a 10% memory usage reduction.
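The 56-byte figure can be sanity-checked with stand-in types mirroring the layout above (the contents of `NumCache` here are a placeholder; only its pointer size matters):

```rust
use std::mem::size_of;
use std::ops::Range;

#[allow(dead_code)]
struct NumCache(f64); // placeholder contents; it sits behind a Box anyway

#[allow(dead_code)]
struct Selection {
    range: Range<usize>,              // 2 words (start, end)
    num_cache: Option<Box<NumCache>>, // 1 word: Box's null-pointer niche
}

#[allow(dead_code)]
pub struct Line {
    line: Box<str>,                     // 2 words (ptr + len)
    first_selection: Selection,         // 3 words
    other_selections: Box<[Selection]>, // 2 words (ptr + len)
}

fn main() {
    assert_eq!(size_of::<Selection>(), 24);
    assert_eq!(size_of::<Line>(), 56); // 7 words on a 64-bit target
}
```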
I tried dividing up the Line struct myself into Line: String (24 bytes) and Selections: Vec (24 bytes) structs, and my benchmarking puts our memory usage about even with GNU, and we're still about 1.5x as fast on the slower Linux machine, and 5-6x as fast as GNU on the faster macOS laptop (if still slowish). Appreciate your work on this. Neither of us likes a performance regression, but I'm still wondering if it isn't better to keep the memory usage low at the expense of out-and-out performance, because larger inputs are where this may matter (?). I'll push to my repo so you can take a look, if you care to. I still think there is an upside on the performance front to just caching the last 10-20 parse ops. May I ask -- why are you still holding onto the first and second selection within the Line struct proper? Is there a performance benefit?
I'm holding onto the first selection because I'm assuming that we'll usually only need to compare the first selection to find a difference between lines (and we always need to compare at least one!). Having the first selection directly on the struct is, I think, much faster than having it behind a pointer, but I'll check the exact numbers here. The other thing is a
Similarly, I really appreciate your suggestions; I'll have a look at your repo (though maybe not until next weekend).
I'd like to see a `/usr/bin/time -v` (BSD/macOS has different arguments) for several tools, to keep track of memory regressions. Many Rust buffered readers use unbounded memory.
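For example (the input file and path here are placeholders; `-v` is the GNU time flag on Linux):

```shell
# Generate a throwaway input, then record peak memory (max RSS) per tool.
seq 1 100000 | sort -R > /tmp/sort-input.txt

# GNU time (Linux): -v reports "Maximum resident set size" in KiB.
/usr/bin/time -v sort -n /tmp/sort-input.txt > /dev/null

# BSD/macOS time uses -l instead and reports max RSS in bytes:
# /usr/bin/time -l sort -n /tmp/sort-input.txt > /dev/null
```

Note that `/usr/bin/time` is the external binary, not the shell builtin `time`, which does not report memory.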
My latest, highly experimental attempt at improving sort performance can be found here: https://github.com/miDeb/coreutils/tree/ouroboros. Performance should be comparable to GNU sort for everything I tested, due to a few new (experimental) approaches. I hope I'll be able to split my changes into a number of incremental and merge-able PRs, but for now it's just an exploration into what could possibly be achieved. The biggest experimental change is likely the use of self-referential structs with a dependency on https://crates.io/crates/ouroboros, I guess.
I think I'm pretty satisfied with the current memory usage, although there could be some improvements for numeric sorting or sort keys (maybe using a more data-oriented approach). |
When I wrote #1996, I implemented this by pre-computing some values beforehand to speed up comparisons later. There are other cases where we rely on pre-computed values. While this can result in a speed improvement, it has the big disadvantage of increasing our overall memory usage. The way it is currently implemented also means that we pay a constant tax even for lines/files that don't need precomputed values. Because of this, I think we should not rely on precomputed values for our comparisons. We can later experiment with adding them back for cases where they bring significant improvements without impacting other use-cases.
Places where pre-computed values are used:

- numeric comparisons (`-h` and `-n`): The current algorithm expects pre-computed values to be cached. To achieve similar performance without caching, it needs to be rewritten. Expected performance impact: a small performance regression, but improvements for other use-cases and for already sorted files.
- general numeric comparisons: Parsed `f64` values are cached. Reverting this should be relatively straightforward and should only be a minor performance regression. See sort: Reduce memory overhead #2134 (comment); this may not be as useful as I initially thought.

Generally, the goal is to reduce the `Line` struct, which is currently 96 bytes big. This is due to the precomputed values for numeric string comparison and the indices for field support we have to reserve space for. The goal is to reduce this to just a `&str`, i.e. 16 bytes.

cc @electricboogie does this roughly sound ok to you? If so, I'll start implementing this in the next days.
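To make the trade-off concrete, here is a minimal sketch (not the actual uutils code) of what a cache-free general numeric comparison looks like: the `f64` is re-parsed on every comparison instead of being stored per line:

```rust
use std::cmp::Ordering;

// Cache-free comparison: parse on every call. This saves the per-line
// cache slot but re-does the parse O(n log n) times during a sort.
// (Error handling is simplified; real sort has rules for invalid numbers.)
fn general_numeric_cmp(a: &str, b: &str) -> Ordering {
    let fa = a.trim().parse::<f64>().unwrap_or(f64::NEG_INFINITY);
    let fb = b.trim().parse::<f64>().unwrap_or(f64::NEG_INFINITY);
    fa.partial_cmp(&fb).unwrap_or(Ordering::Equal)
}

fn main() {
    let mut lines = vec!["3.14", "-2e3", "10"];
    lines.sort_by(|a, b| general_numeric_cmp(a, b));
    assert_eq!(lines, vec!["-2e3", "3.14", "10"]);
}
```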