Thanks for your pull request, @Darredevil! We are looking forward to reviewing it, and you should be hearing from a maintainer soon. Some tips to help speed things up:
Bear in mind that large or tricky changes may require multiple rounds of review and revision. Please see CONTRIBUTING.md for more information.

Force-pushed from bbe08b3 to a276e47
```
@@ -1974,6 +1975,11 @@ void clear(T : Value[Key], Value, Key)(T* aa)
    _aaClear(*cast(void **) aa);
}

void reserve(T : Value[Key], Value, Key)(ref T aa, size_t ndim)
{
    _aaReserve(cast(void **)&aa, typeid(Value[Key]), ndim);
```
You can use `typeid(T)` here since `T == Value[Key]`.
True; however, I used `Value[Key]` to be consistent with the rest of the file.
There are a few kinks we could iron out in this area, so @Darredevil let's go with the cleaner way here and we'll update the rest in good time.
```
@@ -1974,6 +1975,11 @@ void clear(T : Value[Key], Value, Key)(T* aa)
    _aaClear(*cast(void **) aa);
}

void reserve(T : Value[Key], Value, Key)(ref T aa, size_t ndim)
```
Could you please add an overload for T* as well?
Done
Ouch, do we really need this `T*` overload? After all, aren't we trying to move away from the ugly C-like pointer syntax?
Also, while you are at it, note that most AA methods still aren't `@safe` :/
@wilzbach yes, it needs to be there. Otherwise it operates inconsistently with other normal objects, which can call "members" through a pointer.
This is a special case because AAs are builtins. If they were fully library types, this wouldn't be needed.
sux but ok
Force-pushed from a276e47 to 7cceafc
Needs unit tests and documentation. I will do a complete review when those are there. Looks good so far!
Btw, please don't forget to open a PR at dlang.org for the spec update -> https://dlang.org/spec/hash-map.html
A design question on this -- what is the expectation we want for `reserve`? You could potentially reserve both the buckets and the data items if you preallocate the data elements into some sort of array or block, but that can have an effect on many things, including when a data item is collected by the GC. There are advantages to both ways, so it truly is a question of what we want, and not of what is obvious. Again, another reason to favor a configurable library type vs. a language type. Make sure the documentation is crystal clear about what is being reserved.
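To make the two options concrete, here is a minimal sketch of a simplified AA layout; `SimpleAA`, `Bucket`, and `nodePool` are hypothetical names for illustration, not the real rt/aaA.d structures:

```d
// Simplified sketch only, assuming a layout roughly like druntime's AA.
struct Bucket { size_t hash; void* entry; }

struct SimpleAA
{
    // (a) Reserving only this array avoids re-allocating/rehashing buckets,
    //     but each entry remains a separate GC allocation at insert time.
    Bucket[] buckets;

    // (b) Preallocating entries too would need a pool like this; the GC then
    //     keeps the whole pool alive until every slot has been handed out.
    void*[] nodePool;
}
```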
src/rt/aaA.d (Outdated)

```d
/// Reserve AA
extern (C) void _aaReserve(AA* aa, const TypeInfo_AssociativeArray ti, size_t ndim)
{
    // lazily alloc implementation
```
The dimension has to be a power of 2, so anything else should be rejected or rounded up.
Good catch, this actually saved me some headache I had with a few tests. Rounding up by default now. Thanks!
If the resize call is supposed to avoid rehashing for the given number of entries, you need to include the "grow ratio", too. The AA rehashes if the number of entries exceeds 4/5 of the dimension of the bucket array: https://github.com/Darredevil/druntime/blob/eb9865e63b1205e36f28c52aeffbb08d02f96b68/src/rt/aaA.d#L386.
@rainers that seems to explain the efficiency issue. Also, @Darredevil what happens if reserve is called with a smaller number than the current number of elements? Seems to me the correct code should be:

```d
if (ndim <= aa.dim) return;
if (aa.used * GROW_DEN > ndim * GROW_NUM) ndim = (aa.used * GROW_DEN + GROW_NUM - 1) / GROW_NUM;
assert(aa.used * GROW_DEN <= ndim * GROW_NUM);
ndim = nextpow2(ndim);
if (aa.impl is null)
    aa.impl = new Impl(ti, ndim);
else
    aa.resize(ndim);
```

Please verify my math.
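One data point for checking that math, assuming GROW_NUM = 4 and GROW_DEN = 5 (the 4/5 ratio described above): on an empty AA, `aa.used == 0`, so the load-factor branch never fires; the requested ndim itself also needs headroom, which is the gap the review further down picks up on:

```d
unittest
{
    enum GROW_NUM = 4, GROW_DEN = 5;   // assumed values, per the 4/5 ratio

    // reserve(1_000_000) on an empty AA: the aa.used branch does nothing,
    // so the code above just takes nextpow2(1_000_000) = 1_048_576.
    size_t ndim = 1 << 20;

    // Inserting the requested 1_000_000 entries would then exceed the 4/5
    // threshold (838_860), i.e. one rehash would still happen.
    assert(1_000_000 * GROW_DEN > ndim * GROW_NUM);
}
```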
FYI, 20 million rounds up to 33 million as a power of 2, and 20/33 < 4/5, so no rehashing should be happening.
Still a good idea to take the load factor into account.
Edit: this is in relation to the current tests, so what I mean in this context is that it doesn't "explain the efficiency issue".
Force-pushed from 7cceafc to eb9865e
@schveiguy regarding the design question, I noticed after several tests that the current implementation seems to run faster without a reserve() call. This code runs several hundred milliseconds slower with the reserve(). Would it be faster if we preallocated the data elements as well?
src/rt/aaA.d (Outdated)

```d
assert(aa[31_133] == 31_133);

foreach(i;0..20_000_000) aa[i] = i;
assert(aa[19_999_133] == 19_999_133);
```
These aren't actually testing the functionality of reserve, but just that reserve doesn't break the current functionality. You should check that `aa.impl.buckets` doesn't change at some point while filling. Something like:

```d
auto b = aa.impl.buckets;
assert(b.length >= 20_000_000);
// fill, rehashing won't happen
...
assert(aa.impl.buckets is b); // check both pointer and length still match
```
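A fleshed-out version of that check, sketched as a unittest inside rt/aaA.d; the cast below relies on an AA variable being a single pointer to its Impl, and is illustrative rather than the exact incantation the real tests use:

```d
unittest
{
    int[int] aa;
    aa.reserve(20_000_000);

    auto impl = *cast(Impl**) &aa;      // reach the runtime internals
    auto b = impl.buckets;
    assert(b.length >= 20_000_000);

    foreach (i; 0 .. 20_000_000)
        aa[i] = i;                      // fill; rehashing must not happen

    assert((*cast(Impl**) &aa).buckets is b); // pointer and length unchanged
}
```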
I can't see why it would be slower. Without reserve, it has to rehash several times during the insertion of elements. With reserve, it shouldn't rehash at all, as 20 million elements will round up to a dim size of 2^25, or 33 million. It shouldn't ever rehash while adding the elements. Perhaps the issue is that the GC scans take longer when you are allocating, because you have this humongous array of pointers?
Note that preallocating the elements would stop the GC scans during insertion, so if that is the issue, preallocation of elements would solve it.
Could be true; GC.disable() might be used to verify it. Maybe caching also has effects, i.e. inserting into the AA might be faster with a smaller bucket array, more than compensating for the eventual rehashing.
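A quick way to test that theory with core.memory (a sketch; the loop mirrors the tests above):

```d
import core.memory : GC;

void fillWithoutCollections(ref int[int] aa)
{
    GC.disable();                 // suppress collections during the timed loop
    scope(exit) GC.enable();
    foreach (i; 0 .. 20_000_000)
        aa[i] = i;
}
```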
It would move allocation and GC scan time into the reserve() call, but I suspect overall performance would not change.
Unless you preallocated as an array of elements. Then it's only one allocation and one potential scan (two if you preallocate the buckets).
Yes, but it would be a large memory block staying in memory unless all entries are removed and all references to it are gone. That's not very GC friendly. Please note that there can be "external" references to the values via …
src/object.d (Outdated)

```d
void reserve(T : Value[Key], Value, Key)(T* aa, size_t ndim)
{
    _aaReserve(cast(void **)aa, typeid(Value[Key]), ndim);
```
Let's rather use simple forwarding instead of writing the ugly cast twice. That way we only need to maintain the tricky code once:

```d
reserve(*aa, ndim);
```
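I.e., the pointer overload would collapse to this (sketch):

```d
void reserve(T : Value[Key], Value, Key)(T* aa, size_t ndim)
{
    reserve(*aa, ndim); // forward to the ref overload; one copy of the casts
}
```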
src/rt/aaA.d (Outdated)

```d
foreach(i;0..20_000_000) aa[i] = i;
assert(aa[19_999_133] == 19_999_133);
}
```
There should be a test that reserves less than the current length (which will probably fail with the current implementation).
Yes, I know. That is why I said earlier it "can have an effect on many things, including when a data item is collected by the GC." But the bottom line is: if reserve makes things slower, there's no point. There's only a point to reserve if it improves performance. The design issue with the current proposed PR is that it only preallocates the bucket array, which is something that isn't very costly already. The problem with preallocating the elements as individual allocations is that each of them must be stored somewhere so it is not collected, but without knowing where they will go (as that is based on the hash), you would have to store them in a temporary array. What we need is a way to "preallocate" from the GC, and then use that cache for inserting elements. Each element would be able to be collected once it's used, but before actually being inserted the cache would be considered one block. You could probably do it by manipulating the bits for the page as each one is consumed. That's significantly more work than just this PR.
If this partial solution is not helping, we should figure out whether there's enough need for a more elaborate solution, or scrap reserve() altogether. The proper solution would be for the hashtable to maintain a private freelist; then reserve(n) makes sure there are at least n elements in the freelist. Again, we need to make sure there are solid use cases for this complication. All, please advise.
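A minimal sketch of that direction; the names (ImplWithFreelist, reserveNodes) are hypothetical, not current aaA.d code:

```d
import core.memory : GC;

// Hypothetical node freelist threaded through the table's implementation.
struct Node { Node* next; /* key/value storage would follow */ }

struct ImplWithFreelist
{
    Node* freelist;
    size_t freeCount;

    // reserve(n) semantics: top up the freelist to at least n ready nodes.
    void reserveNodes(size_t n, size_t nodeSize)
    {
        while (freeCount < n)
        {
            auto node = cast(Node*) GC.malloc(nodeSize);
            node.next = freelist;
            freelist = node;
            ++freeCount;
        }
    }
}
```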
Force-pushed from 4b6127f to d1e0860
After taking into account the grow threshold, it seems there is a 20% speedup after running several tests on my machine. I would appreciate it if others could test this as well and confirm there is an increase in performance; example test code:
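A sketch along those lines, with illustrative sizes and a core.time-based timer:

```d
import core.time : MonoTime;
import std.stdio : writeln;

void main()
{
    enum N = 20_000_000;

    int[int] aa;
    aa.reserve(N);              // comment this line out for the baseline run

    immutable start = MonoTime.currTime;
    foreach (i; 0 .. N)
        aa[i] = i;
    writeln("insert: ", (MonoTime.currTime - start).total!"msecs", " ms");
}
```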
Thank you everyone for the assistance; please let me know if you agree with this implementation so we can merge it. Once merged, I'll create a PR for the docs as well.
I've got a slower machine, but I measured an average speedup of 500 ms over 10 tests :)
Force-pushed from d1e0860 to 3f9f357

Force-pushed from 3f9f357 to af6d479
I'll try it this evening (California time) on my standard benchmark tests. In case anyone wishes to try it, there's a documented test here: eBay TSV utils: Join Benchmark. Instructions for creating the test files are included, and there is a link to the source data file under the "Details" section. To try it, call … It will also be interesting to check the max GC pause time. Add the command line arg …
@jondegenhardt I tried to run your benchmark but ran into some issues when trying to use my local build of dmd/druntime/phobos with tsv-utils. Could you please provide the results from your machine?
tsv-join benchmark: My benchmarks show no material change. Very surprising. I need to take a more in-depth look to make sure I haven't made a mistake somewhere. Here are timings from three builds: this PR with no reserve; this PR with a reserve of 8 million (the AA uses 7 million in this case); and a build of 2.076 (no reserve):
There is a large flaw in your benchmark: GC tests largely depend on the state of the GC. In each test, the GC is in a very different state after having allocated some number of AA elements for the previous tests. I recommend instead compiling …
Edit: I realized the first "test" is just to check how long reserve takes.
tsv-join benchmark: Here are the GC profile stats. These vary more run-to-run; those below are the smallest "Grand total GC time" of two runs. Here also, no material difference. Like the raw timing numbers, this is a surprise and makes me question the validity. I'll take a look again this evening and see if I can identify anything. The other thing I might try is an LDC build; LDC builds run about twice as fast on this benchmark.
```d
/// Reserve AA
extern (C) void _aaReserve(AA* aa, const TypeInfo_AssociativeArray ti, size_t ndim)
{
    ndim = nextpow2(ndim);
```
The bucket size used should be large enough that if ndim elements are added (the ndim function arg value), the bucket array will not be grown further. However, the choices here don't achieve that. Specifically, the bucket array will be grown if it becomes 80% full (GROW_NUM/GROW_DEN). nextpow2(ndim) will not always meet this requirement; it will sometimes be necessary to move one power of two further. E.g. if the arg is 500, the bucket size selected should be 1024, not 512. I suggest testing for this after the `ndim = nextpow2(ndim)` line; if it doesn't meet the requirement, find the next power of two. That should handle the first two cases (`aa.impl is null`, `ndim <= aa.dim`). For the third case tested (`aa.used * GROW_DEN > ndim * GROW_NUM`), it is less clear to me when that would get triggered.
I think applying `ndim = (ndim * GROW_DEN + GROW_NUM - 1) / GROW_NUM;` before finding the next power of 2 should be good enough.
I'm not 100% sure, but you might also have to take the deleted entries into account when comparing with aa.dim or aa.used.
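Concretely, with the assumed GROW_NUM = 4 and GROW_DEN = 5, the 500-element example from above works out as:

```d
enum GROW_NUM = 4, GROW_DEN = 5;                        // assumed 4/5 load factor

unittest
{
    size_t ndim = 500;
    ndim = (ndim * GROW_DEN + GROW_NUM - 1) / GROW_NUM; // (500*5 + 3)/4 = 625
    // nextpow2(625) == 1024, not 512, so 500 entries stay under the threshold:
    assert(500 * GROW_DEN <= 1024 * GROW_NUM);          // 2500 <= 4096
}
```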
One thing to be cautious of is that reserve should never remove buckets. Due to the logic below, it won't, but if the logic is moved around, it could easily happen.
Regarding the power-of-two allocation sizes: best I can tell, all the existing bucket allocation code paths use nextpow2 to set the bucket size when allocating. Not clear if that is important. If it is, the `ndim = (aa.used * GROW_DEN + GROW_NUM - 1) / GROW_NUM;` line is not following it. However, if the power-of-two component is not important, there may be an opportunity with reserve. Assuming that a call to reserve is a hint about the expected element count, it might be better to choose the size based on fill ratio rather than power-of-two. For example, reserving 7 million entries will pick a power-of-two of 16777216, a 42% fill ratio. A 70% fill ratio of 10 million would save space and improve locality of reference.
Update: findSlotLookup and findSlotInsert identify the slot based on a mask created by `buckets.length - 1`, so it looks like power-of-two is expected. The code line in reserve should probably be changed to conform.
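For context, the mask trick those functions rely on only works when buckets.length is a power of two; a sketch:

```d
// With a power-of-two length, (length - 1) is an all-ones bit mask, making
// this equivalent to hash % length but much cheaper:
size_t slotIndex(size_t hash, size_t bucketsLength)
{
    return hash & (bucketsLength - 1);
}
```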
tsv-join benchmark: I was able to look into this a little, but I still need to examine further to understand why there is no material benefit. It was still re-allocating the bucket array one more time than necessary, but using a large enough initial size to avoid it doesn't produce a material change. I think I'll need to write a different benchmark to get a more accurate read.
tsv-join benchmark: Did some further investigation. I'm seeing something unexpected: using … Normally I would spend more time validating and investigating prior to reporting a result like this, but my time is quite fragmented right now; I wanted to give others an opportunity to try reproducing this before I get back to it. At this point I've run this enough times to be sure there is a real behavior change in my tests. However, it would be preferable to create a more specific test to ensure I haven't made a mistake, that there are no unintended side effects involved, etc. I will do this as soon as I can. My benchmark uses a …

Update: In a simplified version of this benchmark I'm seeing a 9-10% improvement in AA population time with no degradation in lookup time. This is with the DMD compiler.
tsv-join benchmark: I tried a number of standalone tests using … For some reason I don't yet understand, the full tsv-join benchmark is materially slower when it calls reserve. Same build, with the call to reserve conditioned on a run-time argument. However, another of my tools, tsv-uniq, is materially faster when it calls reserve first. There is some evidence that using reserve reduces GC impact, but the numbers reported have high variance; it would take more study to validate and quantify the deltas. In any case, the max pause times remain relatively long.

Another benefit of using reserve: it chooses a more appropriate power-of-two size for the bucket array. In the current scheme, the bucket array quadruples in size when it grows. This leaves a 50-50 chance of using the power-of-two that best fits the data size; otherwise it's 2 times larger than needed. With default allocation, 7 million keys results in a bucket array of 2**25 (33.5 million). Using …

My suggestion: make a decision about proceeding without worrying about the degradation in the full tsv-join benchmark. I'll keep trying to identify a simplified case, but whatever is going on appears to be somewhat more complicated.
Thanks for this work.
My understanding (correct me if wrong) is that what we implement now is reserving the bucket array but not the nodes. That is not fulfilling the contract above and improves performance only marginally. Even the description is difficult ("reserves some of the memory needed later but not all", etc.). We risk disappointing folks more than helping them. This should do what some code in std.experimental.allocator does, i.e. allocate relatively large blocks and then thread a freelist through them. A freelist would in fact help the hashtable implementation a whole lot. If that's deemed too difficult, let's close this and work on something more impactful.
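For reference, the block-plus-freelist technique mentioned would look roughly like this; a sketch in the style of std.experimental.allocator, not existing AA code, and nodeSize/count are placeholders:

```d
import core.memory : GC;

// Allocate one large block and thread a freelist through it.
// nodeSize must be at least (void*).sizeof so each node can hold a link.
void* threadFreelist(size_t nodeSize, size_t count)
{
    auto block = cast(ubyte*) GC.malloc(nodeSize * count);
    void* head = null;
    foreach_reverse (i; 0 .. count)
    {
        auto node = cast(void**)(block + i * nodeSize);
        *node = head;       // first word of each node points to the next
        head = node;
    }
    return head;            // insertions pop ready-made nodes off this list
}
```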
I'm inclined to agree. A 9% gain is material, but frankly, I was hoping for a good bit more. Relative to the other tools I wrote, those using AAs are quite a bit slower for the amount of work they are doing; that is, my assessment is that there is a fair bit of opportunity to improve AAs. The memory savings from selecting a more appropriate power-of-two size are probably more worthwhile than the speed gains, but there are other ways to achieve this. I was also hoping that the multi-second GC pauses would be further reduced. My assumption is that the large array of pointers in these AAs is a major contributor. Clearly an alternate approach is needed to address this.
In the interests of moving this along... To me the performance gain measured doesn't justify adding a method to the API. I recommend closing this. An alternate PR could certainly be put together in the future, incorporating this work if appropriate. I have a couple of follow-up questions:
After discussing the issue with @andralex in more depth, we decided to postpone this enhancement for now and revisit it later.
It's always worth exploring; we can guess what the reason is, but testing can reveal something we didn't think of. (My theory would be that there is a huge number of small blocks involved, with a large array that points at them, meaning you are following pointers all over the place, looking up each block as you process the large array, killing the cache.)
You pre-allocate the blocks that would be allocated by the AA. You would not preallocate the values themselves. Any string types would be allocated elsewhere; the AA does not duplicate them. It just stores the string reference itself.
https://issues.dlang.org/show_bug.cgi?id=2504
Following the suggestion from https://forum.dlang.org/post/[email protected]