[BUG] left_semi_join operation is abnormal and seriously time-consuming #569
Comments
This is definitely a bug. No GPU operation should take thousands of seconds as shown in the nsys trace. @chenrui17 does this happen every time you run this query or only sometimes? Does the scale factor of the TPC-DS data matter, or will this occur even with smaller scale factors?
It happens every time, and my TPC-DS scale factor is 1000.
I also tried two other scale factors. At scale factor 100, TPC-DS query 10 seems to execute normally, though the reduce phase still takes some time: the total query time is 14s, and the reduce phase takes 8s. Most tasks take 1s, but some still take 7s, as shown in the figure below. At scale factor 10000, query 10 fails to complete and gets stuck in the reduce phase mentioned above. So even with a small amount of data, this problem still exists.
I think I have been able to reproduce this issue. I ran query 10 with a SF 200 data set encoded in Parquet, with 8 CPU cores per GPU. With the plugin it can take over 50 seconds to run the semi-join. If I disable just LeftSemi as a join option, the query takes 6 seconds. If I run everything on the CPU, it completes in about 15 seconds. It produces the correct result in all cases, so it looks to be a performance issue of some kind. It actually looks scaling related: when I increase the number of workers I see a very large decrease in total runtime. It is almost as if the join being done is brute force somehow. I'll dig into the source for it and see if I can come up with something.
So I read through the code and got a profile too. It looks like building the hash table, just for the right table keys, is taking about 87% of the time, and probing the table is taking the other 13%. I really don't know why. I'll try to get a repro case and file something in cudf about this. @jlowe this looks bad enough that I think we want to disable both semi and anti joins (they share the same implementation), but I would like to hear your opinion on this.
I think we need a bit more detail on how it's slow before disabling outright. IIUC a semi/anti join should always be as fast or faster than a normal join in cudf. The build phase constructs a hash table just like a regular join, but instead of a multimap it's just a map. The lookup phase should also be as fast or faster, since it only cares about a key hit in the table and doesn't care about any value associated with it. It would be interesting to know if a regular join on the same input tables and same join keys is significantly faster.
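The build/probe structure described above can be sketched in a few lines of Python (a minimal illustration of the idea, not cudf's actual GPU implementation):

```python
def left_semi_join(left_keys, right_keys):
    # Build phase: a plain set is enough for a semi-join, since we only
    # care whether a key exists on the right, not which right rows it
    # maps to (a regular join would need a multimap key -> row indices).
    build = set(right_keys)
    # Probe phase: keep each left row index whose key hits the table.
    return [i for i, k in enumerate(left_keys) if k in build]

# Rows 0 and 2 of the left table have matches on the right.
print(left_semi_join(["a", "b", "c"], ["c", "a", "a"]))  # -> [0, 2]
```

Because the probe never dereferences a payload, the lookup is a pure membership test, which is why a semi-join is expected to be no slower than a regular hash join.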
So, how can I disable the semi (and anti) join? @revans2
@JustPlay I had to go in and change the code to disable just those joins. We don't have a config for that yet. @jlowe I was able to make some progress on this. I dumped the inputs to the join and was able to reproduce the issue in c++. It has to do with null values and null equality.
On the 200GB data set, when you get to the left semi join there are 513,298 null key values in the right table. If I say that nulls are equal, the slowdown goes away. When I do the join manually in Spark I am not able to reproduce it, because Spark automatically filters all nulls from both the left and right tables before doing the join. My guess as to why it is taking so long is that cudf sees nulls as not equal, so each time it sees one it inserts it into the table as a new entry; but because they all hash to the same thing, it has to walk the table looking for a free location. This then messes up the hash table enough that probing it takes a lot longer too. The fastest fix for this would be to filter out any nulls before we do the join. We should also file a follow-on issue with cudf to see if there is a better fix on their side, but I am off today, so if this can wait until Thursday I can work on it. If not, feel free to put in a fix yourself and we can work with cudf on this later.
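The suspected pathology can be demonstrated with a toy open-addressing table (a hypothetical model, not cudf's hash map): when every null key hashes to the same bucket but is stored as a distinct entry because nulls compare unequal, each insert must linearly probe past all previously inserted nulls, making the build phase quadratic in the number of nulls.

```python
def insert_all(num_entries, same_bucket):
    # Toy open-addressing hash table with linear probing; returns the
    # total number of probe steps needed to insert num_entries keys.
    table_size = num_entries * 2
    table = [None] * table_size
    probes = 0
    for i in range(num_entries):
        # same_bucket models nulls: all keys hash to slot 0 yet are
        # inserted as distinct entries (nulls treated as unequal).
        slot = 0 if same_bucket else (i * 2654435761) % table_size
        while table[slot] is not None:  # walk to a free location
            probes += 1
            slot = (slot + 1) % table_size
        table[slot] = i
    return probes

clustered = insert_all(5000, same_bucket=True)   # all keys collide
spread = insert_all(5000, same_bucket=False)     # well-spread hashes
print(clustered, spread)
```

With 5,000 colliding keys the clustered case needs millions of probe steps while the well-spread case needs almost none, which matches the observation that the build phase dominated the profile.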
@revans2 Thursday should be fine. I'm chasing some other things today, and you're already set up to reproduce the issue easily and can therefore verify the fix.
Agreed. I assume we'd only need to pre-filter the nulls in the build table for the hash table, thinking it would be faster for the streamed table rows to fail a hit in the hash table, which now no longer contains null entries, than it would be to manifest the non-null version of the streamed table prior to the join.
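The proposed workaround can be sketched as follows (a hypothetical helper for illustration, not the actual plugin code): drop nulls from the build (right) side only, and leave the streamed (left) side untouched, since a null streamed key simply misses in the now null-free table.

```python
def left_semi_join_prefiltered(left_keys, right_keys):
    # Pre-filter nulls from the build side only; no null-free copy of
    # the streamed side is materialized, a null key just fails to hit.
    build = {k for k in right_keys if k is not None}
    return [i for i, k in enumerate(left_keys) if k in build]

# Null keys never match under SQL join semantics, so the result is
# unchanged: only row 0 ("a") has a match on the right.
print(left_semi_join_prefiltered(["a", None, "c"], [None, "a", None]))  # -> [0]
```

This keeps the hash table small and collision-free while preserving SQL semantics, where a null join key matches nothing.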
At least q5, q10, q35, and q69 in TPC-DS have this problem.
Reopening because the fix broke correctness in some cases and had to be reverted for the 0.2 release. I'll try to get it fixed properly, but 0.2 is too close to try and do it in that time frame. |
Describe the bug
I tested TPC-DS query 10 with a recently committed build of rapids-0.2, and I found that during the reduce process after the shuffle, several tasks are very time-consuming. Using nsys, I found the time is spent in left_semi_join, so I suspect this is a bug.
Normally this query completes in dozens of seconds, but now these tasks take 5-8 minutes. At the same time, GPU utilization is very high during this process.