[BUG] Input partition differences using show() #882
Comments
I ran the query above and am seeing the same number of partitions in the stages.
OK, I tried reading a large file and I'm seeing 1 task on the CPU side and 71 on the GPU side just when I do a show().
This actually happens in local mode as well. It looks like it's related to the way we are handling limit. Plan on GPU:
Plan on CPU:
We have an explicit shuffle in the GPU plan, whereas the CPU plan does not.
This is because we do a local limit on all the partitions first, then shuffle, then a global limit, so we naturally run tasks on all the partitions to do the local limit. Spark itself does a special executeTake that starts with 1 partition, executes a job on that, and then scales up the number of partitions it reads on each iteration if it doesn't get the rows the limit needs.
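For contrast, here is a rough sketch of that incremental strategy. This is not Spark's actual executeTake code; the helper name and the fixed scale-up factor of 4 are illustrative only.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Rough sketch only -- not Spark's executeTake. Scan a single partition first,
// then grow the number of partitions scanned each round until `n` rows arrive.
def takeIncrementally[T: ClassTag](rdd: RDD[T], n: Int): Array[T] = {
  val results = ArrayBuffer.empty[T]
  val totalParts = rdd.getNumPartitions
  var partsScanned = 0
  var partsToScan = 1
  while (results.size < n && partsScanned < totalParts) {
    val stillNeeded = n - results.size
    val slice = partsScanned until math.min(partsScanned + partsToScan, totalParts)
    // Run one job over just this slice of partitions, taking only what is still needed.
    val collected = rdd.sparkContext.runJob(
      rdd, (it: Iterator[T]) => it.take(stillNeeded).toArray, slice)
    collected.foreach(results ++= _)
    partsScanned += slice.size
    partsToScan *= 4 // illustrative scale-up factor
  }
  results.take(n).toArray
}
```

Spark's own logic uses a configurable growth rate (spark.sql.limit.scaleUpFactor in recent versions), but the overall shape is the same: few tasks when the limit is satisfied early, more only when needed.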
Running this on a bunch of small files really shows a performance hit on the GPU side: if we just do a read and then a show(), my little test took 25 seconds on the GPU and less than 1 second (0.3 sec) on the CPU. CollectLimitExec is a bit tricky because there are two paths it can take. The first is when you actually pull data back to the driver, e.g. show(), limit().collect(), take(), etc.; these call CollectLimitExec.executeCollect(). The second is when the limit is not the last operation. I'm not sure of all the cases, but one I found is .limit(20).cache(); in that case it actually calls doExecute(), and the first time you execute after the cache (df.count()) it runs over all partitions on both CPU and GPU. The problem with executeCollect is that we have to be the first thing in the plan for it to get executed; otherwise Spark calls the default executeCollect, which runs all partitions, and we have a WholeStageCodegen that does the ColumnarToRow as the first thing, so it never gets called. I did a few tests:
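To make the two paths concrete, here is a small illustration; df is a hypothetical stand-in DataFrame built from spark.range, but any DataFrame would do.

```scala
// Hypothetical stand-in DataFrame for illustration.
val df = spark.range(0L, 1000000L).toDF("id")

// Path 1: the limit result is pulled straight back to the driver. When
// CollectLimitExec is the top node of the plan, these go through executeCollect().
df.limit(20).show()
df.limit(20).collect()
df.take(20)

// Path 2: the limit is not the last operation, so doExecute() runs instead.
// The first action after the cache (e.g. count()) runs over all partitions
// on both CPU and GPU.
val cached = df.limit(20).cache()
cached.count()
```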
Doing some more testing, beyond just a read and a show(), I haven't seen any worse performance from us not replacing CollectLimitExec. Spark is smart and just stops calling next() once it reaches the limit. For us that is at least one batch, which could potentially be millions of rows, but the test I ran pulled back 130,000 rows and I didn't see any performance difference. Another thing: reading, then sorting, then show() takes another path, through TakeOrderedAndProject, which we don't replace either (a small example of that path is sketched below). I'm tempted to just shut this off for now so we don't replace CollectLimitExec. There might be some cases where our batches are huge (millions of rows) and there is some difference, but I've never seen a case where this is hurting our performance.
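For reference, a small sketch of the plan difference between the two shapes mentioned, again using a hypothetical df with an id column:

```scala
val df = spark.range(0L, 1000000L).toDF("id")

// A plain limit pulled back to the driver typically plans a CollectLimit node:
df.limit(20).explain()

// Sorting first typically plans TakeOrderedAndProject instead, and show() on a
// sorted DataFrame goes down this path as well:
df.orderBy("id").limit(20).explain()
df.orderBy("id").show()
```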
Put up PR #900 to turn this off for now; we can leave this issue open to investigate better solutions.
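For anyone who wants to experiment with the replacement themselves, the plugin exposes per-exec on/off switches. The key below follows the plugin's spark.rapids.sql.exec.* pattern; verify the exact name and default in the configs documentation for your release.

```scala
// Per-exec toggle, following the plugin's spark.rapids.sql.exec.* pattern
// (check the generated configs doc for your spark-rapids version).
spark.conf.set("spark.rapids.sql.exec.CollectLimitExec", "false") // keep CollectLimit on the CPU
// set to "true" to let the plugin replace it again when experimenting
```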
So I was actually able to reproduce the different number of partitions on EMR on TPC-DS q38 with the CollectLimitExec replacement turned off on the GPU side. If we use the EMR file reader, the CPU gets a lot fewer partitions. If we change the reader back to the v2 reader, then we get the same number on the CPU as on the GPU.
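One generic way to switch to the DataSource V2 parquet reader in open-source Spark is to take parquet out of the v1 source list. Whether this is exactly how the reader was switched in the test above is an assumption; the EMR-optimized reader also has its own EMR-specific settings not shown here.

```scala
// Standard Apache Spark config; the default value of this list includes "parquet".
// Dropping "parquet" from it makes spark.read.parquet use the DataSource V2 reader.
spark.conf.set("spark.sql.sources.useV1SourceList", "avro,csv,json,kafka,orc,text")
spark.read.parquet("s3://some_bucket_path/partition-test").show() // placeholder path
```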
Closing as won't fix for now |
Describe the bug
While working with AWS EMR Spark, I noticed a difference in the number of input partitions for a simple query. I'm not sure this is a bug per se, as both the CPU and GPU queries appeared to execute correctly. However, I thought it might be worth investigating to see if there's a potential performance problem/opportunity for the plugin case.
Steps/Code to reproduce bug
Execute the following query and note the difference in the number of tasks in the first stage between a run on the CPU and a run on the GPU. Update some_bucket_path with an appropriate writable bucket path.
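The original query snippet was not preserved in this capture. The sketch below is a hypothetical stand-in consistent with the description (write a dataset to the bucket path, read it back, and show() it); the path, data size, and column name are placeholders, not the reporter's actual query.

```scala
import org.apache.spark.sql.functions.rand

// Placeholder path -- substitute an appropriate writable bucket path.
val path = "s3://some_bucket_path/partition-test"

// Write a dataset out, then read it back and show() it, comparing the number
// of tasks in the first stage between a CPU run and a GPU run.
spark.range(0L, 100000000L, 1L, 200)
  .withColumn("value", rand())
  .write.mode("overwrite").parquet(path)

spark.read.parquet(path).show()
```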
Expected behavior
The same number of input partitions (i.e., tasks in the first stage) appears regardless of a CPU vs. GPU run.
Environment details
AWS EMR w/ Spark 3.0