Slow wide table left outer join on local machine #301
Thanks for trying out Sail and reporting the issue! It seems that all data is loaded before the join happens. A broadcast join can help here: the small dataset is kept in memory while the big one is read in a streaming fashion. Would you mind sharing some statistics about the datasets (e.g. the number of rows and columns)? We'll see if there is any optimization we can do here. BTW, did you install Sail via …?
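To illustrate the idea behind a broadcast join (this is a conceptual sketch in plain Python, not Sail's actual implementation), the small side is built into an in-memory hash table and the big side is streamed row by row, so the big table never has to be fully materialized:

```python
def broadcast_left_join(big_rows, small_rows, key):
    """Broadcast hash join sketch: build a hash table on the small side,
    then stream the big side without materializing it."""
    # Build phase: the small table is assumed to fit in memory.
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    # Probe phase: big_rows can be a generator streamed from disk.
    for row in big_rows:
        matches = table.get(row[key])
        if matches:
            for m in matches:
                merged = dict(row)
                merged.update({k: v for k, v in m.items() if k != key})
                yield merged
        else:
            # Left outer join semantics: keep unmatched left rows.
            yield dict(row)
```

Peak memory is proportional to the small side only, which is why the hint matters when one table is a few GBs and the other is small.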
I've tried to call broadcast on the smaller dataset but got an error. I assumed it is not supported yet; I'll try again tomorrow, fix any possible code error on my part, and include broadcast again. The input dataset was quite wide, with at least 200 columns and 6M rows. I installed it using pip, version 0.2.0.dev0.
Yeah, we do not support using `broadcast` yet. Could you share the output of …?
I have no way to share those plans here, unfortunately. I can try creating a reproduction example using publicly available datasets in the next few days :)
I've tried using Sail for local development of Spark jobs, but running a simple query on a dataset a few GBs in size makes Sail slower than Spark.
Without the join, the query runs within 10 seconds.
With the join, I can see in the resource monitor that memory keeps rising the whole time (2-3 minutes); it seems that once all data is in memory, the work is executed and the memory is released.
CPU and disk usage are very low during that time (barely any activity).
The query, in pseudocode, is:
Is there some bug, or perhaps an additional configuration option, to make it run in partitions similar to how Spark runs it?
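The partitioned execution asked about here can be sketched as a grace hash join: both inputs are first split into partitions by hashing the join key, and each partition pair is then joined independently, so only one partition has to fit in memory at a time. This is a conceptual sketch in plain Python with made-up function names, not Sail's or Spark's actual code:

```python
def partitioned_left_join(left_rows, right_rows, key, num_partitions=4):
    """Grace hash join sketch: partition both sides by hash(key), then
    join each partition pair independently, so peak memory is bounded
    by the largest partition rather than the whole input."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    # Partition phase (a real engine would spill these to disk).
    for row in left_rows:
        left_parts[hash(row[key]) % num_partitions].append(row)
    for row in right_rows:
        right_parts[hash(row[key]) % num_partitions].append(row)
    # Join phase: one partition pair in memory at a time.
    results = []
    for lp, rp in zip(left_parts, right_parts):
        table = {}
        for row in rp:
            table.setdefault(row[key], []).append(row)
        for row in lp:
            matches = table.get(row[key])
            if matches:
                for m in matches:
                    merged = dict(row)
                    merged.update({k: v for k, v in m.items() if k != key})
                    results.append(merged)
            else:
                results.append(dict(row))  # left outer: keep unmatched rows
    return results
```

Matching rows can only land in the same partition pair, which is what lets each pair be joined in isolation; this is the same reason a shuffle join in Spark bounds per-task memory.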