Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow wide table left outer join on local machine #301

Open
eredzik opened this issue Nov 20, 2024 · 4 comments
Open

Slow wide table left outer join on local machine #301

eredzik opened this issue Nov 20, 2024 · 4 comments

Comments

@eredzik
Copy link

eredzik commented Nov 20, 2024

I've tried using sail for local development of spark jobs. But running simple query on dataset that has size of few GBs makes sail slower than spark.
When join is not there then query runs within 10secs.

With join I can see in resource monitor that memory is rising whole time (2-3 minutes) and it seems like once all data is in memory then work is executed and released.
CPU and disk usage are very low during that time (barely nonexistent)

Query is as such in pseudocode:

df1 = session.read.parquet('big_dataset.parquet')\
smalldf = session.read.parquet('small.parquet')
df2 = df1.select(*some_columns)
df_res=df2.join(smalldf, on='somecol', how='left')
df_res.write.parquet('result_dataset.parquet')

Is there some bug or perhaps additional configuration option to make it run in partitions in similar way to how spark runs it?

@linhr
Copy link
Contributor

linhr commented Nov 20, 2024

Thanks for trying out Sail and reporting the issue!

It seems that all data are loaded before the join happens. Broadcast join can be helpful to keep the small dataset in memory while the big one is read in a streaming fashion. Would you mind sharing some statistics of the datasets (e.g. in terms of number of rows and columns)? We'll see if there is any optimization we can do here.

BTW did you install Sail via pip or build the package from source?

@eredzik
Copy link
Author

eredzik commented Nov 20, 2024

I've tried to call broadcast on smaller dataset but had some error - assumed it is not supported yet - will try tomorrow fixing possible code error on my part and include broadcast again.

Input dataset was quite wide with at least 200 columns and 6m rows.

I've installed it using pip, version 0.2.0.dev0

@shehabgamin shehabgamin changed the title Slow simple query on local machine Slow wide table left outer join on local machine Nov 21, 2024
@linhr
Copy link
Contributor

linhr commented Nov 21, 2024

Yeah we do not support using broadcast() to provide SQL hint yet. I mentioned broadcast join since I feel it could be the way we optimize such queries internally.

Could you share the output of df_res.explain() for both Sail and Spark? I'd like to compare the physical plans to see if there is any interesting difference. (Feel free to mask column names if they are sensitive.)

@eredzik
Copy link
Author

eredzik commented Nov 21, 2024

I have no way to share those plans here unfortunately.
One interesting thing I've found trying things out today is that runtime increases exponentially with amount of columns sourced from df1 (big).

I can try creating reproduction example using publicly available datasets in few next days :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants