Slow wide table left outer join on local machine #301

eredzik · 2024-11-20T10:38:40Z

I've tried using sail for local development of spark jobs. But running simple query on dataset that has size of few GBs makes sail slower than spark.
When join is not there then query runs within 10secs.

With join I can see in resource monitor that memory is rising whole time (2-3 minutes) and it seems like once all data is in memory then work is executed and released.
CPU and disk usage are very low during that time (barely nonexistent)

Query is as such in pseudocode:

df1 = session.read.parquet('big_dataset.parquet')\
smalldf = session.read.parquet('small.parquet')
df2 = df1.select(*some_columns)
df_res=df2.join(smalldf, on='somecol', how='left')
df_res.write.parquet('result_dataset.parquet')

Is there some bug or perhaps additional configuration option to make it run in partitions in similar way to how spark runs it?

The text was updated successfully, but these errors were encountered:

linhr · 2024-11-20T11:42:42Z

Thanks for trying out Sail and reporting the issue!

It seems that all data are loaded before the join happens. Broadcast join can be helpful to keep the small dataset in memory while the big one is read in a streaming fashion. Would you mind sharing some statistics of the datasets (e.g. in terms of number of rows and columns)? We'll see if there is any optimization we can do here.

BTW did you install Sail via pip or build the package from source?

eredzik · 2024-11-20T18:58:22Z

I've tried to call broadcast on smaller dataset but had some error - assumed it is not supported yet - will try tomorrow fixing possible code error on my part and include broadcast again.

Input dataset was quite wide with at least 200 columns and 6m rows.

I've installed it using pip, version 0.2.0.dev0

linhr · 2024-11-21T06:08:58Z

Yeah we do not support using broadcast() to provide SQL hint yet. I mentioned broadcast join since I feel it could be the way we optimize such queries internally.

Could you share the output of df_res.explain() for both Sail and Spark? I'd like to compare the physical plans to see if there is any interesting difference. (Feel free to mask column names if they are sensitive.)

eredzik · 2024-11-21T10:47:06Z

I have no way to share those plans here unfortunately.
One interesting thing I've found trying things out today is that runtime increases exponentially with amount of columns sourced from df1 (big).

I can try creating reproduction example using publicly available datasets in few next days :)

shehabgamin changed the title ~~Slow simple query on local machine~~ Slow wide table left outer join on local machine Nov 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow wide table left outer join on local machine #301

Slow wide table left outer join on local machine #301

eredzik commented Nov 20, 2024

linhr commented Nov 20, 2024

eredzik commented Nov 20, 2024

linhr commented Nov 21, 2024

eredzik commented Nov 21, 2024

Slow wide table left outer join on local machine #301

Slow wide table left outer join on local machine #301

Comments

eredzik commented Nov 20, 2024

linhr commented Nov 20, 2024

eredzik commented Nov 20, 2024

linhr commented Nov 21, 2024

eredzik commented Nov 21, 2024