Basic LIMIT statements require getitem calls on each partition if the table is persisted, but not if it isn't. The RAL is identical in both scenarios, so I'm sharing screenshots of the task graph instead.
For some context, dask-sql has a special codepath that computes only the first partition's length to check whether it has enough rows to satisfy the LIMIT clause and return a result early. This codepath is limited to simpler task graphs (blockwise and IO operations in dask, or opted into explicitly via the sql.limit.check-first-partition config added in #696), since the check can be expensive and duplicate compute for complex graphs that include operations such as joins.
For all other cases we default to computing df.head across all partitions with npartitions=-1 (ref).
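To make the two paths concrete, here's a minimal sketch of the overall flow, assuming a hypothetical `apply_limit` helper (the real dask-sql logic also inspects the graph to decide whether the first-partition check is safe and cheap enough to attempt):

```python
import dask.dataframe as dd

def apply_limit(df: dd.DataFrame, n: int) -> dd.DataFrame:
    # Fast path: count only the rows in the first partition. If it already
    # covers the LIMIT, take the head from that single partition and never
    # touch the remaining partitions.
    if len(df.partitions[0]) >= n:
        return df.head(n, npartitions=1, compute=False)
    # Fallback: let dask gather rows from as many partitions as needed.
    return df.head(n, npartitions=-1, compute=False)
```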
For the persisted case we can add a check to the same fast codepath mentioned above. The cost of this check is negligible for persisted dataframes, while the task graph suggests there is real overhead in computing head across multiple partitions even when the data is already persisted in memory.
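One rough way to gate that check, as a heuristic sketch rather than whatever dask-sql would actually ship: after `.persist()`, a collection's graph collapses to roughly one materialized key per partition, so graph size alone can serve as a cheap proxy:

```python
def looks_persisted(df) -> bool:
    # Heuristic, not an official dask API for detecting persistence:
    # a persisted collection's graph holds about one concrete key per
    # partition, with no pending compute tasks.
    return len(df.__dask_graph__()) == df.npartitions
```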
When persisted, the task stream (after the read_csv ops) looks quite different.