Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when Z Ordering a larger dataset #1459

Closed
MrPowers opened this issue Jun 13, 2023 · 2 comments · Fixed by #1461
Closed

Error when Z Ordering a larger dataset #1459

MrPowers opened this issue Jun 13, 2023 · 2 comments · Fixed by #1461
Labels
bug Something isn't working

Comments

@MrPowers
Copy link
Collaborator

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment:

  • Cloud provider: Localhost
  • OS: Macbook
  • Other: Macbook M1 with 64 GB of RAM

Bug

What happened: Z Order command worked on 5 GB h2o groupby dataset (1e8), but errors out of 50 GB dataset (1e9)

What you expected to happen: I expected the Z Ordering to work

How to reproduce it: This notebook shows the computations working well on the 1e8 dataset, but erroring out on the 1e9 dataset.

More details: I'm Z Ordering on a single column. Here's the error message:

thread 'tokio-runtime-worker' panicked at 'overflow', /Users/runner/.cargo/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/arrow-select-39.0.0/src/interleave.rs:172:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
DeltaError                                Traceback (most recent call last)
File <timed eval>:1

File ~/opt/miniconda3/envs/deltalake-0100/lib/python3.9/site-packages/deltalake/table.py:697, in TableOptimizer.z_order(self, columns, partition_filters, target_size, max_concurrent_tasks)
    675 def z_order(
    676     self,
    677     columns: Iterable[str],
   (...)
    680     max_concurrent_tasks: Optional[int] = None,
    681 ) -> Dict[str, Any]:
    682     """
    683     Reorders the data using a Z-order curve to improve data skipping.
    684 
   (...)
    695     :return: the metrics from optimize
    696     """
--> 697     metrics = self.table._table.z_order_optimize(
    698         list(columns), partition_filters, target_size, max_concurrent_tasks
    699     )
    700     self.table.update_incremental()
    701     return json.loads(metrics)

DeltaError: Generic error: task 334 panicked
@MrPowers MrPowers added the bug Something isn't working label Jun 13, 2023
@wjones127
Copy link
Collaborator

Got it. That's likely from this line:

https://github.com/apache/arrow-rs/blob/700bd334ad8be53455f5dd80023b6c8c237559a7/arrow-select/src/interleave.rs#L172

Which indicates that we have a binary column that is too large for StringArray. Maybe we can use large types for this? Or else interleave the results in chunks.

@MrPowers
Copy link
Collaborator Author

Here's a few rows of the data:

	id1	id2	id3	id4	id5	id6	v1	v2	v3
0	id016	id046	id0000109363	88	13	146094	4	6	18.837686
1	id039	id087	id0000466766	14	30	111330	4	14	46.797328
2	id047	id098	id0000307804	85	23	187639	3	5	47.577311
3	id043	id017	id0000344864	87	76	256509	2	5	80.462924
4	id054	id027	id0000433679	99	67	32736	1	7	15.796662

Here are the dtypes:

id1     object
id2     object
id3     object
id4      int64
id5      int64
id6      int64
v1       int64
v2       int64
v3     float64
dtype: object

Here's the code I ran: dt.optimize.z_order(["id1"])

Should have included these details in the initial bug report.

wjones127 added a commit that referenced this issue Jul 3, 2023
…1461)

# Description

Fixes the base implementation so that is doesn't materialize the entire
result in one record batch. It will still require materializing the full
input for each partition in memory. This is mostly a problem for
unpartitioned table, since that means materializing the entire table in
memory.

Adds a new datafusion-based implementation enabled by the `datafusion`
feature. In theory, this should support spilling to disk.

# Related Issue(s)

For example:

- closes #1459
- closes #1460

# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants