Compression when writing bools with parquet #2579

Closed
stress-tess opened this issue Jul 14, 2023 · 4 comments · Fixed by #2601
Labels: bug (Something isn't working), File IO (Arkouda file IO capabilities)

Comments

@stress-tess
Member

While working on #2539, I found that we don't seem to have compression support for writing boolean arrays with Parquet. We should check whether pyarrow supports this and, if it does, figure out how to add it.

@Ethan-DeBandi99 added the File IO (Arkouda file IO capabilities) label Jul 17, 2023
@Ethan-DeBandi99
Contributor

I verified that pyarrow does support writing bool values with compression. By default, pyarrow uses RLE encoding, which is what Arkouda uses as well.

The specific error we are receiving is:

RuntimeError: problem writing to file 1 errors: Error: ParquetError 457 coforall_fn:ParquetMsg Not yet implemented: Selected encoding is not supported.

@Ethan-DeBandi99
Contributor

Ethan-DeBandi99 commented Jul 20, 2023

I was able to get the write to work properly by removing the use of RLE encoding.

To fix this, I think we should check whether the column is of type bool and set RLE encoding only if it is not.
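
The proposed check could look roughly like the sketch below. It is written in Python purely for illustration: `encoding_for`, the dtype strings, and the column names are hypothetical and do not correspond to Arkouda's actual writer code.

```python
from typing import Dict, Optional

# Hypothetical constant for illustration only.
RLE = "RLE"


def encoding_for(dtype: str) -> Optional[str]:
    """Return the encoding to request explicitly, or None to let the writer choose."""
    if dtype == "bool":
        # Skip the explicit RLE request for bool columns, since that is
        # what triggers the "Selected encoding is not supported" error.
        return None
    return RLE


# Example schema (assumed): column name -> dtype string.
columns: Dict[str, str] = {"ids": "int64", "flags": "bool", "scores": "float64"}
requested = {name: encoding_for(dtype) for name, dtype in columns.items()}

for name, encoding in requested.items():
    print(name, "->", encoding if encoding is not None else "writer default")
```

Returning `None` for bool leaves the choice to the writer, which is also the behavior you get for every column if the explicit encoding is dropped entirely.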

@Ethan-DeBandi99
Contributor

One thing to note is that we may have to remove RLE encoding altogether for the multi-column case. I am going to look into ways to avoid this if possible. However, it does appear that if no encoding is explicitly set, one will be inferred, so we may want to just switch to this in all cases.

@Ethan-DeBandi99
Contributor

I ran a series of tests this morning and confirmed that when no encoding is specified, Parquet automatically selects the "best" encoding: file sizes with an explicit encoding are identical to those without one. I therefore think it is best to remove the explicit RLE setting from our code, since that is what causes the issue for bool values. The only difference in file sizes comes from the compression.

@Ethan-DeBandi99 self-assigned this Jul 21, 2023
@Ethan-DeBandi99 added the bug (Something isn't working) label Jul 21, 2023
github-merge-queue bot pushed a commit that referenced this issue Jul 21, 2023
* Fix for writing Bool values with compression.

* Cleaning up old code and comments.

* Correcting formatting
2 participants