ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes #8330
Conversation
cc @nevi-me
I think the null bitmap should definitely match after the round trip (as it signals which rows are null, which should be preserved). I am not sure the actual data arrays have to be the same -- what needs to be the same is the data rows where the null_bit is 0 (aka we should probably be checking only for non-null rows)
FWIW I think the basic idea and structure of these tests is ✔️ -- with the possible exception of directly comparing the data values regardless of the null bit vector, I think the tests are good.
Force-pushed from acb50c9 to 91b7e3f
Adding some notes about why some tests are ignored/failing
}

#[test]
#[ignore] // Large Binary support isn't correct yet
This test currently fails because both LargeBinary and Binary Arrow types are converted to the Parquet BYTE_ARRAY type. I think this is what's intended, and I can't find any C++ tests that specifically exercise Arrow LargeBinary data, so I'm not sure whether it's just supposed to roundtrip and lose the "large" information, or whether it's supposed to panic as not supported on either writing or reading.
---- arrow::arrow_writer::tests::large_binary_single_column stdout ----
thread 'arrow::arrow_writer::tests::large_binary_single_column' panicked at 'assertion failed: `(left == right)`
left: `Schema { fields: [Field { name: "col", data_type: LargeBinary, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`,
right: `Schema { fields: [Field { name: "col", data_type: Binary, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`', parquet/src/arrow/arrow_writer.rs:712:9
I'm currently working on adding some of this support in C++. Based on Parquet metadata alone there is no great way to distinguish these types. The way C++ handles it: on serialization, the Arrow schema for the table is added to the file metadata (base64-encoded). When reading back, we deserialize it and use the types there to make an additional determination. Currently we only use it for extension types.
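For illustration, a minimal Rust sketch of that approach. The "ARROW:SCHEMA" key is what the C++ writer uses; ipc_serialize_schema is a hypothetical stand-in for the real IPC serialization, and the base64 crate is assumed:

use arrow::datatypes::Schema;

// Hypothetical stand-in for the real Arrow IPC serialization of a schema;
// a real implementation would emit the IPC flatbuffer encoding.
fn ipc_serialize_schema(schema: &Schema) -> Vec<u8> {
    format!("{:?}", schema).into_bytes() // placeholder only
}

// Build the footer key-value entry so readers can recover the original
// Arrow types (e.g. LargeBinary vs Binary, which both map to BYTE_ARRAY).
// "ARROW:SCHEMA" is the metadata key the C++ implementation uses.
fn schema_metadata_entry(schema: &Schema) -> (String, String) {
    let ipc_bytes = ipc_serialize_schema(schema);
    ("ARROW:SCHEMA".to_string(), base64::encode(ipc_bytes))
}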
In this case, the reader is supposed to get the correct type from the serialized schema, and then do the conversion to a large array
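A sketch of what that reader-side step could look like, assuming the cast kernel learns the Binary -> LargeBinary and Utf8 -> LargeUtf8 conversions (restore_logical_type is a hypothetical helper, not the crate's API):

use arrow::array::{Array, ArrayRef};
use arrow::compute::kernels::cast::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// After decoding a Parquet BYTE_ARRAY column as Binary, widen it back to
// the type recorded in the serialized Arrow schema, if they differ.
fn restore_logical_type(
    decoded: ArrayRef,
    schema_type: &DataType,
) -> Result<ArrayRef, ArrowError> {
    if decoded.data_type() == schema_type {
        Ok(decoded)
    } else {
        // e.g. Binary -> LargeBinary or Utf8 -> LargeUtf8; this relies on
        // the cast kernel supporting those conversions.
        cast(&decoded, schema_type)
    }
}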
}

#[test]
#[ignore] // Large String support isn't correct yet - null_bitmap and buffers don't match
This is actually the same issue as LargeBinary: the "Large" part of LargeUtf8 isn't coming back, and I'm not sure of the intended way to handle that.
---- arrow::arrow_writer::tests::large_string_single_column stdout ----
thread 'arrow::arrow_writer::tests::large_string_single_column' panicked at 'assertion failed: `(left == right)`
left: `Schema { fields: [Field { name: "col", data_type: LargeUtf8, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`,
right: `Schema { fields: [Field { name: "col", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`', parquet/src/arrow/arrow_writer.rs:712:9
May you please have a look at the function parquet_to_arrow_schema_by_columns and https://issues.apache.org/jira/browse/ARROW-10168. I suspect this might be why the Arrow metadata gets lost during the roundtrip.
Small nit: I'm pretty sure null_bit = 1 is actually where values should be compared (thinking of it as a validity bitmap instead of a null bitmap helps). And yes, I agree they should be compared separately, as there is generally no requirement on the values of null slots (generally, you'll want to initialize them to some known value to avoid security issues).
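A minimal sketch of that comparison strategy, using Int32Array as a stand-in for the generated test arrays (assert_valid_values_eq is a hypothetical helper, not part of this PR):

use arrow::array::{Array, Int32Array};

// Compare two arrays slot by slot, checking values only where the
// validity bitmap marks the slot valid (null_bit = 1); null slots may
// hold arbitrary values, so their payloads are skipped.
fn assert_valid_values_eq(left: &Int32Array, right: &Int32Array) {
    assert_eq!(left.len(), right.len());
    for i in 0..left.len() {
        // The validity bitmaps themselves must match after the roundtrip.
        assert_eq!(left.is_valid(i), right.is_valid(i));
        if left.is_valid(i) {
            assert_eq!(left.value(i), right.value(i));
        }
    }
}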
Hi @carols10cents, I've added some notes on why some of the tests are failing. I opened integer32llc#1. I ran out of time before I could look at the other ignored tests, but I suppose running out of time is why I need the help :)
);
}

// How to create Float16 values that aren't supported in Rust?
We can't create them, so we always leave f16 as unsupported.
}

#[test]
#[ignore] // Date support isn't correct yet
The failure caused by ArrowError("underlying Arrow error: Compute error: Casting from Date64(Millisecond) to Int32 not supported") can be addressed by adding a cast in arrow::compute::kernels::cast. We fail to cast the data at arrow_writer line 197. The main reason why some casts aren't supported is that I was worried about type loss if one converts something like date32 (number of days since the epoch) to int64 and then back to date64 (number of milliseconds since the epoch). Such a chained cast should be equivalent to casting date32 directly to date64, which I think it currently isn't.
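For concreteness, a sketch of the lossless direction under those definitions: date32 holds days since the epoch and date64 holds milliseconds, so widening multiplies by 86,400,000 (date32_to_date64 is an illustrative helper, not the cast kernel itself):

use arrow::array::{Array, Date32Array, Date64Array};

const MILLIS_PER_DAY: i64 = 86_400_000;

// Widen days-since-epoch (Date32) to milliseconds-since-epoch (Date64).
// The reverse direction truncates sub-day precision, which is the
// type-loss concern described above.
fn date32_to_date64(days: &Date32Array) -> Date64Array {
    (0..days.len())
        .map(|i| {
            if days.is_valid(i) {
                Some(days.value(i) as i64 * MILLIS_PER_DAY)
            } else {
                None // preserve nulls
            }
        })
        .collect()
}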
}

#[test]
#[ignore] // Large Binary support isn't correct yet
In this case, the reader is supposed to get the correct type from the serialized schema, and then do the conversion to a large array
fn struct_single_column() {
    let a_values = Int32Array::from(vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
    let struct_field_a = Field::new("f", DataType::Int32, false);
    let s = StructArray::from(vec![(struct_field_a, Arc::new(a_values) as ArrayRef)]);
The cause might be this; see https://issues.apache.org/jira/browse/ARROW-5408.
There was a time ~2 years ago when null slots would contain garbage/unknown values, but I haven't seen that in a very long time. @emkornfield, in the above example, does the C++ Arrow implementation allow creating such a struct? Asked differently: is this some validation we should be adding at the Arrow level in Rust, that if you create a nullable struct, you can't have its children be non-nullable?
Force-pushed from 16f95e9 to 7f743c2
I rebased the parquet branch against the main one, but now I seem to have picked up additional commits on this PR. They should be fixed by a rebase. @carols10cents, may you please grant me write permission on Integer32's fork, so I can rebase there.
@nevi-me you should have an invite!
Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is because of a panic) or `#[ignore]` (if the reason they fail is because the values don't match).
Force-pushed from 91b7e3f to cfd366b
@nevi-me I went ahead and merged in your PR, rebased this PR's branch on rust-parquet-arrow-writer, added one commit to explicitly not support Float16 in the …
I'll merge this in my morning (GMT+2). I'm happy with the tests; they also help with breaking further work into units so others can help us out. I'll open JIRAs for the cases that still fail.
@carols10cents may you please create a JIRA account at https://issues.apache.org, so that we can assign the tasks that you've worked on to you. Thanks :)
Hi @andygrove @jorgecarleitao, may you please kindly merge this for me? The merge tool's failing to push to GH on my end :(
…r all supported Arrow DataTypes Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each the supported data types and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is because of a panic) or `#[ignore]` (if the reason they fail is because the values don't match). I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure that this is appropriate or not because for some data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not! Closes #8330 from carols10cents/roundtrip-tests Lead-authored-by: Carol (Nichols || Goulding) <[email protected]> Co-authored-by: Neville Dipale <[email protected]> Signed-off-by: Andy Grove <[email protected]>
@nevi-me @carols10cents I ran the merge tool and the changes appear to have been committed to the branch, but this PR did not get closed, so I am closing it manually.
It's been the case since I started merging on the branch. I've been meaning to open a JIRA for this, but keep forgetting. I've opened https://issues.apache.org/jira/browse/ARROW-10198
…m Parquet metadata when available @nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <[email protected]> Co-authored-by: Neville Dipale <[email protected]> Signed-off-by: Neville Dipale <[email protected]>
Note that this PR goes to the rust-parquet-arrow-writer branch, not master.
Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc
These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each of the supported data types and some of the unsupported ones.
Tests that currently fail are either marked with #[should_panic] (if the reason they fail is because of a panic) or #[ignore] (if the reason they fail is because the values don't match).
I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate, because for some data types the null_bitmap isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not!
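For reference, a minimal sketch of the roundtrip pattern these tests follow, written against the crate's ArrowWriter and ParquetFileArrowReader APIs; constructor details shifted between releases, so treat it as illustrative rather than exact:

use std::{fs::File, sync::Arc};

use arrow::record_batch::RecordBatch;
use parquet::arrow::{arrow_writer::ArrowWriter, ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

// Write a RecordBatch to a Parquet file, read it back, and return the
// first batch so the caller can compare schemas, null bitmaps, and
// values against the original. Error handling is elided with unwrap().
fn roundtrip(path: &str, batch: RecordBatch) -> RecordBatch {
    let file = File::create(path).unwrap();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    let reader = SerializedFileReader::new(File::open(path).unwrap()).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
    let mut batches = arrow_reader.get_record_reader(1024).unwrap();
    batches.next().unwrap().unwrap()
}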