
ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes #8330

Conversation

carols10cents (Contributor)

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

These perform a round trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each of the supported data types, plus some of the unsupported ones.

Tests that currently fail are marked with either #[should_panic] (if they fail
because of a panic) or #[ignore] (if they fail because the values don't match).

I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate because, for some data types, the null_bitmap doesn't match, and I'm not sure whether it's supposed to.

So I would love advice on that front, and I would love to know whether these tests are useful!
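For reference, each test follows roughly this shape (a minimal sketch, assuming the branch's ArrowWriter and ParquetFileArrowReader APIs plus a tempfile dev-dependency; exact constructors, reader types, and the data comparison may differ from the actual PR code):

```rust
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

#[test]
fn int32_single_column_roundtrip_sketch() {
    // Build a single-column RecordBatch of the type under test.
    let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Int32, false)]));
    let values = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let expected = RecordBatch::try_new(schema.clone(), vec![Arc::new(values)]).unwrap();

    // Write the batch out to a temporary Parquet file...
    let file = tempfile::tempfile().unwrap();
    let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), schema, None).unwrap();
    writer.write(&expected).unwrap();
    writer.close().unwrap();

    // ...then read it back into Arrow.
    let reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
    let actual = arrow_reader
        .get_record_reader(1024)
        .unwrap()
        .next()
        .unwrap()
        .unwrap();

    // Compare schema and column data before and after the round trip.
    assert_eq!(expected.schema(), actual.schema());
    assert_eq!(expected.column(0).data(), actual.column(0).data());
}
```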

@carols10cents (Contributor Author)

cc @nevi-me

@alamb (Contributor) commented Oct 2, 2020

I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate because, for some data types, the null_bitmap doesn't match, and I'm not sure whether it's supposed to.

I think the null bitmap should definitely match after the round trip (as it signals which rows are null, which should be preserved).

I am not sure the actual data arrays have to be the same -- what needs to be the same is the data rows where the null_bit is 0 (aka we should probably be checking only for non-null rows)

@alamb (Contributor) left a comment

FWIW I think the basic idea and structure of these tests is ✔️ -- with the possible exception of directly comparing the data values regardless of the null bit vector, I think the tests are good.

@carols10cents (Contributor Author) left a comment

Adding some notes about why some tests are ignored/failing

}

#[test]
#[ignore] // Large Binary support isn't correct yet
@carols10cents (Contributor Author):

This test currently fails because both LargeBinary and Binary Arrow types are converted to Parquet BYTE_ARRAY types. I think this is what's intended, and I can't find any C++ tests that specifically exercise Arrow LargeBinary data, so I'm not sure if it's just supposed to roundtrip and lose the "large" information or if it's supposed to panic as not supported either on writing or reading.

---- arrow::arrow_writer::tests::large_binary_single_column stdout ----
thread 'arrow::arrow_writer::tests::large_binary_single_column' panicked at 'assertion failed: `(left == right)`
  left: `Schema { fields: [Field { name: "col", data_type: LargeBinary, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`,
 right: `Schema { fields: [Field { name: "col", data_type: Binary, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`', parquet/src/arrow/arrow_writer.rs:712:9

Contributor:

I'm currently working on adding some of this support in C++. Based on the Parquet metadata alone, there isn't a great way to distinguish these types. The way C++ handles it is that, on serialization, the Arrow schema for the table is added as file metadata (base64-encoded). When reading back, we deserialize it and use the types there to make an additional determination. Currently we only use it for extension types.
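In Rust terms, a rough sketch of that idea might look like the following (the arrow_schema_to_ipc_bytes helper is hypothetical and left unimplemented, and the base64 crate is an assumed dependency; C++ stores the IPC-serialized schema under the "ARROW:schema" key):

```rust
use arrow::datatypes::Schema;

/// Hypothetical helper: serialize the Arrow schema to its IPC message bytes.
fn arrow_schema_to_ipc_bytes(_schema: &Schema) -> Vec<u8> {
    unimplemented!("would emit the Arrow IPC schema message")
}

/// Build the key/value pair that would go into the Parquet footer's
/// key_value_metadata so a reader can recover the original Arrow types.
fn schema_metadata_entry(schema: &Schema) -> (String, String) {
    let ipc_bytes = arrow_schema_to_ipc_bytes(schema);
    ("ARROW:schema".to_string(), base64::encode(&ipc_bytes))
}
```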

Contributor:

In this case, the reader is supposed to get the correct type from the serialized schema, and then do the conversion to a large array
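A sketch of what that reader-side conversion could look like, assuming the cast kernel supports (or is taught to support) widening Binary/Utf8 to their Large counterparts; the function is illustrative, not existing crate code:

```rust
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

/// After decoding a Parquet BYTE_ARRAY column as Binary/Utf8, widen it back to
/// the Large variant when the serialized Arrow schema says that was the original type.
fn maybe_widen(decoded: ArrayRef, original_type: &DataType) -> Result<ArrayRef> {
    match original_type {
        DataType::LargeBinary | DataType::LargeUtf8 => cast(&decoded, original_type),
        _ => Ok(decoded),
    }
}
```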

}

#[test]
#[ignore] // Large String support isn't correct yet - null_bitmap and buffers don't match
@carols10cents (Contributor Author):

This is actually the same as LargeArray -- the "Large" part of LargeUtf8 isn't coming back, and I'm not sure of the intended way to handle that.

---- arrow::arrow_writer::tests::large_string_single_column stdout ----
thread 'arrow::arrow_writer::tests::large_string_single_column' panicked at 'assertion failed: `(left == right)`
  left: `Schema { fields: [Field { name: "col", data_type: LargeUtf8, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`,
 right: `Schema { fields: [Field { name: "col", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`', parquet/src/arrow/arrow_writer.rs:712:9

Contributor:

Could you please have a look at the function parquet_to_arrow_schema_by_columns and https://issues.apache.org/jira/browse/ARROW-10168? I suspect this might be why the Arrow metadata gets lost during the roundtrip.

@emkornfield (Contributor)

I am not sure the actual data arrays have to be the same -- what needs to be the same is the data rows where the null_bit is 0 (aka we should probably be checking only for non-null rows)

Small nit: I'm pretty sure null_bit = 1 is actually where values should be compared (thinking of it as a validity bitmap instead of a null bitmap helps). And yes I agree they should be compared separately, as there is generally no requirement on the values of null fields (generally, you'll want to initialize to some known values to avoid security issues).
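Concretely, the comparison could look something like this sketch for a primitive array (is_valid reads the validity bitmap):

```rust
use arrow::array::{Array, Int32Array};

/// Assert that two arrays agree on validity, and on values only where the
/// validity bit is 1; null slots may hold arbitrary values and are not compared.
fn assert_valid_slots_equal(expected: &Int32Array, actual: &Int32Array) {
    assert_eq!(expected.len(), actual.len());
    for i in 0..expected.len() {
        assert_eq!(expected.is_valid(i), actual.is_valid(i));
        if expected.is_valid(i) {
            assert_eq!(expected.value(i), actual.value(i));
        }
    }
}
```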

@nevi-me (Contributor) left a comment

Hi @carols10cents, I've added some notes on why some of the tests are failing. I opened integer32llc#1. I ran out of time; I could have looked at the other ignored tests, but I suppose running out of time is why I need the help :)

);
}

// How to create Float16 values that aren't supported in Rust?
Contributor:

We can't create them, so we always leave f16 as unsupported

}

#[test]
#[ignore] // Date support isn't correct yet
Contributor:

The failure due to ArrowError("underlying Arrow error: Compute error: Casting from Date64(Millisecond) to Int32 not supported") can be addressed by adding a cast in arrow::compute::kernels::cast.

We fail to cast the data at arrow_writer::197. The main reason some casts aren't supported is that I was worried about type loss if one converts something like date32 (number of days since the epoch) to int64 and then back to date64 (number of milliseconds since the epoch). Such a conversion should be equivalent to a direct date32 to date64 cast, which I think it currently isn't.
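To illustrate the potential loss: date32 counts days since the epoch while date64 counts milliseconds, so a correct conversion scales the value rather than reinterpreting it (a small standalone illustration, not the kernel code):

```rust
/// Date32 stores days since the Unix epoch; Date64 stores milliseconds since it.
const MILLIS_PER_DAY: i64 = 24 * 60 * 60 * 1000;

fn date32_to_date64(days_since_epoch: i32) -> i64 {
    days_since_epoch as i64 * MILLIS_PER_DAY
}

fn main() {
    // 18_000 days after 1970-01-01 is 1_555_200_000_000 ms, not 18_000 ms,
    // which is the kind of loss a naive date32 -> int -> date64 path risks.
    assert_eq!(date32_to_date64(18_000), 1_555_200_000_000);
}
```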

}

#[test]
#[ignore] // Large Binary support isn't correct yet
Contributor:

In this case, the reader is supposed to get the correct type from the serialized schema, and then do the conversion to a large array

fn struct_single_column() {
let a_values = Int32Array::from(vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
let struct_field_a = Field::new("f", DataType::Int32, false);
let s = StructArray::from(vec![(struct_field_a, Arc::new(a_values) as ArrayRef)]);
Contributor:

The cause might be this. See https://issues.apache.org/jira/browse/ARROW-5408.

@nevi-me (Contributor) commented Oct 3, 2020

And yes I agree they should be compared separately, as there is generally no requirement on the values of null fields (generally, you'll want to initialize to some known values to avoid security issues).

There was a time ~2 years ago when null slots would contain garbage/unknown values, but I haven't seen that in a very long time.
With nested arrays, we'll also want to compare the data directly, as a struct with null slots containing a primitive that's non-nullable would result in a nullable primitive after the roundtrip.

@emkornfield, in the above example, does the C++ Arrow implementation allow creating such a struct? Asked differently: is it some validation that we should be adding at the Arrow level in Rust, that if you create a nullable struct, you can't have its children be non-nullable?

@nevi-me force-pushed the rust-parquet-arrow-writer branch from 16f95e9 to 7f743c2 on October 3, 2020 07:44
@nevi-me (Contributor) commented Oct 3, 2020

I rebased the parquet branch against the main one, but now I seem to have picked up additional commits on this PR. They should be fixed by a rebase. @carols10cents, could you please grant me write permission on Integer32's fork so I can rebase there?

@carols10cents (Contributor Author)

@nevi-me you should have an invite!

carols10cents and others added 3 commits October 5, 2020 14:59
@carols10cents (Contributor Author)

@nevi-me I went ahead and merged in your PR, rebased this PR's branch on rust-parquet-arrow-writer, added one commit to explicitly not support Float16 in the match in write_leaves, and force pushed. Do you think this can be merged or do more of the ignored tests need to be passing? I'm going to take a look at reading the schema metadata next, in any case.

@nevi-me (Contributor) commented Oct 6, 2020

@nevi-me I went ahead and merged in your PR, rebased this PR's branch on rust-parquet-arrow-writer, added one commit to explicitly not support Float16 in the match in write_leaves, and force pushed. Do you think this can be merged or do more of the ignored tests need to be passing? I'm going to take a look at reading the schema metadata next, in any case.

I'll merge this in my morning (GMT+2). I'm happy with the tests; they also help with unitising further work so that others can help us out. I'll open JIRAs for the cases that still fail.

@nevi-me changed the title from "[Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes" to "ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes" on Oct 6, 2020
@nevi-me (Contributor) commented Oct 6, 2020

@carols10cents, could you please create a JIRA account at https://issues.apache.org so that we can assign the tasks you've worked on to you? Thanks :)


@nevi-me (Contributor) commented Oct 6, 2020

Hi @andygrove @jorgecarleitao, could you please merge this for me? The merge tool is failing to push to GH on my end :(

@carols10cents (Contributor Author)

@nevi-me Done! https://issues.apache.org/jira/secure/ViewProfile.jspa?name=carols10cents

andygrove pushed a commit that referenced this pull request Oct 6, 2020
…r all supported Arrow DataTypes

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
@andygrove (Member) commented Oct 6, 2020

@nevi-me @carols10cents I ran the merge tool and the changes appear to have been committed to the branch but this PR did not get closed, so I am closing it manually.

@andygrove andygrove closed this Oct 6, 2020
@nevi-me (Contributor) commented Oct 6, 2020

@nevi-me @carols10cents I ran the merge tool and the changes appear to have been committed to the branch but this PR did not get closed, so I am closing it manually.

It's been the case since I started merging on the branch. I've been meaning to open a JIRA for it, but keep forgetting. I've now opened https://issues.apache.org/jira/browse/ARROW-10198.

nevi-me added a commit that referenced this pull request Oct 7, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>