
ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes #8330

Conversation

carols10cents (Contributor)

Note that this PR goes to the rust-parquet-arrow-writer branch, not master.

Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc

These perform a round trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each of the supported data types, plus some of the unsupported ones.

Tests that currently fail are marked with either #[should_panic] (if they fail
because of a panic) or #[ignore] (if they fail because the values don't match).

I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate because, for some data types, the null_bitmap doesn't match, and I'm not sure whether it's supposed to.

So I would love advice on that front, and I would love to know whether these tests are useful!
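For reference, each test follows roughly this shape (a minimal sketch, assuming the branch's ArrowWriter and ParquetFileArrowReader APIs plus a tempfile dev-dependency; exact constructors, reader types, and the data comparison may differ from the actual PR code):

```rust
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

#[test]
fn int32_single_column_roundtrip_sketch() {
    // Build a single-column RecordBatch of the type under test.
    let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Int32, false)]));
    let values = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let expected = RecordBatch::try_new(schema.clone(), vec![Arc::new(values)]).unwrap();

    // Write the batch out to a temporary Parquet file...
    let file = tempfile::tempfile().unwrap();
    let mut writer = ArrowWriter::try_new(file.try_clone().unwrap(), schema, None).unwrap();
    writer.write(&expected).unwrap();
    writer.close().unwrap();

    // ...then read it back into Arrow.
    let reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
    let actual = arrow_reader
        .get_record_reader(1024)
        .unwrap()
        .next()
        .unwrap()
        .unwrap();

    // Compare schema and column data before and after the round trip.
    assert_eq!(expected.schema(), actual.schema());
    assert_eq!(expected.column(0).data(), actual.column(0).data());
}
```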

@carols10cents (Contributor Author)

cc @nevi-me

@alamb (Contributor) commented Oct 2, 2020

I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure whether this is appropriate because, for some data types, the null_bitmap doesn't match, and I'm not sure whether it's supposed to.

I think the null bitmap should definitely match after the round trip (as it signals which rows are null, which should be preserved).

I am not sure the actual data arrays have to be the same -- what needs to be the same is the data rows where the null_bit is 0 (aka we should probably be checking only for non-null rows)

@alamb (Contributor) left a comment

FWIW I think the basic idea and structure of these tests is ✔️ -- with the possible exception of directly comparing the data values regardless of the null bit vector, I think the tests are good.

@carols10cents (Contributor Author) left a comment

Adding some notes about why some tests are ignored/failing

}

#[test]
#[ignore] // Large Binary support isn't correct yet
@carols10cents (Contributor Author):

This test currently fails because both LargeBinary and Binary Arrow types are converted to Parquet BYTE_ARRAY types. I think this is what's intended, and I can't find any C++ tests that specifically exercise Arrow LargeBinary data, so I'm not sure if it's just supposed to roundtrip and lose the "large" information or if it's supposed to panic as not supported either on writing or reading.

---- arrow::arrow_writer::tests::large_binary_single_column stdout ----
thread 'arrow::arrow_writer::tests::large_binary_single_column' panicked at 'assertion failed: `(left == right)`
  left: `Schema { fields: [Field { name: "col", data_type: LargeBinary, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`,
 right: `Schema { fields: [Field { name: "col", data_type: Binary, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`', parquet/src/arrow/arrow_writer.rs:712:9

Contributor:

I'm currently working on adding some of this support in C++. Based on the Parquet metadata alone, there isn't a great way to distinguish these types. The way C++ handles it is that, on serialization, the Arrow schema for the table is added as file metadata (base64-encoded). When reading back, we deserialize it and use the types there to make an additional determination. Currently we only use it for extension types.
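In Rust terms, a rough sketch of that idea might look like the following (the arrow_schema_to_ipc_bytes helper is hypothetical and left unimplemented, and the base64 crate is an assumed dependency; C++ stores the IPC-serialized schema under the "ARROW:schema" key):

```rust
use arrow::datatypes::Schema;

/// Hypothetical helper: serialize the Arrow schema to its IPC message bytes.
fn arrow_schema_to_ipc_bytes(_schema: &Schema) -> Vec<u8> {
    unimplemented!("would emit the Arrow IPC schema message")
}

/// Build the key/value pair that would go into the Parquet footer's
/// key_value_metadata so a reader can recover the original Arrow types.
fn schema_metadata_entry(schema: &Schema) -> (String, String) {
    let ipc_bytes = arrow_schema_to_ipc_bytes(schema);
    ("ARROW:schema".to_string(), base64::encode(&ipc_bytes))
}
```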

Contributor:

In this case, the reader is supposed to get the correct type from the serialized schema, and then do the conversion to a large array
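A sketch of what that reader-side conversion could look like, assuming the cast kernel supports (or is taught to support) widening Binary/Utf8 to their Large counterparts; the function is illustrative, not existing crate code:

```rust
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

/// After decoding a Parquet BYTE_ARRAY column as Binary/Utf8, widen it back to
/// the Large variant when the serialized Arrow schema says that was the original type.
fn maybe_widen(decoded: ArrayRef, original_type: &DataType) -> Result<ArrayRef> {
    match original_type {
        DataType::LargeBinary | DataType::LargeUtf8 => cast(&decoded, original_type),
        _ => Ok(decoded),
    }
}
```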

}

#[test]
#[ignore] // Large String support isn't correct yet - null_bitmap and buffers don't match
@carols10cents (Contributor Author):

This is actually the same as LargeArray -- the "Large" part of LargeUtf8 isn't coming back, and I'm not sure of the intended way to handle that.

---- arrow::arrow_writer::tests::large_string_single_column stdout ----
thread 'arrow::arrow_writer::tests::large_string_single_column' panicked at 'assertion failed: `(left == right)`
  left: `Schema { fields: [Field { name: "col", data_type: LargeUtf8, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`,
 right: `Schema { fields: [Field { name: "col", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false }], metadata: {} }`', parquet/src/arrow/arrow_writer.rs:712:9

Contributor:

Could you please have a look at the function parquet_to_arrow_schema_by_columns and https://issues.apache.org/jira/browse/ARROW-10168? I suspect this might be why the Arrow metadata gets lost during the roundtrip.

@emkornfield (Contributor)

I am not sure the actual data arrays have to be the same -- what needs to be the same is the data rows where the null_bit is 0 (aka we should probably be checking only for non-null rows)

Small nit: I'm pretty sure null_bit = 1 is actually where values should be compared (thinking of it as a validity bitmap instead of a null bitmap helps). And yes I agree they should be compared separately, as there is generally no requirement on the values of null fields (generally, you'll want to initialize to some known values to avoid security issues).
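Concretely, the comparison could look something like this sketch for a primitive array (is_valid reads the validity bitmap):

```rust
use arrow::array::{Array, Int32Array};

/// Assert that two arrays agree on validity, and on values only where the
/// validity bit is 1; null slots may hold arbitrary values and are not compared.
fn assert_valid_slots_equal(expected: &Int32Array, actual: &Int32Array) {
    assert_eq!(expected.len(), actual.len());
    for i in 0..expected.len() {
        assert_eq!(expected.is_valid(i), actual.is_valid(i));
        if expected.is_valid(i) {
            assert_eq!(expected.value(i), actual.value(i));
        }
    }
}
```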

@nevi-me (Contributor) left a comment

Hi @carols10cents, I've added some notes on why some of the tests are failing. I opened integer32llc#1. I ran out of time; I could have looked at the other ignored tests, but I suppose running out of time is why I need the help :)

);
}

// How to create Float16 values that aren't supported in Rust?
Contributor:

We can't create them, so we always leave f16 as unsupported

}

#[test]
#[ignore] // Date support isn't correct yet
Contributor:

The failure due to ArrowError("underlying Arrow error: Compute error: Casting from Date64(Millisecond) to Int32 not supported") can be addressed by adding a cast in arrow::compute::kernels::cast.

We fail to cast the data at arrow_writer::197. The main reason some casts aren't supported is that I was worried about type loss if one converts something like date32 (number of days since the epoch) to int64 and then back to date64 (number of milliseconds since the epoch). Such a conversion should be equivalent to a direct date32 to date64 cast, which I think it currently isn't.
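To illustrate the potential loss: date32 counts days since the epoch while date64 counts milliseconds, so a correct conversion scales the value rather than reinterpreting it (a small standalone illustration, not the kernel code):

```rust
/// Date32 stores days since the Unix epoch; Date64 stores milliseconds since it.
const MILLIS_PER_DAY: i64 = 24 * 60 * 60 * 1000;

fn date32_to_date64(days_since_epoch: i32) -> i64 {
    days_since_epoch as i64 * MILLIS_PER_DAY
}

fn main() {
    // 18_000 days after 1970-01-01 is 1_555_200_000_000 ms, not 18_000 ms,
    // which is the kind of loss a naive date32 -> int -> date64 path risks.
    assert_eq!(date32_to_date64(18_000), 1_555_200_000_000);
}
```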

}

#[test]
#[ignore] // Large Binary support isn't correct yet
Contributor:

In this case, the reader is supposed to get the correct type from the serialized schema, and then do the conversion to a large array

fn struct_single_column() {
let a_values = Int32Array::from(vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
let struct_field_a = Field::new("f", DataType::Int32, false);
let s = StructArray::from(vec![(struct_field_a, Arc::new(a_values) as ArrayRef)]);
Contributor:

The cause might be this. See https://issues.apache.org/jira/browse/ARROW-5408.

@nevi-me (Contributor) commented Oct 3, 2020

And yes I agree they should be compared separately, as there is generally no requirement on the values of null fields (generally, you'll want to initialize to some known values to avoid security issues).

There was a time ~2 years ago when null slots would contain garbage/unknown values, but I haven't seen that in a very long time.
With nested arrays, we'll also want to compare the data directly, as a struct with null slots containing a primitive that's non-nullable would result in a nullable primitive after the roundtrip.

@emkornfield, in the above example, does the C++ Arrow implementation allow creating such a struct? Asked differently: is it some validation that we should be adding at the Arrow level in Rust, that if you create a nullable struct, you can't have its children be non-nullable?

@nevi-me force-pushed the rust-parquet-arrow-writer branch from 16f95e9 to 7f743c2 on October 3, 2020 07:44
@nevi-me (Contributor) commented Oct 3, 2020

I rebased the parquet branch against the main one, but now I seem to have picked up additional commits on this PR. They should be fixed by a rebase. @carols10cents, could you please grant me write permission on Integer32's fork so I can rebase there?

@carols10cents (Contributor Author)

@nevi-me you should have an invite!

carols10cents and others added 3 commits October 5, 2020 14:59
@carols10cents (Contributor Author)

@nevi-me I went ahead and merged in your PR, rebased this PR's branch on rust-parquet-arrow-writer, added one commit to explicitly not support Float16 in the match in write_leaves, and force pushed. Do you think this can be merged or do more of the ignored tests need to be passing? I'm going to take a look at reading the schema metadata next, in any case.

@nevi-me (Contributor) commented Oct 6, 2020

@nevi-me I went ahead and merged in your PR, rebased this PR's branch on rust-parquet-arrow-writer, added one commit to explicitly not support Float16 in the match in write_leaves, and force pushed. Do you think this can be merged or do more of the ignored tests need to be passing? I'm going to take a look at reading the schema metadata next, in any case.

I'll merge this in my morning (GMT+2). I'm happy with the tests; they also help with unitising further work so that others can help us out. I'll open JIRAs for the cases that still fail.

@nevi-me changed the title from "[Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes" to "ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes" on Oct 6, 2020
@nevi-me (Contributor) commented Oct 6, 2020

@carols10cents, could you please create a JIRA account at https://issues.apache.org so that we can assign the tasks you've worked on to you? Thanks :)


@nevi-me (Contributor) commented Oct 6, 2020

Hi @andygrove @jorgecarleitao, could you please merge this for me? The merge tool is failing to push to GH on my end :(

@carols10cents (Contributor Author)

@nevi-me Done! https://issues.apache.org/jira/secure/ViewProfile.jspa?name=carols10cents

andygrove pushed a commit that referenced this pull request Oct 6, 2020
…r all supported Arrow DataTypes

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
@andygrove (Member) commented Oct 6, 2020

@nevi-me @carols10cents I ran the merge tool and the changes appear to have been committed to the branch but this PR did not get closed, so I am closing it manually.

@andygrove andygrove closed this Oct 6, 2020
@nevi-me (Contributor) commented Oct 6, 2020

@nevi-me @carols10cents I ran the merge tool and the changes appear to have been committed to the branch but this PR did not get closed, so I am closing it manually.

It's been the case since I started merging on the branch. I've been meaning to open a JIRA for it, but keep forgetting. I've now opened https://issues.apache.org/jira/browse/ARROW-10198.

nevi-me added a commit that referenced this pull request Oct 7, 2020
…m Parquet metadata when available

@nevi-me This is one commit on top of #8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well.

That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate?

Closes #8354 from carols10cents/schema-roundtrip

Lead-authored-by: Carol (Nichols || Goulding) <[email protected]>
Co-authored-by: Neville Dipale <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>