Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Json decoder behavior changed from versions 21 to 21 and returns non-sensical num_rows for RecordBatch #2722

Closed
MathiasKindberg opened this issue Sep 13, 2022 · 4 comments
Labels
arrow Changes to the arrow crate

Comments

@MathiasKindberg
Copy link

MathiasKindberg commented Sep 13, 2022

Describe the bug
We have a test that relied on triggering an Arrow error by sending in an empty value to the decoder, truly questionable if that is sensible but that is how it was done to exercise failure cases in our business logic. When upgrading from version 21 or 22 this test stopped working.

Debugging it I could track the change to the next_batch function. Digging through recent PRs it seems like PR #2604 has changed the behavior when dealing with empty input values.

The weird thing now is that when calling num_rows on the produced record batch gives 1, even though no data is inside it which I would consider very undesirable behavior unless Arrow specifies that an empty RecordBatch has the length 1?

To Reproduce

fn main() {
    let some_json: Vec<Result<serde_json::Value, arrow::error::ArrowError>> =
        vec![Ok(serde_json::json!({}))];

    let schema = std::sync::Arc::new(arrow::datatypes::Schema::new(vec![]));
    let decoder_options =
        arrow::json::reader::DecoderOptions::new().with_batch_size(some_json.len());
    let decoder = arrow::json::reader::Decoder::new(schema, decoder_options);

    // This pr changes it. https://github.com/apache/arrow-rs/pull/2604
    let result = decoder.next_batch(&mut some_json.into_iter());
    dbg!(&result);
}

For version 21 this gives:

[src/main.rs:12] &result = Err(
    InvalidArgumentError(
        "must either specify a row count or at least one column",
    ),
)

For Version 22 this gives

[src/main.rs:12] &result = Ok(
    Some(
        RecordBatch {
            schema: Schema {
                fields: [],
                metadata: {},
            },
            columns: [],
            row_count: 1,
        },
    ),
)

Expected behavior
I would expect num_rows to return 0 on an empty record_batch. I don't see any issue with empty record batches existing, although the exact behavior of the next_batch function should likely be more thoroughly documented in regards to how it errors and why.

@tustvold
Copy link
Contributor

tustvold commented Sep 13, 2022

Is it possible the reason it is returning a length of 1, is that the specified schema is empty and so technically the number of rows is technically unbounded? How many times does 0 divide 0 😆 ?

Not saying we shouldn't fix this to behave the same way as before, but just wondering if this is specific to the case of an empty schema - where it isn't technically incorrect 😅

@Dandandan
Copy link
Contributor

Dandandan commented Sep 13, 2022

I think the RecordBatch with rows of 1 is as expected, as the input some_json also contains one record (without fields):

    let some_json: Vec<Result<serde_json::Value, arrow::error::ArrowError>> =
        vec![Ok(serde_json::json!({}))];

The previous implementation was always giving the error (because it doesn't deal with empty projections), while the correct result in this case should be a record batch with a row_count of 1 (even if it has no columns). If your input is vec![Ok(serde_json::json!({})), Ok(serde_json::json!({})] the expected row_count is 2 and with vec![] it should give a recordbatch with row_count of 0.

@jhorstmann
Copy link
Contributor

jhorstmann commented Sep 14, 2022

Sounds related to #1552

@Dandandan
Copy link
Contributor

Closing for now as it seems to work as intended - thanks for reporting @MathiasKindberg feel free to reopen if we have a bug or open an issue for missing documentation.

@alamb alamb added arrow Changes to the arrow crate and removed bug labels Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

No branches or pull requests

5 participants