ndjson inferred schema differs from the schema inferred by JsonFormat when integrate with arrow-datafusion #1257

v0y4g3r · 2022-09-23T06:51:27Z

What I'm trying to do

AFAIK many open source projects are using datafusion as query engine but they also want to leverage the peformance of arrow2. But the arrow2 branch of arrow-datafusion falls far behind latest arrow2 version. So I'm trying to bump arrow2 dependency of arrow-datafusion.

What happened

During that process I found that some tests are failing because of json schema inference.

For example, in this test

 async fn nd_json_exec_file_projection() -> Result<()> {
        let session_ctx = SessionContext::new();
        let task_ctx = session_ctx.task_ctx();
        let (object_store_url, file_groups, file_schema) =
            prepare_store(&session_ctx).await;

        let exec = NdJsonExec::new(FileScanConfig {
            object_store_url,
            file_groups,
            file_schema,
            statistics: Statistics::default(),
            projection: Some(vec![0, 2]),
            limit: None,
            table_partition_cols: vec![],
        });
        let inferred_schema = exec.schema();
        assert_eq!(inferred_schema.fields().len(), 2);

        inferred_schema.field_with_name("a").unwrap();
        inferred_schema.field_with_name("b").unwrap_err();
        inferred_schema.field_with_name("c").unwrap();
        inferred_schema.field_with_name("d").unwrap_err();

        let mut it = exec.execute(0, task_ctx)?;
        let batch = it.next().await.unwrap()?;

        assert_eq!(batch.num_rows(), 4);
        let values = batch
            .column(0)
            .as_any()
            .downcast_ref::<arrow::array::Int64Array>()
            .unwrap();
        assert_eq!(values.value(0), 1);
        assert_eq!(values.value(1), -10);
        assert_eq!(values.value(2), 2);
        Ok(())
    }

If there's an ndjson file:

{"a":1, "b":0.1, "c": true, "d": "1"}
{"a":2, "b":0.2, "c": false, "d": "2"}

JSON file schema inferred by datafusion::datasource::file_format::json::JsonFormat::infer_schema is a list of fields:["a": Int64,"b": floa,"c":boolean,"d":string]
But schema(data type) inferred by arrow2::io::ndjson::read::file::infer is a DataType::Struct{ fields=["a": Int64,"b": floa,"c":boolean,"d":string]}.

To make it simple, datafusion's JSON format infers the schema of each line inside a ndjson file, with fields flatten, but arrow2's ndjson crate take each line inside a ndjson file as a struct. This difference makes it hard to do projection on an ndjson file.

Solutions on arrow-datafusion

I did some research on how arrow-datafusion (main branch) handles ndjson file, turned out that arrow crate uses a Decoder for each line inside an ndjson file to deconstruct the struct to a list of fields so that following projection on these fields is possible.

My question

What should I do if I want to fix the failing tests regarding ndjson schema in arrow-datafusion's arrow2 branch? If the schema inference difference is by design, maybe I should delete the projection tests. Otherwise maybe we should implement the similar line deconstruction mechanism just like what arrow does.

The text was updated successfully, but these errors were encountered:

jorgecarleitao · 2022-09-25T04:21:27Z

That is awesome!!

We infer it as a Struct to allow files of the form

[1]
[2]
[3]

since afaik they are valid ndjson files.

It seems that datafusion does not accept this. Thus, I would do

let fields = if let DataType::Struct(fields) = inferred_field {
    fields
} else {
    return Err("Datafusion only supports ndjson with objects on them")
}

would this work?

v0y4g3r · 2022-09-26T06:50:33Z

That is awesome!!

We infer it as a Struct to allow files of the form
[1]
[2]
[3]
since afaik they are valid ndjson files.

It seems that datafusion does not accept this. Thus, I would do
let fields = if let DataType::Struct(fields) = inferred_field {
    fields
} else {
    return Err("Datafusion only supports ndjson with objects on them")
}
would this work?

Yes, arrow-rs does not accept ndjson file that each row is an array like:

[1]
[2]
[3]

and will complain:

Error: ArrowError(JsonError("Expected JSON record to be an object, found Array [Number(1)]"))

But if arrow2 decides to accept this format, projection seems meaningless, since inferred schema only has one field (a struct that wraps all fields inside).

jorgecarleitao · 2022-09-29T04:32:50Z

Note that arrow2 already supports this format, since it is valid ndjson.

Yes, for projections it is meaningless, but is still a meaningful representation of data that query engines ought to support, imo

v0y4g3r mentioned this issue Sep 26, 2022

Bump arrow2 apache/datafusion#2855

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ndjson inferred schema differs from the schema inferred by JsonFormat when integrate with arrow-datafusion #1257

ndjson inferred schema differs from the schema inferred by JsonFormat when integrate with arrow-datafusion #1257

v0y4g3r commented Sep 23, 2022 •

edited

Loading

jorgecarleitao commented Sep 25, 2022 •

edited

Loading

v0y4g3r commented Sep 26, 2022

jorgecarleitao commented Sep 29, 2022

ndjson inferred schema differs from the schema inferred by JsonFormat when integrate with arrow-datafusion #1257

ndjson inferred schema differs from the schema inferred by JsonFormat when integrate with arrow-datafusion #1257

Comments

v0y4g3r commented Sep 23, 2022 • edited Loading

What I'm trying to do

What happened

Solutions on arrow-datafusion

My question

jorgecarleitao commented Sep 25, 2022 • edited Loading

v0y4g3r commented Sep 26, 2022

jorgecarleitao commented Sep 29, 2022

v0y4g3r commented Sep 23, 2022 •

edited

Loading

jorgecarleitao commented Sep 25, 2022 •

edited

Loading