Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

ndjson inferred schema differs from the schema inferred by JsonFormat when integrate with arrow-datafusion #1257

Open
v0y4g3r opened this issue Sep 23, 2022 · 3 comments

Comments

@v0y4g3r
Copy link

v0y4g3r commented Sep 23, 2022

What I'm trying to do

AFAIK many open source projects are using datafusion as query engine but they also want to leverage the peformance of arrow2. But the arrow2 branch of arrow-datafusion falls far behind latest arrow2 version. So I'm trying to bump arrow2 dependency of arrow-datafusion.

What happened

During that process I found that some tests are failing because of json schema inference.

For example, in this test

 async fn nd_json_exec_file_projection() -> Result<()> {
        let session_ctx = SessionContext::new();
        let task_ctx = session_ctx.task_ctx();
        let (object_store_url, file_groups, file_schema) =
            prepare_store(&session_ctx).await;

        let exec = NdJsonExec::new(FileScanConfig {
            object_store_url,
            file_groups,
            file_schema,
            statistics: Statistics::default(),
            projection: Some(vec![0, 2]),
            limit: None,
            table_partition_cols: vec![],
        });
        let inferred_schema = exec.schema();
        assert_eq!(inferred_schema.fields().len(), 2);

        inferred_schema.field_with_name("a").unwrap();
        inferred_schema.field_with_name("b").unwrap_err();
        inferred_schema.field_with_name("c").unwrap();
        inferred_schema.field_with_name("d").unwrap_err();

        let mut it = exec.execute(0, task_ctx)?;
        let batch = it.next().await.unwrap()?;

        assert_eq!(batch.num_rows(), 4);
        let values = batch
            .column(0)
            .as_any()
            .downcast_ref::<arrow::array::Int64Array>()
            .unwrap();
        assert_eq!(values.value(0), 1);
        assert_eq!(values.value(1), -10);
        assert_eq!(values.value(2), 2);
        Ok(())
    }

If there's an ndjson file:

{"a":1, "b":0.1, "c": true, "d": "1"}
{"a":2, "b":0.2, "c": false, "d": "2"}
  • JSON file schema inferred by datafusion::datasource::file_format::json::JsonFormat::infer_schema is a list of fields:["a": Int64,"b": floa,"c":boolean,"d":string]
  • But schema(data type) inferred by arrow2::io::ndjson::read::file::infer is a DataType::Struct{ fields=["a": Int64,"b": floa,"c":boolean,"d":string]}.

To make it simple, datafusion's JSON format infers the schema of each line inside a ndjson file, with fields flatten, but arrow2's ndjson crate take each line inside a ndjson file as a struct. This difference makes it hard to do projection on an ndjson file.

Solutions on arrow-datafusion

I did some research on how arrow-datafusion (main branch) handles ndjson file, turned out that arrow crate uses a Decoder for each line inside an ndjson file to deconstruct the struct to a list of fields so that following projection on these fields is possible.

My question

What should I do if I want to fix the failing tests regarding ndjson schema in arrow-datafusion's arrow2 branch? If the schema inference difference is by design, maybe I should delete the projection tests. Otherwise maybe we should implement the similar line deconstruction mechanism just like what arrow does.

@jorgecarleitao
Copy link
Owner

jorgecarleitao commented Sep 25, 2022

That is awesome!!

We infer it as a Struct to allow files of the form

[1]
[2]
[3]

since afaik they are valid ndjson files.

It seems that datafusion does not accept this. Thus, I would do

let fields = if let DataType::Struct(fields) = inferred_field {
    fields
} else {
    return Err("Datafusion only supports ndjson with objects on them")
}

would this work?

@v0y4g3r
Copy link
Author

v0y4g3r commented Sep 26, 2022

That is awesome!!

We infer it as a Struct to allow files of the form

[1]
[2]
[3]

since afaik they are valid ndjson files.

It seems that datafusion does not accept this. Thus, I would do

let fields = if let DataType::Struct(fields) = inferred_field {
    fields
} else {
    return Err("Datafusion only supports ndjson with objects on them")
}

would this work?

Yes, arrow-rs does not accept ndjson file that each row is an array like:

[1]
[2]
[3]

and will complain:

Error: ArrowError(JsonError("Expected JSON record to be an object, found Array [Number(1)]"))

But if arrow2 decides to accept this format, projection seems meaningless, since inferred schema only has one field (a struct that wraps all fields inside).

@jorgecarleitao
Copy link
Owner

Note that arrow2 already supports this format, since it is valid ndjson.

Yes, for projections it is meaningless, but is still a meaningful representation of data that query engines ought to support, imo

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants