What I'm trying to do
AFAIK many open source projects use datafusion as their query engine but also want to leverage the performance of arrow2. However, the arrow2 branch of arrow-datafusion has fallen far behind the latest arrow2 release, so I'm trying to bump arrow-datafusion's arrow2 dependency.
What happened
During that process I found that some tests fail because of JSON schema inference. For example, in this test, given an ndjson file with columns a, b, c, and d:
The schema inferred by datafusion::datasource::file_format::json::JsonFormat::infer_schema is a flat list of fields: ["a": Int64, "b": float, "c": boolean, "d": string].
But the data type inferred by arrow2::io::ndjson::read::file::infer is a single DataType::Struct { fields: ["a": Int64, "b": float, "c": boolean, "d": string] }.
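To make the difference concrete, here is a minimal sketch (not from either project; "float" and "string" are mapped to Float64 and Utf8 for illustration, and the exact Field::new signature varies slightly between arrow2 versions) of the two shapes the inference produces:

```rust
use arrow2::datatypes::{DataType, Field};

// The four fields of the hypothetical ndjson file discussed in this issue.
fn example_fields() -> Vec<Field> {
    vec![
        Field::new("a", DataType::Int64, true),
        Field::new("b", DataType::Float64, true),
        Field::new("c", DataType::Boolean, true),
        Field::new("d", DataType::Utf8, true),
    ]
}

fn main() {
    // Shape produced by datafusion's JsonFormat::infer_schema: a flat list of
    // top-level fields, one per JSON key, so a projection can address each
    // column directly.
    let flat: Vec<Field> = example_fields();

    // Shape produced by arrow2's ndjson inference: one Struct datatype that
    // wraps the same fields, i.e. a single logical column for the whole row.
    let nested: DataType = DataType::Struct(example_fields());

    println!("datafusion-style: {flat:?}");
    println!("arrow2-style:     {nested:?}");
}
```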
To put it simply, datafusion's JSON format infers the schema of each line inside an ndjson file with the fields flattened, while arrow2's ndjson reader treats each line of an ndjson file as a struct. This difference makes it hard to apply a projection to an ndjson file.
Solutions on arrow-datafusion
I did some research on how arrow-datafusion (main branch) handles ndjson files. It turns out that the arrow crate uses a Decoder on each line of an ndjson file to deconstruct the struct into a list of fields, so that a subsequent projection on those fields is possible.
My question
What should I do if I want to fix the failing tests regarding ndjson schema in arrow-datafusion's arrow2 branch? If the schema inference difference is by design, maybe I should just delete the projection tests. Otherwise, maybe we should implement a similar line-deconstruction mechanism, like what arrow does (a sketch of the idea follows).
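On the "line deconstruction" idea, one minimal sketch of what the arrow2 side could look like is below. The function name is mine, not datafusion's; it only assumes StructArray's fields()/values() accessors, whose exact return types differ a bit between arrow2 versions:

```rust
use arrow2::array::{Array, StructArray};
use arrow2::datatypes::Field;

/// Given the StructArray that arrow2's ndjson reader produces for a batch of
/// rows, expose each child array as its own top-level column so the usual
/// per-field projection can be applied downstream.
fn flatten_struct(rows: &StructArray) -> Vec<(&Field, &dyn Array)> {
    rows.fields()
        .iter()
        .zip(rows.values().iter())
        .map(|(field, column)| (field, column.as_ref()))
        .collect()
}
```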
We infer it as a Struct to allow files of the form
[1]
[2]
[3]
since afaik they are valid ndjson files.
It seems that datafusion does not accept this. Thus, I would do
```rust
let fields = if let DataType::Struct(fields) = inferred_field {
    fields
} else {
    return Err("Datafusion only supports ndjson with objects in them");
};
```
would this work?
Yes, arrow-rs does not accept an ndjson file in which each row is an array, like:
[1]
[2]
[3]
and will complain:
Error: ArrowError(JsonError("Expected JSON record to be an object, found Array [Number(1)]"))
But if arrow2 decides to accept this format, projection seems meaningless, since the inferred schema only has one field (a struct that wraps all of the fields).
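For what it's worth, here is a slightly fuller sketch along the lines of the if let above (my own function names and a plain String error, for illustration only; not code from either project), showing how unwrapping the struct restores per-field projection:

```rust
use arrow2::datatypes::{DataType, Field};

/// Turn the DataType returned by ndjson inference into the flat field list
/// that projection logic expects; reject files whose rows are not objects
/// (e.g. a file of bare arrays like `[1]`), mirroring arrow-rs behaviour.
fn flat_fields(inferred: DataType) -> Result<Vec<Field>, String> {
    match inferred {
        DataType::Struct(fields) => Ok(fields),
        other => Err(format!("expected ndjson rows to be objects, got {other:?}")),
    }
}

/// With the fields flattened, projection is ordinary name-based selection again.
fn project<'a>(fields: &'a [Field], names: &[&str]) -> Vec<&'a Field> {
    fields
        .iter()
        .filter(|f| names.contains(&f.name.as_str()))
        .collect()
}
```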