Provide Arrow Schema Hint to Parquet Reader #5657
Comments
Here is one potential API:
let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
    // specify column "time" should be UTC
    // will error if this type can not be read from parquet
    .with_column_type("time", DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())));
println!("Converted arrow schema is: {}", builder.schema());
I am not quite sure how to handle identifying nested types with a single column name. For example, if the parquet file has
{
  "my_object": {
    "time": "12-01-02"
  }
}
maybe we would refer to the |
I think my expectation would be for you to provide the |
Let me try the remaining part if it is ok |
Basically agree with your idea |
Something like
let file = File::open("data.parquet").unwrap();
let builder = ParquetRecordBatchReaderBuilder::try_new(file)?
    // specify the arrow schema to read from this parquet file
    // will error if the types in the parquet file can not be converted
    // into the specified types.
    // Will ignore any embedded metadata about types when written
    .schema(schema);
println!("Converted arrow schema is: {}", builder.schema()); |
Do we need to add some checker in the function of the The compatibility is very important for the parquet reader |
The inference logic is already set up to use the arrow schema as a hint rather than as authoritative; if you give it something invalid, it will just ignore it |
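The "hint, not authoritative" behaviour described above can be sketched with a toy model. This is not the parquet crate's code: `PhysicalType`, `ArrowType`, `inferred`, `is_compatible`, and `resolve` are hypothetical names, and the compatibility rules are simplified assumptions chosen only to show the fallback.

```rust
// Toy model only: these types and rules are assumptions for illustration,
// not the parquet or arrow crates' real definitions.

/// Simplified stand-in for a Parquet physical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Int32,
    Int64,
    ByteArray,
}

/// Simplified stand-in for an Arrow data type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ArrowType {
    Int32,
    Int64,
    Binary,
    Utf8,
    /// Nanosecond timestamp with an optional timezone.
    TimestampNs(Option<String>),
}

/// Type chosen when no usable hint exists (assumed defaults).
pub fn inferred(physical: PhysicalType) -> ArrowType {
    match physical {
        PhysicalType::Int32 => ArrowType::Int32,
        PhysicalType::Int64 => ArrowType::Int64,
        PhysicalType::ByteArray => ArrowType::Binary,
    }
}

/// Whether the hinted type can be produced from the physical type
/// (assumed, deliberately small rule set).
pub fn is_compatible(physical: PhysicalType, hint: &ArrowType) -> bool {
    matches!(
        (physical, hint),
        (PhysicalType::Int32, ArrowType::Int32)
            | (PhysicalType::Int64, ArrowType::Int64 | ArrowType::TimestampNs(_))
            | (PhysicalType::ByteArray, ArrowType::Binary | ArrowType::Utf8)
    )
}

/// Use the hint when it is compatible; otherwise silently fall back
/// to the inferred type instead of returning an error.
pub fn resolve(physical: PhysicalType, hint: Option<&ArrowType>) -> ArrowType {
    match hint {
        Some(h) if is_compatible(physical, h) => h.clone(),
        _ => inferred(physical),
    }
}
```

In this sketch, hinting `TimestampNs(Some("UTC"))` for an `Int64` column is honoured, while the same hint on an `Int32` column is ignored and the column resolves to `Int32`. Whether an incompatible hint should be ignored or reported as an error is a design choice that this model does not settle.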
thanks, got it. |
I noticed this seemed to have stalled, so I thought I would have a go at it. The implemented API is different from the one discussed. In particular, the schema is supplied via ArrowReaderOptions:
let file = File::open("file.parquet").unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Int32, false)]));
let options = ArrowReaderOptions::new().with_schema(schema);
// This may fail as described above.
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options).unwrap();
let reader = builder.build().unwrap();
This is necessary because the schema needs to be provided as a hint when the metadata is read from the parquet file. The tests are incomplete and there are some open questions in the PR about how errors should be handled. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The parquet reader automatically uses an embedded arrow schema, when present, as a hint for type inference during decode. In particular, if the hinted type is compatible with the underlying parquet type, it performs a cast.
Describe the solution you'd like
In situations where the writer was not an arrow writer, this schema is not available, and the arrow types are therefore inferred from the parquet schema. This is not always desirable:
Describe alternatives you've considered
Additional context