CSV reader infers Date64 type for fields like "2020-03-19 00:00:00" that it can't parse to Date64 #3744
I came to make the same comment. Here is a sample CSV.
The last column
Example
Here is some code.
use arrow::datatypes::DataType::{
    Boolean, Date32, Date64, Float64, Int64, List, Time32, Time64, Timestamp, Utf8,
};
use arrow::record_batch::RecordBatch;
use arrow_csv::reader;
use std::fs::File;
use std::sync::Arc;

fn main() {
    let path = "data/weather.csv".to_owned();
    // Infer the schema using arrow_csv::reader (44 is the ASCII code for ',').
    let schema = reader::infer_schema_from_files(&[path.clone()], 44, Some(1000), true)
        .expect("Schema should be inferred");
    // For each field in the schema, match on the data type and collect a label into a `Vec<String>`.
    let data_types: Vec<String> = schema
        .fields()
        .iter()
        .map(|field| match field.data_type() {
            Boolean => "<bool>".to_string(),
            Int64 => "<int>".to_string(),
            Float64 => "<dbl>".to_string(),
            Utf8 => "<chr>".to_string(),
            List(_) => "<list>".to_string(),
            Date32 => "<date>".to_string(),
            Date64 => "<date64>".to_string(),
            Timestamp(_, _) => "<ts>".to_string(),
            Time32(_) => "<time>".to_string(),
            Time64(_) => "<time64>".to_string(),
            _ => "<_>".to_string(),
        })
        .collect();
    // Print the data types.
    println!("data types {:?}", data_types);
    let file = File::open(path).unwrap();
    let mut reader = reader::Reader::new(
        file,
        Arc::new(schema),
        true,
        Some(44),
        1024,
        None,
        None,
        None,
    );
    // Read the first record batch from the reader.
    let record_batch: RecordBatch = reader.next().unwrap().unwrap();
    // Print the record batch.
    println!("{:?}", record_batch);
}
The Error
data types ["<int>", "<chr>", "<int>", "<int>", "<int>", "<int>", "<dbl>", "<dbl>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<chr>", "<dbl>", "<date64>"]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseError("Error while parsing value 2013-01-01 02:00:00 for column 15 at line 2")'
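The failure above comes down to inference choosing a type whose parser then rejects the same value. A minimal sketch of a check for that invariant follows, using only the standard library. `unparseable_samples`, `Kind`, `toy_infer`, and `toy_can_parse` are hypothetical stand-ins that mimic the reported mismatch; they are not the arrow-csv functions.

```rust
/// Hypothetical, simplified set of inferred types.
#[derive(Debug, PartialEq)]
pub enum Kind {
    Date64,
    Utf8,
}

/// Toy inference (an assumption for illustration): anything that starts
/// with a `YYYY-MM-DD` prefix is labelled Date64, whatever follows it.
pub fn toy_infer(s: &str) -> Kind {
    let b = s.as_bytes();
    let dateish = b.len() >= 10
        && b[..10]
            .iter()
            .enumerate()
            .all(|(i, c)| if i == 4 || i == 7 { *c == b'-' } else { c.is_ascii_digit() });
    if dateish {
        Kind::Date64
    } else {
        Kind::Utf8
    }
}

/// Toy parser check (also an assumption): Date64 accepts only a bare
/// 10-character date, so it disagrees with `toy_infer` on date-times.
pub fn toy_can_parse(kind: &Kind, s: &str) -> bool {
    match kind {
        Kind::Date64 => s.len() == 10,
        Kind::Utf8 => true,
    }
}

/// Returns every sample whose inferred type cannot parse that same sample.
/// An empty result means inference and parsing agree on all samples.
pub fn unparseable_samples<'a, T>(
    samples: &[&'a str],
    infer: impl Fn(&str) -> T,
    can_parse: impl Fn(&T, &str) -> bool,
) -> Vec<&'a str> {
    samples
        .iter()
        .copied()
        .filter(|s| {
            let inferred = infer(*s);
            !can_parse(&inferred, *s)
        })
        .collect()
}
```

Run against the samples `"2020-03-19"`, `"2013-01-01 02:00:00"`, and `"abc"` with the toy functions, the check reports only `"2013-01-01 02:00:00"`, which is the shape of the problem described in this issue.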
Expected Behavior
Does not matter much to me. I just need (what I would call a date time) "2013-01-01 02:00:00" to not error. It would be nice if there was a
I think the CSV reader should be inferring Timestamp for such columns. It is unclear to me that it should ever infer Date64, as the semantics of that type are somewhat unclear, at least to me. I have posted an email to the Arrow mailing list to try to clarify what the Date64 type is for: https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419
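To make the suggestion concrete, here is a minimal, standard-library-only sketch of an inference rule that reports a timestamp whenever a value carries a time of day and reserves a date type for bare dates. The `InferredType` enum and the `infer_type` function are hypothetical names for illustration. This is not how arrow-csv implements inference, and it only recognizes the `YYYY-MM-DD` and `YYYY-MM-DD HH:MM:SS` shapes (no fractional seconds or time zones).

```rust
/// Hypothetical, simplified set of types an inference pass could report.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum InferredType {
    Date32,
    Timestamp,
    Utf8,
}

/// True if `s` is exactly `len` ASCII digits.
fn is_digits(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_digit())
}

/// True for the `YYYY-MM-DD` shape (field ranges are not validated).
fn looks_like_date(s: &str) -> bool {
    let parts: Vec<&str> = s.split('-').collect();
    parts.len() == 3
        && is_digits(parts[0], 4)
        && is_digits(parts[1], 2)
        && is_digits(parts[2], 2)
}

/// True for the `HH:MM:SS` shape (field ranges are not validated).
fn looks_like_time(s: &str) -> bool {
    let parts: Vec<&str> = s.split(':').collect();
    parts.len() == 3 && parts.iter().all(|p| is_digits(p, 2))
}

/// Bare dates map to Date32; a date followed by a time of day maps to
/// Timestamp; everything else falls back to Utf8. Date64 is never inferred.
pub fn infer_type(value: &str) -> InferredType {
    if looks_like_date(value) {
        return InferredType::Date32;
    }
    // Accept either a space or a 'T' between the date and the time.
    if let Some((date, time)) = value.split_once(|c: char| c == ' ' || c == 'T') {
        if looks_like_date(date) && looks_like_time(time) {
            return InferredType::Timestamp;
        }
    }
    InferredType::Utf8
}
```

With this rule, `infer_type("2020-03-19 00:00:00")` yields `InferredType::Timestamp` and `infer_type("2020-03-19")` yields `InferredType::Date32`, so the value from the issue title would never be routed to a Date64 parser.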
@tustvold I think that is sensible. I would call
Describe the bug
Starting with version 0.6.1 of csv2parquet (which upgraded its arrow-rs dependency from 24.0 to 30.0.1), input CSV files that were converting just fine are now failing with errors like this one:
The csv2parquet tool is a very thin wrapper around arrow-rs that basically calls arrow::csv::reader::infer_file_schema to infer a schema for the input file, then creates an arrow::csv::Reader from that schema and uses it to read the file.
The tool has a command line option that prints out the inferred schemas, and input values like that are being inferred into Date64:
And I've written a unit test case that demonstrates that
To Reproduce
I wrote a couple of very simple test cases illustrating the problem:
My test_can_parse_inferred_date64 in arrow-csv/src/reader/mod.rs specifically shows how infer_field_schema returns Date64 for the example string, and yet when we feed it to parse_item::<Date64Type> we get None.
Expected behavior
I see that the various Timestamp types are able to parse strings like that correctly, so I am not sure whether the more correct behavior for this library would be to infer a Timestamp type for these strings instead of the Date64 type. It does seem clear to me, however, that it should be possible to parse these strings as Date64.
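As a rough illustration of what parsing such a string as Date64 could mean: Date64 stores milliseconds since the Unix epoch, so a date-time string has a natural value if it is read as UTC. The sketch below uses only the standard library. `parse_datetime_millis` and its helpers are hypothetical names, not arrow-rs functions, and treating the naive date-time as UTC is an assumption. Whether Date64 is meant to carry a time of day at all is the open question raised in the comments.

```rust
/// Days between 1970-01-01 and the given proleptic Gregorian date
/// (the well-known "days from civil" arithmetic).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Shift the year so it starts in March; leap days then fall at the end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let shifted_month = if month > 2 { month - 3 } else { month + 9 };
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Splits `s` on `sep` into exactly three unsigned decimal fields.
fn three_fields(s: &str, sep: char) -> Option<(i64, i64, i64)> {
    let mut fields = s.split(sep).map(|p| {
        if !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit()) {
            p.parse::<i64>().ok()
        } else {
            None
        }
    });
    let a = fields.next()??;
    let b = fields.next()??;
    let c = fields.next()??;
    if fields.next().is_some() {
        return None; // more than three fields
    }
    Some((a, b, c))
}

/// Parses `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS` (space or 'T' separator)
/// into milliseconds since the Unix epoch, assuming the value is UTC.
/// Range checks are loose: day-of-month is not checked against the month.
pub fn parse_datetime_millis(s: &str) -> Option<i64> {
    let (date, time) = match s.split_once(|c: char| c == ' ' || c == 'T') {
        Some((d, t)) => (d, Some(t)),
        None => (s, None),
    };
    let (year, month, day) = three_fields(date, '-')?;
    let (hour, minute, second) = match time {
        Some(t) => three_fields(t, ':')?,
        None => (0, 0, 0),
    };
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }
    let seconds = days_from_civil(year, month, day) * 86_400 + hour * 3_600 + minute * 60 + second;
    Some(seconds * 1_000)
}
```

Under these assumptions, `parse_datetime_millis("2020-03-19 00:00:00")` returns `Some(1_584_576_000_000)`, the same value as for the bare date `"2020-03-19"`, while unparseable input returns `None`.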