Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

when informing data types, use functions to parse type (e.g. DateTime) #899

Closed
hdcpereira opened this issue Sep 13, 2021 · 5 comments
Closed

Comments

@hdcpereira
Copy link

hdcpereira commented Sep 13, 2021

Lets say I have a dataset with a DateTime type in one of its columns. Some SQL requests return datetime types as YYYY-MM-DD HH:MM:SS e.g. 2019-03-01 14:04:31. When reading with CSV.read I'm not able to use

my_data = CSV.read(file, DataFrame, header = 1, copycols = true, types = Dict(:time => Dates.DateTime)) 

because Dates.DateTime doesn't accept YYYY-MM-DD HH:MM:SS only YYYY-MM-DDTHH:MM:SS. One should be able to inform what is the DateTime structure before reading it with Dates.DateTime function, which allows us to inform what is the structure of the DateTime. e.g.

my_data = CSV.read(file, DataFrame, header = 1, copycols = true,types = Dict(:time => x -> Dates.DateTime(x, "YYYY-MM-DD HH:MM:SS"))) 
@nickrobinson251
Copy link
Collaborator

Does passing dateformat = "YYYY-MM-DD HH:MM:SS" address the issue?
e.g.

my_data = CSV.read(file, DataFrame, header = 1, copycols = true, types = Dict(:time => Dates.DateTime), dateformat = "YYYY-MM-DD HH:MM:SS") 

https://csv.juliadata.org/latest/reading.html#dateformat

@hdcpereira
Copy link
Author

hdcpereira commented Sep 13, 2021

Does passing dateformat = "YYYY-MM-DD HH:MM:SS" address the issue?
e.g.

my_data = CSV.read(file, DataFrame, header = 1, copycols = true, types = Dict(:time => Dates.DateTime), dateformat = "YYYY-MM-DD HH:MM:SS") 

https://csv.juliadata.org/latest/reading.html#dateformat

Nope. It returns me this error for every row in my dataset.

Warning: thread = 1 warning: error parsing DateTime around row = 1, col = 1: "2019-10-03 14:04:22,", error=INVALID: OK | DELIMITED | INVALID_DELIMITER 
└ @ CSV /home/hpereira/.julia/packages/CSV/22psr/src/file.jl:626

And the dataframe

time|
DateTime(?)|
missing |
missing |
missing | 
missing | 
missing | 
...

@nickrobinson251
Copy link
Collaborator

nickrobinson251 commented Sep 13, 2021

i wonder if that's just because the dateformat is incorrect. How about dateformat = "yyy-mm-dd HH:MM:SS"?

@hdcpereira
Copy link
Author

wow, it worked just fine. But I remain with question. Wouldn't it be a nice feature to let the function operate in the dictionary?

@quinnj
Copy link
Member

quinnj commented Sep 13, 2021

It would be really hard, nay impossible, to support arbitrary transform functions efficiently like this. For instance, in your first example, calling DateTime(x, "yyyy-mm-dd HH:MM:SS") on every value in a datetime column would create a new DateFormat object for each cell. That's a big reason why, for custom datetime formats, we require passing it in at the top-level as a dedicated keyword argument.

The other difficulty in allowing arbitrary functions would be the column typing; you can't really know the result type of applying a function like this, so it would be hard to have the column value setting be an efficient operation, which would really bloat parsing times.

As I've considered this in the past, there's really no advantage to having something like that builtin to CSV.read vs. just doing the transformation yourself post-parsing on the CSV.File or DataFrame object.

@quinnj quinnj closed this as completed Sep 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants