-
Notifications
You must be signed in to change notification settings - Fork 850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use array_value_to_string in arrow-csv #3514
Conversation
This functionality was added in #1347 by @gsserge with the original issue raised by @rjnanderson I'm a little concerned we are removing functionality here on which people depend. I wonder if we can provide a way to pass options to array_value_to_string 🤔 |
i agree that keeping some way to specify the formatting is likely necessary and it would be great to add the ability to do so to |
Thanks @JayjeetAtGithub -- I plan to take a look tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JayjeetAtGithub -- it is really nice to get your contribution and I apologize in the delay of reviewing -- I think this is looking quite close.
The things I think are needed are:
- An explanation about why test_conversion_consistency was removed
- Change the signature from
&Option<String>
toOption<&str>
Nice to have
- Add some tests showing the rfc3339 behavior for time, dates, datetime, timestamps, timestamptz
When this PR is merged I'll try and write up some notes about making a better display API but I don't want to confuse this PR any more
arrow/tests/csv.rs
Outdated
@@ -57,8 +57,8 @@ fn test_export_csv_timestamps() { | |||
drop(writer); | |||
|
|||
let left = "c1,c2 | |||
2019-04-18T20:54:47.378000000+10:00,2019-04-18T10:54:47.378000000 | |||
2021-10-30T17:59:07.000000000+11:00,2021-10-30T06:59:07.000000000\n"; | |||
2019-04-18T20:54:47.378+10:00,2019-04-18T10:54:47.378000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sure looks better 😍
arrow-cast/src/display.rs
Outdated
column: &ArrayRef, | ||
row: usize, | ||
datetime_format_opt: &Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I think the typical rust way to do this (and not require an owned String
) is:
datetime_format_opt: &Option<String>, | |
datetime_format_opt: Option<&str>, |
And then at callsites with val: Option<String>
instead of:
array_value_to_string_internal(foo, bar, &val)
You do :
array_value_to_string_internal(foo, bar, val.as_ref())
arrow-csv/src/writer.rs
Outdated
.to_string(), | ||
String::from_utf8(buffer).unwrap() | ||
); | ||
} | ||
|
||
#[test] | ||
fn test_conversion_consistency() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this test removed? It seems useful, doesn't it?
arrow-csv/src/writer.rs
Outdated
@@ -430,33 +364,45 @@ impl WriterBuilder { | |||
self | |||
} | |||
|
|||
/// Whether to use RFC3339 format for date/time/timestamps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Whether to use RFC3339 format for date/time/timestamps | |
/// Use RFC3339 format for date/time/timestamps by clearing all | |
/// date/time specific formats. |
arrow-csv/src/writer.rs
Outdated
@@ -430,33 +364,45 @@ impl WriterBuilder { | |||
self | |||
} | |||
|
|||
/// Whether to use RFC3339 format for date/time/timestamps | |||
pub fn with_rfc3339(mut self, use_rfc3339: bool) -> Self { | |||
self.use_rfc3339 = use_rfc3339; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than an extra flag, what would you think about having this function simply set
self.date_format = None;
self.datetime_format = None;
....
?
As a separate flag it might be confusing to understand what happens if use_rfc3339
is set and then a user also set self.date_format
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I am also thinking if the user first uses with_rfc3339
and then with_date_format
, then the later will override the former, and it might also be confusing?! We would probably need to explain this behavior properly in the API docs.
arrow-cast/src/display.rs
Outdated
array_value_to_string_internal(column, row, &None) | ||
} | ||
|
||
pub fn datetime_array_value_to_string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat worried about adding a new public API like this, but I see it is needed to call in arrow-csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to revamp this display code to be easier to use / more performant but I think that is outside the scope of this PR. I'll try and file a ticket explaining what I am thinking
arrow-cast/src/display.rs
Outdated
pub fn datetime_array_value_to_string( | ||
column: &ArrayRef, | ||
row: usize, | ||
format: &Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format: &Option<String>, | |
format: Option<&str>, |
e97f8cb
to
ffbb030
Compare
Hi @alamb , Looks like there is a conflict. Any ideas to handle this would be great. Thanks. |
Hi @alamb , Looks like there is a conflict. Any ideas to handle this would be great. Thanks. Sure @JayjeetAtGithub -- I'll try and work on it later today |
alamb@MacBook-Pro-8 arrow-rs % git fetch apache
git fetch apache
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 3), reused 2 (delta 2), pack-reused 0
Unpacking objects: 100% (11/11), 8.37 KiB | 779.00 KiB/s, done.
From github.com:apache/arrow-rs
de381ec54..b826657ac master -> apache/master
alamb@MacBook-Pro-8 arrow-rs % git merge apache/master
git merge apache/master
Auto-merging arrow-cast/src/display.rs
Auto-merging arrow-csv/src/writer.rs
CONFLICT (content): Merge conflict in arrow-csv/src/writer.rs
Automatic merge failed; fix conflicts and then commit the result. I believe the logical conflict came from 24e5dae / #3570 Your changes may actually supercede the changes in #3570 but I would need to study it more carefully |
Thanks a lot for taking a look at this @alamb. |
Thank you for sticking with this @JayjeetAtGithub |
ffbb030
to
c36433a
Compare
@alamb I combined the changes #3570 and here is an initial version of that. I have mostly introduced a |
c2be153
to
e52940a
Compare
7ad8a8b
to
98251e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JayjeetAtGithub -- I plan to review this one carefully tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good to go -- thank you @JayjeetAtGithub
I am slightly concerned that we may have changed behavior of array_value_to_string
on invalid data (aka return an error rather than a string) but I don't see any tests that change 🤔
|
||
fn invalid_cast_error(dt: &str, col_idx: usize, row_idx: usize) -> ArrowError { | ||
ArrowError::CastError(format!( | ||
"Cannot cast to {dt} at col index: {col_idx} row index: {row_idx}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the addition of column index to the error message 👍
Ok(array | ||
.value_as_date($row) | ||
.map(|d| d.to_string()) | ||
.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this change the behavior with invalid dates (.e.g it will error rather than successfully write out "ERROR CONVERTING DATE")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, It would. Do we want to keep the former behaviour ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please, let's keep the existing behavior
I think #3638 will result in a way to eventually control what happens on invalid data (e.g. error or placeholder) so I think we should avoid changing the behavior in this PR
@@ -524,6 +721,22 @@ pub fn array_value_to_string( | |||
} | |||
} | |||
|
|||
pub fn temporal_array_value_to_string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please add some doc comments about what this does (specifically what DataTypes
it accepts) and what the format is used for?
I realize you didn't add this function so we can do so as a follow on PR I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Sure. I was also thinking renaming this function to something like array_value_to_string_using_format
. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
DataType::Date64 => { | ||
write_temporal_value!( | ||
DataType::Boolean => array_value_to_string(col, row_index)?.to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 that certainly looks better
@@ -463,6 +384,19 @@ impl WriterBuilder { | |||
self | |||
} | |||
|
|||
/// Use RFC3339 format for date/time/timestamps by clearing all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I was inspired by this PR to write up a potential new API for formatting arrays as strings as the current one has bothered me for a long time: #3638 |
I'm going to merge this in so #3647 doesn't generate huge numbers of conflicts with it |
Benchmark runs are scheduled for baseline = 2b9bbce and contender = 75a56be. 75a56be is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thanks again @JayjeetAtGithub and @tustvold ! |
Which issue does this PR close?
Closes #3483.
Closes #3513