Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use array_value_to_string in arrow-csv #3514

Merged
merged 22 commits into from
Feb 2, 2023

Conversation

JayjeetAtGithub
Copy link
Contributor

@JayjeetAtGithub JayjeetAtGithub commented Jan 12, 2023

Which issue does this PR close?

Closes #3483.
Closes #3513

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 12, 2023
@tustvold tustvold added the api-change Changes to the arrow API label Jan 12, 2023
@tustvold
Copy link
Contributor

This functionality was added in #1347 by @gsserge with the original issue raised by @rjnanderson I'm a little concerned we are removing functionality here on which people depend.

I wonder if we can provide a way to pass options to array_value_to_string 🤔

@alamb
Copy link
Contributor

alamb commented Jan 12, 2023

i agree that keeping some way to specify the formatting is likely necessary and it would be great to add the ability to do so to array_value_to_string

@JayjeetAtGithub
Copy link
Contributor Author

Thanks for the idea @tustvold @alamb. I will explore how to do that.

@tustvold tustvold marked this pull request as draft January 13, 2023 07:49
@JayjeetAtGithub JayjeetAtGithub marked this pull request as ready for review January 18, 2023 07:49
@JayjeetAtGithub
Copy link
Contributor Author

@alamb @tustvold I have done some initial changes here. Looking forward to your feedback. Thanks a lot.

@alamb
Copy link
Contributor

alamb commented Jan 18, 2023

Thanks @JayjeetAtGithub -- I plan to take a look tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JayjeetAtGithub -- it is really nice to get your contribution and I apologize in the delay of reviewing -- I think this is looking quite close.

The things I think are needed are:

  1. An explanation about why test_conversion_consistency was removed
  2. Change the signature from &Option<String> to Option<&str>

Nice to have

  1. Add some tests showing the rfc3339 behavior for time, dates, datetime, timestamps, timestamptz

When this PR is merged I'll try and write up some notes about making a better display API but I don't want to confuse this PR any more

@@ -57,8 +57,8 @@ fn test_export_csv_timestamps() {
drop(writer);

let left = "c1,c2
2019-04-18T20:54:47.378000000+10:00,2019-04-18T10:54:47.378000000
2021-10-30T17:59:07.000000000+11:00,2021-10-30T06:59:07.000000000\n";
2019-04-18T20:54:47.378+10:00,2019-04-18T10:54:47.378000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sure looks better 😍

column: &ArrayRef,
row: usize,
datetime_format_opt: &Option<String>,
Copy link
Contributor

@alamb alamb Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I think the typical rust way to do this (and not require an owned String) is:

Suggested change
datetime_format_opt: &Option<String>,
datetime_format_opt: Option<&str>,

And then at callsites with val: Option<String> instead of:

array_value_to_string_internal(foo, bar, &val)

You do :

array_value_to_string_internal(foo, bar, val.as_ref())

.to_string(),
String::from_utf8(buffer).unwrap()
);
}

#[test]
fn test_conversion_consistency() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this test removed? It seems useful, doesn't it?

@@ -430,33 +364,45 @@ impl WriterBuilder {
self
}

/// Whether to use RFC3339 format for date/time/timestamps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Whether to use RFC3339 format for date/time/timestamps
/// Use RFC3339 format for date/time/timestamps by clearing all
/// date/time specific formats.

@@ -430,33 +364,45 @@ impl WriterBuilder {
self
}

/// Whether to use RFC3339 format for date/time/timestamps
pub fn with_rfc3339(mut self, use_rfc3339: bool) -> Self {
self.use_rfc3339 = use_rfc3339;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than an extra flag, what would you think about having this function simply set

  self.date_format = None;
  self.datetime_format = None; 
....

?

As a separate flag it might be confusing to understand what happens if use_rfc3339 is set and then a user also set self.date_format or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I am also thinking if the user first uses with_rfc3339 and then with_date_format, then the later will override the former, and it might also be confusing?! We would probably need to explain this behavior properly in the API docs.

array_value_to_string_internal(column, row, &None)
}

pub fn datetime_array_value_to_string(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat worried about adding a new public API like this, but I see it is needed to call in arrow-csv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to revamp this display code to be easier to use / more performant but I think that is outside the scope of this PR. I'll try and file a ticket explaining what I am thinking

pub fn datetime_array_value_to_string(
column: &ArrayRef,
row: usize,
format: &Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
format: &Option<String>,
format: Option<&str>,

@JayjeetAtGithub
Copy link
Contributor Author

Hi @alamb , Looks like there is a conflict. Any ideas to handle this would be great. Thanks.

@alamb
Copy link
Contributor

alamb commented Jan 23, 2023

Hi @alamb , Looks like there is a conflict. Any ideas to handle this would be great. Thanks.

Sure @JayjeetAtGithub -- I'll try and work on it later today

@alamb
Copy link
Contributor

alamb commented Jan 23, 2023

Hi @JayjeetAtGithub

alamb@MacBook-Pro-8 arrow-rs % git fetch apache
git fetch apache
remote: Enumerating objects: 11, done.        
remote: Counting objects: 100% (11/11), done.        
remote: Compressing objects: 100% (9/9), done.        
remote: Total 11 (delta 3), reused 2 (delta 2), pack-reused 0        
Unpacking objects: 100% (11/11), 8.37 KiB | 779.00 KiB/s, done.
From github.com:apache/arrow-rs
   de381ec54..b826657ac  master     -> apache/master
alamb@MacBook-Pro-8 arrow-rs % git merge apache/master
git merge apache/master
Auto-merging arrow-cast/src/display.rs
Auto-merging arrow-csv/src/writer.rs
CONFLICT (content): Merge conflict in arrow-csv/src/writer.rs
Automatic merge failed; fix conflicts and then commit the result.

I believe the logical conflict came from 24e5dae / #3570

Your changes may actually supercede the changes in #3570 but I would need to study it more carefully

@JayjeetAtGithub
Copy link
Contributor Author

Thanks a lot for taking a look at this @alamb.

@alamb
Copy link
Contributor

alamb commented Jan 24, 2023

Thank you for sticking with this @JayjeetAtGithub

@JayjeetAtGithub
Copy link
Contributor Author

JayjeetAtGithub commented Jan 26, 2023

@alamb I combined the changes #3570 and here is an initial version of that. I have mostly introduced a col_idx parameter to temporal_array_value_to_string, so that we can show an informative error message during a cast error. The col_idx parameter is an useful addition to not only temporal data types but all types. It requires a lot of changes, so If the current design looks good, I can add it in this PR. Thanks.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JayjeetAtGithub -- I plan to review this one carefully tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to go -- thank you @JayjeetAtGithub

I am slightly concerned that we may have changed behavior of array_value_to_string on invalid data (aka return an error rather than a string) but I don't see any tests that change 🤔

cc @viirya and @tustvold


fn invalid_cast_error(dt: &str, col_idx: usize, row_idx: usize) -> ArrowError {
ArrowError::CastError(format!(
"Cannot cast to {dt} at col index: {col_idx} row index: {row_idx}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the addition of column index to the error message 👍

Ok(array
.value_as_date($row)
.map(|d| d.to_string())
.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this change the behavior with invalid dates (.e.g it will error rather than successfully write out "ERROR CONVERTING DATE")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, It would. Do we want to keep the former behaviour ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please, let's keep the existing behavior

I think #3638 will result in a way to eventually control what happens on invalid data (e.g. error or placeholder) so I think we should avoid changing the behavior in this PR

@@ -524,6 +721,22 @@ pub fn array_value_to_string(
}
}

pub fn temporal_array_value_to_string(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add some doc comments about what this does (specifically what DataTypes it accepts) and what the format is used for?

I realize you didn't add this function so we can do so as a follow on PR I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Sure. I was also thinking renaming this function to something like array_value_to_string_using_format. What do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is pub I think we should leave it the same name for now, as I suspect @tustvold has some larger changes in the API planed as part of #3638 and it would be nice to avoid requiring users to change their code twice

}
DataType::Date64 => {
write_temporal_value!(
DataType::Boolean => array_value_to_string(col, row_index)?.to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 that certainly looks better

@@ -463,6 +384,19 @@ impl WriterBuilder {
self
}

/// Use RFC3339 format for date/time/timestamps by clearing all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Jan 30, 2023

I was inspired by this PR to write up a potential new API for formatting arrays as strings as the current one has bothered me for a long time: #3638

@tustvold
Copy link
Contributor

tustvold commented Feb 2, 2023

I'm going to merge this in so #3647 doesn't generate huge numbers of conflicts with it

@tustvold tustvold merged commit 75a56be into apache:master Feb 2, 2023
@ursabot
Copy link

ursabot commented Feb 2, 2023

Benchmark runs are scheduled for baseline = 2b9bbce and contender = 75a56be. 75a56be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Copy link
Contributor

alamb commented Feb 2, 2023

Thanks again @JayjeetAtGithub and @tustvold !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent output between pretty print and CSV writer for Arrow Use array_value_to_string in arrow-csv
4 participants