MINOR: [PARQUET] Clean up Skip method naming rows->values #13997

Merged
merged 15 commits into from
Aug 31, 2022

Conversation

fatemehp
Contributor

The Skip method skips values, not rows. "Values" is used interchangeably with "levels" throughout the code. A repeated field may have multiple values per row, so "rows" is inaccurate: Skip does not jump over the remaining values of a repeated field to the next row.

Similarly, two other variables, total_num_rows_ and seen_num_rows_, actually refer to values and not rows, so I updated them as well.

I will add more tests for the Skip method that will clarify this behavior for repeated fields in a separate change.
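
To make the rows-versus-values distinction concrete, here is a minimal sketch. It is a toy model, not the Parquet reader API: it assumes a column is described by one repetition level per value slot, with level 0 marking the first value of a new row. The names `CountRows` and `CountValues` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (assumption, not Parquet code): one repetition level per value.
// A level of 0 starts a new row; any higher level continues the current row.
int64_t CountRows(const std::vector<int16_t>& rep_levels) {
  int64_t rows = 0;
  for (int16_t level : rep_levels) {
    if (level == 0) ++rows;  // only row-starting values are counted
  }
  return rows;
}

// Every level corresponds to one value slot, so the value count is the size.
int64_t CountValues(const std::vector<int16_t>& rep_levels) {
  return static_cast<int64_t>(rep_levels.size());
}
```

With levels `{0, 1, 1, 0, 1, 0}` the two counts differ (3 rows, 6 values), while for a non-repeated field, where every level is 0, they coincide. That is why a counter named after rows is misleading when it is really advanced per value.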

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@fatemehp changed the title from "Clean up Skip method naming rows->values" to "Minor: [PARQUET] Clean up Skip method naming rows->values" on Aug 29, 2022
@emkornfield changed the title from "Minor: [PARQUET] Clean up Skip method naming rows->values" to "MINOR: [PARQUET] Clean up Skip method naming rows->values" on Aug 29, 2022
@emkornfield
Contributor

LGTM, I'll merge in a few hours unless anybody has more feedback.

@emkornfield
Contributor

Actually, it looks like the lint error is related to this change; let's fix that first.

// Skip reading values.
// Returns the number of values skipped.
// This function will NOT skip rows, and repeated fields may have multiple values
// corresponding to the same row.
Member

I find this comment confusing. Let's say I ask to skip 20 values:

  • If I am in a non-repeated field, nothing is skipped and 0 is returned?
  • If I am in a repeated field and the current row has 10 values, 10 is returned? How do I skip several rows?

Contributor Author

I clarified the comment, please take a look and see if it makes it clear. Thanks!
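
As a hedged illustration of the reviewer's second question (how to skip several rows), one option is a row-aware helper layered on the same per-value representation. This is only a sketch under assumptions: `SkipRows` is a hypothetical function, not something this PR adds, and it assumes repetition level 0 marks the start of a row and that the cursor begins at a row boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of this PR): advances *pos past up to
// `num_rows` complete rows and returns how many values were consumed.
// Assumes *pos is at a row boundary when called.
int64_t SkipRows(const std::vector<int16_t>& rep_levels, std::size_t* pos,
                 int64_t num_rows) {
  int64_t values_skipped = 0;
  int64_t rows_skipped = 0;
  while (*pos < rep_levels.size() && rows_skipped < num_rows) {
    // Consume the value that starts the row.
    ++*pos;
    ++values_skipped;
    // Consume the remaining values of the same row (level != 0).
    while (*pos < rep_levels.size() && rep_levels[*pos] != 0) {
      ++*pos;
      ++values_skipped;
    }
    ++rows_skipped;
  }
  return values_skipped;
}
```

For levels `{0, 1, 1, 0, 1, 0}`, skipping 2 rows from the start consumes 5 values and leaves the cursor on the last row. The number of values consumed depends on the data, which is the reason a value-level skip and a row-level skip need different names.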

// non-repeated fields. Note that this method is skipping values and not
// records. This distinction is important for repeated fields, meaning that
// we are not skipping over the values to the next record. We are skipping
// through them. So after the skip the iterator could be in the middle of a
Member

Thanks for the update. This reads better, but I'm afraid the distinction between "skipping over" and "skipping through" is not clear to me. Could you perhaps try to explain better?

Contributor Author

I further clarified the comment with an example, please take a look.
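
The example added to the code comment is not reproduced in this thread. As a rough stand-in, the sketch below models "skipping through" values on a toy reader. `ToyReader`, `AtRowBoundary`, and the clamping at the end of the data are assumptions made for illustration, not the actual Parquet reader.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy reader (assumption, not the Parquet API): one repetition level per
// value, where level 0 marks the first value of a row.
struct ToyReader {
  std::vector<int16_t> rep_levels;
  std::size_t pos = 0;

  // Skips up to n values and returns the number actually skipped.
  // Row boundaries are ignored entirely; clamping at the end of the
  // data is an assumption of this sketch.
  int64_t Skip(int64_t n) {
    const int64_t remaining = static_cast<int64_t>(rep_levels.size() - pos);
    const int64_t skipped = std::max<int64_t>(0, std::min(n, remaining));
    pos += static_cast<std::size_t>(skipped);
    return skipped;
  }

  // True when the next value starts a new row, or the data is exhausted.
  bool AtRowBoundary() const {
    return pos == rep_levels.size() || rep_levels[pos] == 0;
  }
};
```

With levels `{0, 1, 1, 0, 1, 0}`, `Skip(2)` returns 2 and leaves the reader in the middle of the first row, so `AtRowBoundary()` is false. In a non-repeated field every level is 0, so skipping 20 values is the same as skipping 20 rows.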

Member

@pitrou pitrou left a comment

Thanks for the update!

@emkornfield
Contributor

@jonkeane @thisisnic it looks like the R failures are unrelated to this PR, but I wanted to double-check before merging.

@thisisnic
Member

thisisnic commented Aug 31, 2022

@emkornfield You're correct, that one is unrelated.

@pitrou pitrou merged commit c7e58ca into apache:master Aug 31, 2022
@pitrou
Member

pitrou commented Aug 31, 2022

Thank you for contributing this @fatemehp. PR merged now.

@ursabot

ursabot commented Aug 31, 2022

Benchmark runs are scheduled for baseline = 8f071be and contender = c7e58ca. c7e58ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.35% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c7e58cac ec2-t3-xlarge-us-east-2
[Failed] c7e58cac test-mac-arm
[Failed] c7e58cac ursa-i9-9960x
[Finished] c7e58cac ursa-thinkcentre-m75q
[Finished] 8f071be7 ec2-t3-xlarge-us-east-2
[Failed] 8f071be7 test-mac-arm
[Failed] 8f071be7 ursa-i9-9960x
[Finished] 8f071be7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Aug 31, 2022

['Python', 'R'] benchmarks have high level of regressions.
ec2-t3-xlarge-us-east-2
ursa-i9-9960x

anjakefala pushed a commit to anjakefala/arrow that referenced this pull request Aug 31, 2022
Authored-by: Fatemah Panahi <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
@fatemehp fatemehp deleted the skip branch October 26, 2022 16:47
5 participants