MINOR: [PARQUET] Clean up Skip method naming rows->values #13997
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
LGTM, I'll merge in a few hours unless anybody has more feedback.
Actually, it looks like the lint error is related to this change; let's fix that first.
cpp/src/parquet/column_reader.h
Outdated
// Skip reading values.
// Returns the number of values skipped.
// This function will NOT skip rows, and repeated fields may have multiple values
// corresponding to the same row.
I find this comment confusing. Let's say I ask to skip 20 values:
- If I am in a non-repeated field, nothing is skipped and 0 is returned?
- If I am in a repeated field and the current row has 10 values, 10 is returned? How do I skip several rows?
I clarified the comment, please take a look and see if it makes it clear. Thanks!
cpp/src/parquet/column_reader.h
Outdated
// non-repeated fields. Note that this method is skipping values and not
// records. This distinction is important for repeated fields, meaning that
// we are not skipping over the values to the next record. We are skipping
// through them. So after the skip the iterator could be in the middle of a
Thanks for the update. This reads better, but I'm afraid the distinction between "skipping over" and "skipping through" is not clear to me. Could you perhaps try to explain better?
I further clarified the comment with an example, please take a look.
Thanks for the update!
@jonkeane @thisisnic it looks like the R failures are unrelated to this PR, but I wanted to double check before merging.
@emkornfield You're correct, that one is unrelated.
Thank you for contributing this @fatemehp. PR merged now.
Benchmark runs are scheduled for baseline = 8f071be and contender = c7e58ca. c7e58ca is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
The Skip method is skipping values and not rows. "values" is used throughout the code interchangeably with levels. Repeated fields may have multiple values, thus the use of "rows" is not accurate because we are not skipping over the values from the repeated field to the next row. Similarly, two other variables total_num_rows_ and seen_num_rows_ actually refer to values and not rows. So I updated them as well. I will add more tests for the Skip method that will clarify this behavior for repeated fields in a separate change. Authored-by: Fatemah Panahi <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
The Skip method is skipping values and not rows. "values" is used throughout the code interchangeably with levels. Repeated fields may have multiple values, thus the use of "rows" is not accurate because we are not skipping over the values from the repeated field to the next row.
Similarly, two other variables total_num_rows_ and seen_num_rows_ actually refer to values and not rows. So I updated them as well.
I will add more tests for the Skip method that will clarify this behavior for repeated fields in a separate change.
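To illustrate the values-versus-rows distinction described above, here is a minimal sketch using a toy model. It is not the parquet::ColumnReader API: ToyRepeatedColumn, SkipValues, AtRecordBoundary, and MakeExample are hypothetical names invented for this example, and the model assumes a flat, non-null repeated field with exactly one value per repetition level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a repeated column. This is NOT parquet::ColumnReader;
// all names here are hypothetical and exist only for illustration.
struct ToyRepeatedColumn {
  std::vector<int16_t> rep_levels;  // 0 marks the first value of a record (row)
  std::vector<int32_t> values;      // assumed: one value per repetition level
  std::size_t pos = 0;              // index of the next value to read

  // Skips up to n values (not rows) and returns how many were skipped.
  // The position may end up in the middle of a record.
  int64_t SkipValues(int64_t n) {
    assert(n >= 0);
    const int64_t remaining = static_cast<int64_t>(values.size() - pos);
    const int64_t skipped = std::min(n, remaining);
    pos += static_cast<std::size_t>(skipped);
    return skipped;
  }

  // True if the next value starts a new record, or the column is exhausted.
  bool AtRecordBoundary() const {
    return pos >= values.size() || rep_levels[pos] == 0;
  }
};

// Three records with values [1, 2, 3], [4, 5], and [6].
ToyRepeatedColumn MakeExample() {
  ToyRepeatedColumn col;
  col.rep_levels = {0, 1, 1, 0, 1, 0};
  col.values = {1, 2, 3, 4, 5, 6};
  return col;
}
```

In this model, calling SkipValues(2) on the example column leaves the position on the value 3, inside the first record, rather than at the start of the second record. A row-oriented skip would instead have to keep advancing until the next repetition level of 0.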