ARROW-17382: [C++] open_dataset doesn't ignore BOM in csv file when header's with quotes #13838

Merged: 7 commits merged into apache:master on Sep 13, 2022

Conversation

ZMZ91
Contributor

@ZMZ91 ZMZ91 commented Aug 10, 2022

No description provided.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title to the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@ZMZ91 changed the title from "skip bom in csv parser referred to https://github.com/apache/arrow/pull/11892" to "ARROW-17382: [C++] open_dataset doesn't ignore BOM in csv file when header's with quotes" on Aug 11, 2022
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@ZMZ91
Contributor Author

ZMZ91 commented Aug 18, 2022

Hi @pitrou, could you help review this PR?

@pitrou
Member

pitrou commented Aug 18, 2022

@ZMZ91 I could, but the first step would be to get the CI runs fixed.

In addition, since this claims to fix a bug, there should be a unit test added somewhere.

@ZMZ91
Contributor Author

ZMZ91 commented Aug 19, 2022

Thanks @pitrou. I've pushed a new commit and still get 2 CI failures. I'm not sure they're related to my change. Could you help check?

@pitrou
Member

pitrou commented Aug 19, 2022

You are right, the failing CI checks are unrelated.

@pitrou
Member

pitrou commented Aug 19, 2022

@ZMZ91 There is already some logic in the CSV reader to skip the BOM:

  Result<TransformFlow<std::shared_ptr<Buffer>>> operator()(std::shared_ptr<Buffer> buf) {
    if (buf == nullptr) {
      // EOF
      return TransformFinish();
    }
    int64_t offset = 0;
    if (first_buffer_) {
      ARROW_ASSIGN_OR_RAISE(auto data, util::SkipUTF8BOM(buf->data(), buf->size()));
      offset += data - buf->data();
      DCHECK_GE(offset, 0);
      first_buffer_ = false;
    }

Instead of adding the same logic in the CSV parser, you should try to find out why that logic (in the CSV reader) isn't sufficient here.
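For readers unfamiliar with the check being discussed, here is a minimal standalone sketch of what skipping a UTF-8 BOM amounts to at the byte level. It is not Arrow's `util::SkipUTF8BOM`: the names `SkipBomSketch` and `BomOffsetSketch` are hypothetical, and error handling (for example, for a truncated BOM) is left out on purpose.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical standalone helper, NOT Arrow's util::SkipUTF8BOM.
// Returns a pointer just past a leading UTF-8 BOM (bytes EF BB BF),
// or `data` unchanged when the buffer does not start with one.
inline const std::uint8_t* SkipBomSketch(const std::uint8_t* data,
                                         std::int64_t size) {
  static constexpr std::uint8_t kBom[] = {0xEF, 0xBB, 0xBF};
  assert(size >= 0);
  if (size >= 3 && std::memcmp(data, kBom, sizeof(kBom)) == 0) {
    return data + sizeof(kBom);
  }
  return data;
}

// Corresponds to the `offset += data - buf->data()` step in the quoted
// snippet: the number of leading bytes to drop from the first buffer.
inline std::int64_t BomOffsetSketch(const std::uint8_t* data,
                                    std::int64_t size) {
  return SkipBomSketch(data, size) - data;
}
```

For a buffer holding the BOM followed by `"a"`, `BomOffsetSketch` yields 3; for a buffer that starts directly with `"a"`, it yields 0.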

@ZMZ91
Contributor Author

ZMZ91 commented Aug 22, 2022

I saw the code in reader.cc, @pitrou, but it seems to work on the entire CSV file, while the parser only takes a block of CSV data and delimits rows and fields. The data for each of them is read separately. Correct me if I'm wrong.

@pitrou
Member

pitrou commented Aug 22, 2022

That is why your approach is wrong: the BOM is only expected at the beginning of the file, not at the beginning of each CSV cell.
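The point about position can be sketched with a small stateful filter that strips the BOM from the first chunk of a stream only. This is an illustration under assumptions, not the change that was merged: `FirstChunkBomStripper` and `DemoFirstChunkOnly` are hypothetical names, and the chunks are plain `std::string_view` values rather than Arrow buffers.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical illustration; the class name and interface are assumptions
// and are not part of Arrow.
class FirstChunkBomStripper {
 public:
  // Returns the chunk with a leading UTF-8 BOM removed, but only for the
  // first chunk of the stream. Later chunks pass through unchanged.
  std::string_view operator()(std::string_view chunk) {
    if (first_chunk_) {
      first_chunk_ = false;
      if (chunk.substr(0, kBom.size()) == kBom) {
        chunk.remove_prefix(kBom.size());
      }
    }
    return chunk;
  }

 private:
  static constexpr std::string_view kBom = "\xEF\xBB\xBF";
  bool first_chunk_ = true;
};

// Usage check: only the first chunk loses its BOM.
inline void DemoFirstChunkOnly() {
  FirstChunkBomStripper strip;
  // First chunk: the BOM in front of the quoted header is dropped.
  assert(strip("\xEF\xBB\xBF" "\"a\",\"b\"\n") == "\"a\",\"b\"\n");
  // Later chunk: the same leading bytes are ordinary data and are kept.
  assert(strip("\xEF\xBB\xBF" "1,2\n").size() == 7);
}
```

Keeping the state in one place at the start of the stream means the per-field parsing code never has to special-case those three bytes.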

@pitrou force-pushed the bugfix/skip_BOM_in_csv_parser branch from 561065f to cd3bb9f on September 13, 2022 12:57
Member

@pitrou pitrou left a comment

Thanks @ZMZ91 ! LGTM, will merge if CI is ok.

@pitrou
Member

pitrou commented Sep 13, 2022

CI failures are unrelated.

@pitrou pitrou merged commit 01dce6a into apache:master Sep 13, 2022
@ursabot

ursabot commented Sep 14, 2022

Benchmark runs are scheduled for baseline = 8bf60b5 and contender = 01dce6a. 01dce6a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 01dce6a0 ec2-t3-xlarge-us-east-2
[Finished] 01dce6a0 test-mac-arm
[Failed] 01dce6a0 ursa-i9-9960x
[Finished] 01dce6a0 ursa-thinkcentre-m75q
[Finished] 8bf60b5d ec2-t3-xlarge-us-east-2
[Failed] 8bf60b5d test-mac-arm
[Failed] 8bf60b5d ursa-i9-9960x
[Finished] 8bf60b5d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…eader's with quotes (apache#13838)

Lead-authored-by: Zimo Zhang <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…eader's with quotes (apache#13838)

Lead-authored-by: Zimo Zhang <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>