-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17382: [C++] open_dataset doesn't ignore BOM in csv file when header's with quotes #13838
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format?
or
See also: |
|
Hi @pitrou, could you help review this pr? |
@ZMZ91 I could, but the first step would be to get the CI runs fixed. In addition, since this claims to fix a bug, there should be a unit test added somewhere. |
Thanks @pitrou. I've pushed a new commit and still got 2 ci failures. I'm not sure it's related with my change. Could you help check? |
You are right, the failing CI checks are unrelated. |
@ZMZ91 There is already some logic in the CSV reader to skip the BOM: arrow/cpp/src/arrow/csv/reader.cc Lines 109 to 121 in fe9ca42
Instead of adding the same logic in the CSV parser, you should instead try to find out what that logic (in the CSV reader) isn't sufficient here. |
I saw the code in reader.cc @pitrou. But it seems work on the entire csv file while the parser only takes a block of csv data and delimits rows and fields. And the data for each function are read respectively. Correct me if I'm not right. |
That is why your approach is wrong: the BOM is only expected at the beginning of the file, not at the beginning of each CSV cell. |
561065f
to
cd3bb9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZMZ91 ! LGTM, will merge if CI is ok.
CI failures are unrelated. |
Benchmark runs are scheduled for baseline = 8bf60b5 and contender = 01dce6a. 01dce6a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…eader's with quotes (apache#13838) Lead-authored-by: Zimo Zhang <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…eader's with quotes (apache#13838) Lead-authored-by: Zimo Zhang <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
No description provided.