Add CSV Decoder::capacity (#3674) #3677

Merged 4 commits into apache:master on Feb 10, 2023

Conversation

@tustvold (Contributor) commented Feb 8, 2023

Which issue does this PR close?

Closes #3674

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?
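
The title describes the user-facing change: a new public capacity method on the CSV Decoder. Below is a minimal sketch of how a caller might use it; the two-column schema is hypothetical, and the builder calls follow recent arrow-csv releases, so the exact signatures are assumptions rather than the code in this PR:

    use std::sync::Arc;

    use arrow_csv::ReaderBuilder;
    use arrow_schema::{ArrowError, DataType, Field, Schema};

    fn main() -> Result<(), ArrowError> {
        // Hypothetical two-column schema, for illustration only
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, false),
            Field::new("b", DataType::Utf8, false),
        ]));

        // Build a push-based Decoder with a batch size of 2
        let mut decoder = ReaderBuilder::new(schema)
            .with_batch_size(2)
            .build_decoder();

        // Push bytes into the decoder
        decoder.decode(b"foo,bar\nbaz,qux\n")?;

        // capacity() reaching 0 means batch_size rows are buffered,
        // so a batch can be flushed without waiting for more input
        if decoder.capacity() == 0 {
            let batch = decoder.flush()?.expect("two rows buffered");
            assert_eq!(batch.num_rows(), 2);
        }
        Ok(())
    }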

@tustvold force-pushed the add-csv-has-capacity branch from 9e229f6 to 73a7644 on February 8, 2023 18:54
@tustvold marked this pull request as ready for review on February 9, 2023 17:21
@alamb (Contributor) left a comment

I think the code looks good, but I struggle to understand the test. I apologize if I am missing something obvious.

@@ -438,10 +438,10 @@ impl<R: BufRead> BufReader<R> {
         loop {
             let buf = self.reader.fill_buf()?;
             let decoded = self.decoder.decode(buf)?;
-            if decoded == 0 {
             self.reader.consume(decoded);
+            if decoded == 0 || self.decoder.capacity() == 0 {
@alamb (Contributor) commented:
this has the effect of potentially creating smaller output batches, right? Basically the reader will now yield up any batches it already has buffered.

@tustvold (Contributor, author) replied Feb 10, 2023:
Nope, it will yield only if it has fully read batch_size rows. I.e., it will yield once it has read enough data, instead of looping around and calling fill_buf again to potentially read more data that it is just going to ignore (as it has already read batch_size rows).
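
A toy mock (not the real arrow-csv types; MockDecoder is invented for illustration) makes the control flow concrete: once batch_size rows are buffered, capacity() hits zero and the loop yields instead of calling fill_buf again:

    // MockDecoder is a stand-in invented for illustration; it buffers
    // one "row" per input byte, up to batch_size rows
    struct MockDecoder {
        batch_size: usize,
        buffered: usize,
    }

    impl MockDecoder {
        /// Consume up to the remaining capacity
        fn decode(&mut self, buf: &[u8]) -> usize {
            let take = buf.len().min(self.capacity());
            self.buffered += take;
            take
        }

        /// Rows that can still be buffered before a flush is needed
        fn capacity(&self) -> usize {
            self.batch_size - self.buffered
        }
    }

    fn main() {
        let mut decoder = MockDecoder { batch_size: 3, buffered: 0 };
        let chunks: &[&[u8]] = &[b"aaa", b"bbb"]; // simulated fill_buf results
        let mut fills = 0;

        for buf in chunks {
            fills += 1;
            let decoded = decoder.decode(buf);
            // The check this PR adds: stop once the decoder is full,
            // rather than fetching more input it would ignore
            if decoded == 0 || decoder.capacity() == 0 {
                break;
            }
        }
        // The first chunk already fills batch_size, so only one fill occurs
        assert_eq!(fills, 1);
    }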

@alamb (Contributor) suggested change:

-            if decoded == 0 || self.decoder.capacity() == 0 {
+            // yield only if it has fully read the batch size number of rows,
+            // instead of looping around and calling fill_buf again to potentially
+            // read more data that it is just going to ignore (as it has already read batch_size rows)
+            if decoded == 0 || self.decoder.capacity() == 0 {

@@ -2269,4 +2274,73 @@ mod tests {
             "Csv error: Encountered invalid UTF-8 data for line 1 and field 1",
         );
     }

@alamb (Contributor) commented:
I am sorry if I am misunderstanding the rationale for this change, but I don't understand how this test mimics what is described in #3674 -- namely that data that has been read can be decoded and read out as record batches prior to sending the end of stream.

I wonder, can we write a test like:

  1. Set the batch size to 5 (bigger than the output data)
  2. Send in "foo,bar\nbaz,foo\n"
  3. Read those two records <-- as I understand it, this is what cannot be done today
  4. Feed in "a,b\nc,d" + EOS
  5. Read the final two records

I struggle to map this test to that use case -- it would be hard for me to understand the purpose of this test if I saw it without context. E.g. why is it important that there were two calls to fill(0)?
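
For reference, a hedged sketch of roughly that test, written against the push-based Decoder API rather than BufReader (the reply below explains why fill_buf cannot express this); the schema, the builder calls, and the proposed_test name are illustrative assumptions:

    use std::sync::Arc;

    use arrow_csv::ReaderBuilder;
    use arrow_schema::{ArrowError, DataType, Field, Schema};

    fn proposed_test() -> Result<(), ArrowError> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Utf8, false),
            Field::new("b", DataType::Utf8, false),
        ]));

        // 1. Batch size bigger than the output data
        let mut decoder = ReaderBuilder::new(schema)
            .with_batch_size(5)
            .build_decoder();

        // 2. Send in the first two records
        decoder.decode(b"foo,bar\nbaz,foo\n")?;

        // 3. Read those two records before end-of-stream
        let batch = decoder.flush()?.expect("two rows buffered");
        assert_eq!(batch.num_rows(), 2);

        // 4. Feed in two more records, then signal EOS with an empty slice
        decoder.decode(b"a,b\nc,d\n")?;
        decoder.decode(&[])?;

        // 5. Read the final two records
        let batch = decoder.flush()?.expect("two rows buffered");
        assert_eq!(batch.num_rows(), 2);
        Ok(())
    }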

@tustvold (Contributor, author) replied Feb 10, 2023:

> namely that data that has been read can be decoded and read out as record batches prior to sending the end of stream.

Because that isn't the issue. The problem was that it would try to fill the buffer again, even if it had already read the batch_size number of rows.

Without the change in this PR you have:

    fill_sizes: [23, 3, 3, 0, 0]

In the case of a streaming decoder, this could cause it to wait for more input when it doesn't actually need any, as it already has the requisite number of rows.
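
A sketch of the streaming pattern this unblocks, assuming a Decoder built elsewhere; next_batch and next_chunk are hypothetical names introduced for illustration:

    use arrow_array::RecordBatch;
    use arrow_csv::reader::Decoder;
    use arrow_schema::ArrowError;

    /// `next_chunk` yields the next buffered bytes from some stream,
    /// or an empty Vec at end of input
    fn next_batch(
        decoder: &mut Decoder,
        mut next_chunk: impl FnMut() -> Vec<u8>,
    ) -> Result<Option<RecordBatch>, ArrowError> {
        loop {
            // The check capacity() enables: batch_size rows are already
            // buffered, so yield now instead of awaiting more input
            if decoder.capacity() == 0 {
                break;
            }
            let chunk = next_chunk();
            if chunk.is_empty() {
                break; // end of stream
            }
            // Simplification: assumes each chunk fits in the remaining
            // capacity; a real caller must retain and re-feed any bytes
            // decode() did not consume
            decoder.decode(&chunk)?;
        }
        decoder.flush()
    }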

> I wonder, can we write a test like

This has never been supported, and realistically can't be supported, as BufRead::fill_buf will only return an empty slice on EOS; otherwise it will block. There is no "fill_buf only if data is available" API that I am aware of.

Edit: it would appear there is an experimental API - rust-lang/rust#86423

Edit 2: this might actually be impossible in general - https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong

@alamb (Contributor) replied Feb 10, 2023:
Thank you for the explanation.

arrow-csv/src/lib.rs: outdated review comment, resolved

@tustvold merged commit 3e08a75 into apache:master on Feb 10, 2023
@ursabot commented Feb 10, 2023:

Benchmark runs are scheduled for baseline = 5b1821e and contender = 3e08a75. 3e08a75 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

Conbench compare runs links:
  [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
  [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
  [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
  [Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q

Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close: Arrow-csv reader cannot produce RecordBatch even if the bytes are necessary (#3674)

3 participants