arrow::util::pretty::pretty_format_batches missing #769

yuribudilov · 2021-07-22T07:19:47Z

Hello
My apologies for a novice Arrow question.
I am not able to compile the code sample because the "pretty" module is missing from arrow::util.
Using Rust 1.53.0 Stable.
Toml is:
[package]
name = "test_arrow"
version = "0.1.0"
edition = "2018"
[dependencies]
arrow = "5.0.0"
datafusion = "4.0.0"
tokio = "1.8.2"

// compilation can not find this:
use arrow::util::pretty::print_batches;
// also this fails to compile:
let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?;

Error: cannot find 'pretty' in util.

What am I doing wrong please?

thank you very much

alamb · 2021-07-22T15:03:11Z

Hi @yuribudilov -- you need to enable the "prettyprint" feature for arrow.

So instead of

arrow = "5.0.0"

try using

arrow = { version = "5.0", features = ["prettyprint"] }

yuribudilov · 2021-07-22T23:08:34Z

thank you.

One compilation error is now gone, but it has been replaced by two other compilation errors: one step forward, two steps back.

Repro:

On https://github.com/apache/arrow-datafusion there is a Rust code sample (quoted below) that does not compile:

use arrow::record_batch::RecordBatch;
use arrow::util::pretty::print_batches;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // register the table
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;

    // create a plan to run a SQL query
    let df = ctx.sql("SELECT a, MIN(b) FROM example GROUP BY a LIMIT 100")?;

    // execute and print results
    let results: Vec<RecordBatch> = df.collect().await?; // error 1 here
    print_batches(&results)?; // error 2 here
    Ok(())
}

The TOML on the link only shows one line: datafusion = "4.0.0-SNAPSHOT"

This TOML does not work because there is no arrow and no tokio dependency in TOML.
So I added those myself.

Here is what I have now, which still does not work:
[package]
name = "test_arrow"
version = "0.1.0"
edition = "2018"
[dependencies]

# arrow = "5.0.0" (superseded by the arrow line below)

datafusion = "4.0.0"
tokio = "1.8.2"
arrow = { version = "5.0", features = ["prettyprint"] }

I still have 2 compilation errors based on above:

15 | let results: Vec<RecordBatch> = df.collect().await?; // error 1
| ^^^^^^^^^^^^^^^^^^^ expected struct arrow::record_batch::RecordBatch, found a different struct arrow::record_batch::RecordBatch
|
= note: expected struct Vec<arrow::record_batch::RecordBatch> (struct arrow::record_batch::RecordBatch)
found struct Vec<arrow::record_batch::RecordBatch> (struct arrow::record_batch::RecordBatch)
= note: perhaps two different versions of crate arrow are being used?
note: return type inferred to be Vec<arrow::record_batch::RecordBatch> here
--> src\main.rs:9:5
|
9 | ctx.register_csv("example", "tests/example.csv", CsvReadOptions::new())?;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0277]: ? couldn't convert the error to DataFusionError
--> src\main.rs:16:28
|
16 | print_batches(&results)?; // error 2
| ^ the trait From<arrow::error::ArrowError> is not implemented for DataFusionError
|
= note: the question mark operation (?) implicitly performs a conversion on the error value using the From trait
= help: the following implementations were found:
<DataFusionError as From<arrow::error::ArrowError>>
<DataFusionError as From<parquet::errors::ParquetError>>
<DataFusionError as From<sqlparser::parser::ParserError>>
<DataFusionError as From<std::io::Error>>
= note: required by from

error: aborting due to 2 previous errors

The first one can be "covered up" by letting Rust infer the data type, like so (which is very odd, given that it infers the same Vec!):
let results = df.collect().await?;

The second error indicates that something is wrong with the TOML documentation:
print_batches(&results)?;

Can you please point me to documentation on how to use this product from Rust?
Many thanks.

yuribudilov · 2021-07-22T23:31:11Z

OK, I fixed it, thanks to Rust compiler (what a fantastic language!!)

The Rust errors "suggested" that different versions of the arrow crate were being used.

So I tried using an earlier arrow version in TOML:

arrow = { version = "4.4.0", features = ["prettyprint", "default"] }

This compiles and builds and runs correctly !! Phew! Happy days.

May I humbly suggest there is likely to be a buglet in either datafusion 4.0.0 or arrow 5.0, or both?

May I also suggest updating the datafusion documentation to list more complete TOML dependencies? Those of us who are new to arrow/datafusion and would like to learn could use more help, and reliable, accessible documentation is all we have.

Many thanks for reading thus far, it looks like a fantastic product you have been building!
Please feel free to close this issue.

alamb · 2021-07-23T13:46:11Z

> arrow = { version = "4.4.0", features = ["prettyprint", "default"] }

Yes, this is the version of arrow that the (released) datafusion version 4.0 works with. 👍

The fact that we haven't released a new version of datafusion to crates.io that works with arrow 5 is a problem which we should rectify.
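
For anyone landing here later, here is a minimal sketch of why the mismatch surfaces as those two errors. It uses only the standard library. The modules arrow_v4 and arrow_v5, and everything inside them, are hypothetical stand-ins for two versions of the same crate in one build; they are not the real arrow or datafusion APIs. The point is only that identically named types from two versions are distinct types to the compiler, so a From impl written against one does not apply to the other.

```rust
// Two modules stand in for two versions of the same crate in one build.
// All names are illustrative stand-ins, not the real arrow/datafusion items.
mod arrow_v4 {
    #[derive(Debug, PartialEq)]
    pub struct ArrowError(pub String);

    // Stand-in for a fallible helper such as a pretty-printer.
    pub fn print_batches(fail: bool) -> Result<(), ArrowError> {
        if fail {
            Err(ArrowError("v4 failure".to_string()))
        } else {
            Ok(())
        }
    }
}

mod arrow_v5 {
    // Same name and shape as arrow_v4::ArrowError, but a distinct type.
    #[derive(Debug, PartialEq)]
    pub struct ArrowError(pub String);

    pub fn print_batches(fail: bool) -> Result<(), ArrowError> {
        if fail {
            Err(ArrowError("v5 failure".to_string()))
        } else {
            Ok(())
        }
    }
}

// Stand-in for a downstream crate's error type, compiled against "v4" only.
#[derive(Debug, PartialEq)]
pub enum DataFusionError {
    Arrow(arrow_v4::ArrowError),
}

impl From<arrow_v4::ArrowError> for DataFusionError {
    fn from(e: arrow_v4::ArrowError) -> Self {
        DataFusionError::Arrow(e)
    }
}

// Compiles: `?` finds From<arrow_v4::ArrowError> for DataFusionError.
pub fn run_with_matching_version(fail: bool) -> Result<(), DataFusionError> {
    arrow_v4::print_batches(fail)?;
    Ok(())
}

// Uncommenting this function reproduces the shape of E0277:
// From<arrow_v5::ArrowError> is not implemented for DataFusionError.
// pub fn run_with_mismatched_version(fail: bool) -> Result<(), DataFusionError> {
//     arrow_v5::print_batches(fail)?;
//     Ok(())
// }

fn main() {
    assert!(run_with_matching_version(false).is_ok());
    println!("{:?}", run_with_matching_version(true));
    // The "v5" module works on its own; it just cannot be mixed with v4-based APIs.
    println!("{:?}", arrow_v5::print_batches(true));
}
```

Under that reading, removing the explicit type annotation on results compiled because the inferred type is the RecordBatch from the arrow version datafusion was built against, not the one named by the arrow 5 import.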

DataFusion (at least on master) also includes a "public export" of its arrow dependency, so perhaps we should change the example from

use arrow::record_batch::RecordBatch;
use arrow::util::pretty::print_batches;

to

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::arrow::util::pretty::print_batches;

> Many thanks for reading thus far, it looks like a fantastic product you have been building!

Thanks! Kudos go to the whole team (there are many people whose work goes into making it)

alamb · 2021-07-23T20:07:47Z

I made #772 to try and improve the docs a little bit

yuribudilov · 2021-07-24T00:01:12Z

I appreciate your support, wonderful and quick!
FWIW: I have used Apache Spark heavily for a couple of years, and I am of the opinion that a Rust implementation of the great "Spark concept" should be the new ideal for the future. Most of the Spark issues I faced were related to the JVM: OO memory overheads, vast memory bloat, and many job crashes due to memory exhaustion and GC-related issues. The performance was often far from great, too. All of those issues should, in theory, disappear when Rust/Arrow/Datafusion/Ballista is running the Spark show. Bring it on. Thank you.

alamb · 2021-07-24T10:58:57Z

> All of those issues should, in theory, disappear when Rust/Arrow/Datafusion/Ballista is running the Spark show

Indeed! I think this is @andygrove's vision as well.

Thanks for the kind words.

alamb mentioned this issue Jul 23, 2021

Update docs to use vendored version of arrow #772

Merged

alamb closed this as completed in #772 Jul 24, 2021
