From d7c57fbf19efb0ea0e02a2bf45866420faadf29c Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 4 Aug 2022 14:04:57 -0400 Subject: [PATCH 1/4] Improve arrow and parquet READMEs, document parquet feature flags --- arrow/README.md | 43 +++++++++++++++++++++++++++---------------- parquet/README.md | 27 ++++++++++++++++++++++++--- 2 files changed, 51 insertions(+), 19 deletions(-) diff --git a/arrow/README.md b/arrow/README.md index d26a4f410c23..480c0c0bb42b 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -22,7 +22,10 @@ [![crates.io](https://img.shields.io/crates/v/arrow.svg)](https://crates.io/crates/arrow) [![docs.rs](https://img.shields.io/docsrs/arrow.svg)](https://docs.rs/arrow/latest/arrow/) -This crate contains the official Native Rust implementation of [Apache Arrow][arrow] in memory format, governed by the Apache Software Foundation. Additional details can be found on [crates.io](https://crates.io/crates/arrow), [docs.rs](https://docs.rs/arrow/latest/arrow/) and [examples](https://github.com/apache/arrow-rs/tree/master/arrow/examples). +This crate contains the official Native Rust implementation of [Apache Arrow][arrow] in memory format, governed by the Apache Software Foundation. + +The [crate documentation](https://docs.rs/arrow/latest/arrow/) contains examples and full API. +There are several [examples](https://github.com/apache/arrow-rs/tree/master/arrow/examples) to start from as well. ## Rust Version Compatibility @@ -34,18 +37,25 @@ The arrow crate follows the [SemVer standard](https://doc.rust-lang.org/cargo/re However, for historical reasons, this crate uses versions with major numbers greater than `0.x` (e.g. `19.0.0`), unlike many other crates in the Rust ecosystem which spend extended time releasing versions `0.x` to signal planned ongoing API changes. Minor arrow releases contain only compatible changes, while major releases may contain breaking API changes. 
-## Features +## Feature Flags -The arrow crate provides the following features which may be enabled: +The `arrow` crate provides the following features which may be enabled in your `Cargo.toml`: - `csv` (default) - support for reading and writing Arrow arrays to/from csv files - `ipc` (default) - support for the [arrow-flight](https://crates.io/crates/arrow-flight) IPC and wire format - `prettyprint` - support for formatting record batches as textual columns - `js` - support for building arrow for WebAssembly / JavaScript -- `simd` - (_Requires Nightly Rust_) alternate optimized +- `simd` - (_Requires Nightly Rust_) Use alternate hand-optimized implementations of some [compute](https://github.com/apache/arrow-rs/tree/master/arrow/src/compute/kernels) - kernels using explicit SIMD instructions available through [packed_simd_2](https://docs.rs/packed_simd_2/latest/packed_simd_2/). + kernels using explicit SIMD instructions via [packed_simd_2](https://docs.rs/packed_simd_2/latest/packed_simd_2/). - `chrono-tz` - support of parsing timezone using [chrono-tz](https://docs.rs/chrono-tz/0.6.0/chrono_tz/) +- `ffi` - bindings for the Arrow [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) +- `pyarrow` - bindings for pyo3 to call arrow-rs from Python + +## Arrow Feature Status + +The [Apache Arrow Status](https://arrow.apache.org/docs/status.html) page lists which features of Arrow this crate supports. 
+ ## Safety @@ -55,24 +65,24 @@ Arrow seeks to uphold the Rust Soundness Pledge as articulated eloquently [here] Where soundness in turn is defined as: -> Code is unable to trigger undefined behaviour using safe APIs +> Code is unable to trigger undefined behavior using safe APIs -One way to ensure this would be to not use `unsafe`, however, as described in the opening chapter of the [Rustonomicon](https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html) this is not a requirement, and flexibility in this regard is actually one of Rust's great strengths. +One way to ensure this would be to not use `unsafe`, however, as described in the opening chapter of the [Rustonomicon](https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html) this is not a requirement, and flexibility in this regard is one of Rust's great strengths. In particular there are a number of scenarios where `unsafe` is largely unavoidable: * Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc... -* FFI +* FFI * SIMD -Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate. +Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate. 
We have a number of strategies to help reduce this risk: * Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays * Extensive validation logic to safely construct `ArrayData` from untrusted sources * All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour -* We provide a `force_validate` feature that enables additional validation checks for use in test/debug builds +* Use a `force_validate` feature that enables additional validation checks for use in test/debug builds * There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space ## Building for WASM @@ -104,13 +114,14 @@ cargo run --example read_csv ## Performance -Most of the compute kernels benefit a lot from being optimized for a specific CPU target. -This is especially so on x86-64 since without specifying a target the compiler can only assume support for SSE2 vector instructions. -One of the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` can therefore improve performance significantly: +Many compute kernels benefit from being optimized for a specific CPU target. +This is especially so on `x86-64`, as without a specific target, `rustc` only assumes support for `SSE2` vector instructions. + +Using one the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` often improves performance significantly: - `native`: Target the exact features of the cpu that the build is running on. - This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. - - `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or Amd cpu. 
+ This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. + - `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or AMD cpu. - `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors. -These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library. \ No newline at end of file +These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library. diff --git a/parquet/README.md b/parquet/README.md index fbb6e3e1b5d5..ababa51ea62a 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -19,17 +19,38 @@ # Apache Parquet Official Native Rust Implementation -[![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) +[![crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) +[![docs.rs](https://img.shields.io/docsrs/parquet.svg)](https://docs.rs/parquet/latest/parquet/) This crate contains the official Native Rust implementation of [Apache Parquet](https://parquet.apache.org/), which is part of the [Apache Arrow](https://arrow.apache.org/) project. See [crate documentation](https://docs.rs/parquet/latest/parquet/) for examples and the full API. -## Rust Version Compatbility +## Rust Version Compatibility This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler. -## Features +## Versioning / Releases + +The arrow crate follows the [SemVer standard](https://doc.rust-lang.org/cargo/reference/semver.html) defined by Cargo and works well within the Rust crate ecosystem. 
+ +However, for historical reasons, this crate uses versions with major numbers greater than `0.x` (e.g. `19.0.0`), unlike many other crates in the Rust ecosystem which spend extended time releasing versions `0.x` to signal planned ongoing API changes. Minor arrow releases contain only compatible changes, while major releases may contain breaking API changes. + +## Feature Flags + +The `parquet` crate provides the following features which may be enabled in your `Cargo.toml`: + +- `arrow` (default) - support for reading / writing [`arrow`](https://crates.io/crates/arrow) arrays to / from parquet +- `async` - support `async` APIs for reading parquet +- `json` - support for reading / writing `json` data to / from parquet +- `brotli` (default) - support for parquet using `brotli` compression +- `flate2` (default) - support for parquet using `gzip` compression +- `lz4` (default) - support for parquet using `lz4` compression +- `zstd` (default) - support for parquet using `zstd` compression +- `cli` - parquet [CLI tools](https://github.com/apache/arrow-rs/tree/master/parquet/src/bin) +- `experimental` - Experimental APIs which may change, even between minor releases + +## Parquet Feature Status - [x] All encodings supported - [x] All compression codecs supported From 1c60702045dc4cfe5ffca1036b49f1735da0a9e0 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 4 Aug 2022 14:05:36 -0400 Subject: [PATCH 2/4] Fixup --- arrow/README.md | 26 ++++++++++++-------------- arrow/src/compute/README.md | 6 +++--- parquet/README.md | 2 +- 3 files changed, 16 insertions(+), 18 deletions(-) diff --git a/arrow/README.md b/arrow/README.md index 480c0c0bb42b..521c6d381a10 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -56,7 +56,6 @@ The `arrow` crate provides the following features which may be enabled in your ` The [Apache Arrow Status](https://arrow.apache.org/docs/status.html) page lists which features of Arrow this crate supports. 
- ## Safety Arrow seeks to uphold the Rust Soundness Pledge as articulated eloquently [here](https://raphlinus.github.io/rust/2020/01/18/soundness-pledge.html). Specifically: @@ -71,19 +70,19 @@ One way to ensure this would be to not use `unsafe`, however, as described in th In particular there are a number of scenarios where `unsafe` is largely unavoidable: -* Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc... -* FFI -* SIMD +- Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc... +- FFI +- SIMD Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate. 
We have a number of strategies to help reduce this risk: -* Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays -* Extensive validation logic to safely construct `ArrayData` from untrusted sources -* All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour -* Use a `force_validate` feature that enables additional validation checks for use in test/debug builds -* There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space +- Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays +- Extensive validation logic to safely construct `ArrayData` from untrusted sources +- All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour +- Use a `force_validate` feature that enables additional validation checks for use in test/debug builds +- There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space ## Building for WASM @@ -111,7 +110,6 @@ cargo run --example read_csv [arrow]: https://arrow.apache.org/ - ## Performance Many compute kernels benefit from being optimized for a specific CPU target. @@ -119,9 +117,9 @@ This is especially so on `x86-64`, as without a specific target, `rustc` only as Using one the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` often improves performance significantly: - - `native`: Target the exact features of the cpu that the build is running on. - This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. - - `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or AMD cpu. 
- - `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors. +- `native`: Target the exact features of the cpu that the build is running on. + This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. +- `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or AMD cpu. +- `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors. These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library. diff --git a/arrow/src/compute/README.md b/arrow/src/compute/README.md index 761713a531b4..a5d15a83046f 100644 --- a/arrow/src/compute/README.md +++ b/arrow/src/compute/README.md @@ -33,16 +33,16 @@ We use the term "kernel" to refer to particular general operation that contains Types of functions -* Scalar functions: elementwise functions that perform scalar operations in a +- Scalar functions: elementwise functions that perform scalar operations in a vectorized manner. These functions are generally valid for SQL-like context. These are called "scalar" in that the functions executed consider each value in an array independently, and the output array or arrays have the same length as the input arrays. The result for each array cell is generally independent of its position in the array. -* Vector functions, which produce a result whose output is generally dependent +- Vector functions, which produce a result whose output is generally dependent on the entire contents of the input arrays. 
These functions **are generally not valid** for SQL-like processing because the output size may be different than the input size, and the result may change based on the order of the values in the array. This includes things like array subselection, sorting, hashing, and more. -* Scalar aggregate functions of which can be used in a SQL-like context \ No newline at end of file +- Scalar aggregate functions, which can be used in a SQL-like context diff --git a/parquet/README.md b/parquet/README.md index ababa51ea62a..689a664b6326 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -41,7 +41,7 @@ However, for historical reasons, this crate uses versions with major numbers gre The `parquet` crate provides the following features which may be enabled in your `Cargo.toml`: - `arrow` (default) - support for reading / writing [`arrow`](https://crates.io/crates/arrow) arrays to / from parquet -- `async` - support `async` APIs for reading parquet +- `async` - support `async` APIs for reading parquet - `json` - support for reading / writing `json` data to / from parquet - `brotli` (default) - support for parquet using `brotli` compression - `flate2` (default) - support for parquet using `gzip` compression From caff265ce6c328771ee857cf8a25a1b58afbe793 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 4 Aug 2022 14:12:34 -0400 Subject: [PATCH 3/4] Move performance tips to crates.io and leave a link --- arrow/README.md | 40 +++++++++++++++++++++++++++++++--------- arrow/src/lib.rs | 37 ++----------------------------------- 2 files changed, 33 insertions(+), 44 deletions(-) diff --git a/arrow/README.md b/arrow/README.md index 521c6d381a10..6ff42110ef3e 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -110,16 +110,38 @@ cargo run --example read_csv [arrow]: https://arrow.apache.org/ -## Performance +## Performance Tips -Many compute kernels benefit from being optimized for a specific CPU target. 
-This is especially so on `x86-64`, as without a specific target, `rustc` only assumes support for `SSE2` vector instructions. +Arrow aims to be as fast as possible out of the box, whilst not compromising on safety. However, +it relies heavily on LLVM auto-vectorisation to achieve this. Unfortunately the LLVM defaults, +particularly for x86_64, favour portability over performance, and LLVM will consequently avoid +using more recent instructions that would result in errors on older CPUs. -Using one the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` often improves performance significantly: +To address this it is recommended that you override the LLVM defaults either +by setting the `RUSTFLAGS` environment variable, or by setting `rustflags` in your +[Cargo configuration](https://doc.rust-lang.org/cargo/reference/config.html). -- `native`: Target the exact features of the cpu that the build is running on. - This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. -- `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or AMD cpu. -- `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors. +Enable all features supported by the current CPU -These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library. 
+```ignore +RUSTFLAGS="-C target-cpu=native" +``` + +Enable all features supported by the current CPU, and enable full use of AVX512 + +```ignore +RUSTFLAGS="-C target-cpu=native -C target-feature=-prefer-256-bit" +``` + +Enable all features supported by CPUs more recent than haswell (2013) + +```ignore +RUSTFLAGS="-C target-cpu=haswell" +``` + +For a full list of features and target CPUs use + +```shell +$ rustc --print target-cpus +$ rustc --print target-features +``` diff --git a/arrow/src/lib.rs b/arrow/src/lib.rs index e64a5361176d..49de3a0049ed 100644 --- a/arrow/src/lib.rs +++ b/arrow/src/lib.rs @@ -18,41 +18,8 @@ //! A complete, safe, native Rust implementation of [Apache Arrow](https://arrow.apache.org), a cross-language //! development platform for in-memory data. //! -//! # Performance Tips -//! -//! Arrow aims to be as fast as possible out of the box, whilst not compromising on safety. However, -//! it relies heavily on LLVM auto-vectorisation to achieve this. Unfortunately the LLVM defaults, -//! particularly for x86_64, favour portability over performance, and LLVM will consequently avoid -//! using more recent instructions that would result in errors on older CPUs. -//! -//! To address this it is recommended that you specify the override the LLVM defaults either -//! by setting the `RUSTFLAGS` environment variable, or by setting `rustflags` in your -//! [Cargo configuration](https://doc.rust-lang.org/cargo/reference/config.html) -//! -//! Enable all features supported by the current CPU -//! -//! ```ignore -//! RUSTFLAGS="-C target-cpu=native" -//! ``` -//! -//! Enable all features supported by the current CPU, and enable full use of AVX512 -//! -//! ```ignore -//! RUSTFLAGS="-C target-cpu=native -C target-feature=-prefer-256-bit" -//! ``` -//! -//! Enable all features supported by CPUs more recent than haswell (2013) -//! -//! ```ignore -//! RUSTFLAGS="-C target-cpu=haswell" -//! ``` -//! -//! For a full list of features and target CPUs use -//! 
-//! ```ignore -//! $ rustc --print target-cpus -//! $ rustc --print target-features -//! ``` +//! Please see the [arrow crates.io](https://crates.io/crates/arrow) +//! page for feature flags and tips to improve performance //! //! # Columnar Format //! From 175e80e732bd39f68037c3ace93192d81f402062 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 4 Aug 2022 14:13:33 -0400 Subject: [PATCH 4/4] Add link back to crates.io from lib.rs --- arrow/src/lib.rs | 2 +- parquet/src/lib.rs | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/arrow/src/lib.rs b/arrow/src/lib.rs index 49de3a0049ed..04f495dc0819 100644 --- a/arrow/src/lib.rs +++ b/arrow/src/lib.rs @@ -19,7 +19,7 @@ //! development platform for in-memory data. //! //! Please see the [arrow crates.io](https://crates.io/crates/arrow) -//! page for feature flags and tips to improve performance +//! page for feature flags and tips to improve performance. //! //! # Columnar Format //! diff --git a/parquet/src/lib.rs b/parquet/src/lib.rs index d4eaaf41686a..5ee43f8ad6fb 100644 --- a/parquet/src/lib.rs +++ b/parquet/src/lib.rs @@ -19,6 +19,9 @@ //! [Apache Parquet](https://parquet.apache.org/), part of //! the [Apache Arrow](https://arrow.apache.org/) project. //! +//! Please see the [parquet crates.io](https://crates.io/crates/parquet) +//! page for feature flags and tips to improve performance. +//! //! # Getting Started //! Start with some examples: //!
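
Note on the performance tips moved into `arrow/README.md` in patch 3: one way to check whether a given `RUSTFLAGS` setting actually took effect is to compare what the compiler was allowed to emit with what the host CPU reports. The sketch below is illustrative only and uses nothing from the `arrow` crate. The function names `compiled_with_avx2` and `cpu_supports_avx2` are hypothetical, and AVX2 is an arbitrary example feature; it relies only on the standard `cfg!` macro and `std::arch::is_x86_feature_detected!`.

```rust
// Sketch: report whether AVX2 was enabled at compile time and whether the
// host CPU supports it at run time. AVX2 is only an illustrative choice and
// the function names are hypothetical, not part of arrow-rs.

/// True if this build was compiled with the `avx2` target feature enabled,
/// for example via `RUSTFLAGS="-C target-cpu=native"` on an AVX2-capable host.
fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// True if the CPU running this binary reports AVX2 support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_supports_avx2() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// Fallback for non-x86 targets, where the detection macro is unavailable.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_supports_avx2() -> bool {
    false
}

fn main() {
    println!("compiled with avx2: {}", compiled_with_avx2());
    println!("cpu supports avx2:  {}", cpu_supports_avx2());
}
```

If the first line prints `false` while the second prints `true`, the build is probably using the portable defaults described in the README text, and the auto-vectorised code paths will likely not use AVX2 instructions.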