Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Commit

Permalink
Updated guide.
Browse files Browse the repository at this point in the history
  • Loading branch information
jorgecarleitao committed Aug 31, 2021
1 parent cd9600e commit c7a47e1
Show file tree
Hide file tree
Showing 2 changed files with 56 additions and 50 deletions.
92 changes: 48 additions & 44 deletions guide/src/high_level.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ let a: &dyn Array = &a;
### Downcast and `as_any`

Given a trait object `array: &dyn Array`, we know its physical type via
`array.data_type().to_physical_type()`, which we use to downcast the array
`PhysicalType: array.data_type().to_physical_type()`, which we use to downcast the array
to its concrete type:

```rust
Expand All @@ -114,53 +114,56 @@ to its concrete type:
# fn main() {
let a = PrimitiveArray::<i32>::from(&[Some(1), None]);
let array = &array as &dyn Array;
// ...
let physical_type: PhysicalType = array.data_type().to_physical_type();
# }
```

There is a one to one relationship between each variant of `PhysicalType` (an enum) and
an each implementation of `Array` (a struct):

| `PhysicalType` | `Array` |
|-------------------|------------------------|
| `Primitive(_)` | `PrimitiveArray<_>` |
| `Binary` | `BinaryArray<i32>` |
| `LargeBinary` | `BinaryArray<i64>` |
| `Utf8` | `Utf8Array<i32>` |
| `LargeUtf8` | `Utf8Array<i64>` |
| `List` | `ListArray<i32>` |
| `LargeList` | `ListArray<i64>` |
| `FixedSizeBinary` | `FixedSizeBinaryArray` |
| `FixedSizeList` | `FixedSizeListArray` |
| `Struct` | `StructArray` |
| `Union` | `UnionArray` |
| `Dictionary(_)` | `DictionaryArray<_>` |

where `_` represents each of the variants (e.g. `PrimitiveType::Int32 <-> i32`).

match array.data_type().to_physical_type() {
PhysicalType::Int32 => {
let array = array.as_any().downcast_ref::<PrimitiveArray<i32>>().unwrap();
let values: &[i32] = array.values();
assert_eq!(values, &[1, 0]);
In this context, a common idiom in using `Array` as a trait object is as follows:

```rust
use arrow2::datatypes::{PhysicalType, PrimitiveType};
use arrow2::array::{Array, PrimitiveArray};

fn float_operator(array: &dyn Array) -> Result<Box<dyn Array>, String> {
match array.data_type().to_physical_type() {
PhysicalType::Primitive(PrimitiveType::Float32) => {
let array = array.as_any().downcast_ref::<PrimitiveArray<f32>>().unwrap();
// let array = f32-specific operator
let array = array.clone();
Ok(Box::new(array))
}
PhysicalType::Primitive(PrimitiveType::Float64) => {
let array = array.as_any().downcast_ref::<PrimitiveArray<f64>>().unwrap();
// let array = f64-specific operator
let array = array.clone();
Ok(Box::new(array))
}
_ => Err("This operator is only valid for float point arrays".to_string()),
}
_ => todo!()
}
# }
```

There is a many-to-one relationship between `DataType` and an Array (i.e. a physical representation). The relationship is the following:

| `PhysicalType` | `PhysicalType` |
|----------------------|---------------------------|
| `UInt8` | `PrimitiveArray<u8>` |
| `UInt16` | `PrimitiveArray<u16>` |
| `UInt32` | `PrimitiveArray<u32>` |
| `UInt64` | `PrimitiveArray<u64>` |
| `Int8` | `PrimitiveArray<i8>` |
| `Int16` | `PrimitiveArray<i16>` |
| `Int32` | `PrimitiveArray<i32>` |
| `Int64` | `PrimitiveArray<i64>` |
| `Int128` | `PrimitiveArray<i128>` |
| `Float32` | `PrimitiveArray<f32>` |
| `Float64` | `PrimitiveArray<f64>` |
| `DaysMs` | `PrimitiveArray<days_ms>` |
| `Binary` | `BinaryArray<i32>` |
| `LargeBinary` | `BinaryArray<i64>` |
| `Utf8` | `Utf8Array<i32>` |
| `LargeUtf8` | `Utf8Array<i64>` |
| `List` | `ListArray<i32>` |
| `LargeList` | `ListArray<i64>` |
| `FixedSizeBinary` | `FixedSizeBinaryArray` |
| `FixedSizeList` | `FixedSizeListArray` |
| `Struct` | `StructArray` |
| `Union` | `UnionArray` |
| `Dictionary(UInt8)` | `DictionaryArray<u8>` |
| `Dictionary(UInt16)` | `DictionaryArray<u16>` |
| `Dictionary(UInt32)` | `DictionaryArray<u32>` |
| `Dictionary(UInt64)` | `DictionaryArray<u64>` |
| `Dictionary(Int8)` | `DictionaryArray<i8>` |
| `Dictionary(Int16)` | `DictionaryArray<i16>` |
| `Dictionary(Int32)` | `DictionaryArray<i32>` |
| `Dictionary(Int64)` | `DictionaryArray<i64>` |

## From Iterator

In the examples above, we've introduced how to create an array from an iterator.
Expand Down Expand Up @@ -218,7 +221,8 @@ bitwise operations, it is often more performant to operate on chunks of bits ins
## Vectorized operations

One of the main advantages of the arrow format and its memory layout is that
it often enables SIMD. For example, an unary operation `op` on a `PrimitiveArray` is likely auto-vectorized on the following code:
it often enables SIMD. For example, an unary operation `op` on a `PrimitiveArray`
likely emits SIMD instructions on the following code:

```rust
# use arrow2::buffer::Buffer;
Expand Down
14 changes: 8 additions & 6 deletions guide/src/metadata.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,16 +12,18 @@ semantical types defined in Arrow.
In Arrow2, logical types are declared as variants of the `enum` `arrow2::datatypes::DataType`.
For example, `DataType::Int32` represents a signed integer of 32 bits.

Each logical type has an associated in-memory physical representation and is associated to specific
semantics. For example, `Date32` has the same in-memory representation as `Int32`, but the value
represents the number of days since UNIX epoch.
Each `DataType` has an associated `enum PhysicalType` (many-to-one) representing the
particular in-memory representation, and is associated to specific semantics.
For example, both `DataType::Date32` and `DataType::Int32` have the same `PhysicalType`
(`PhysicalType::Primitive(PrimitiveType::Int32)`) but `Date32` represents the number of
days since UNIX epoch.

Logical types are metadata: they annotate arrays with extra information about in-memory data.
Logical types are metadata: they annotate physical types with extra information about data.

## `Field` (column metadata)

Besides logical types, the arrow format supports other relevant metadata to the format. All this
information is stored in `arrow2::datatypes::Field`.
Besides logical types, the arrow format supports other relevant metadata to the format.
All this information is stored in `arrow2::datatypes::Field`.

A `Field` is arrow's metadata associated to a column in the context of a columnar format.
It has a name, a logical type `DataType`, whether the column is nullable, etc.
Expand Down

0 comments on commit c7a47e1

Please sign in to comment.