Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16773: [Docs][Format] Document Run-Length encoding in Arrow columnar format #13333

Closed
wants to merge 12 commits into from
63 changes: 63 additions & 0 deletions docs/source/format/Columnar.rst
Original file line number Diff line number Diff line change
Expand Up @@ -765,6 +765,68 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.

.. _run-length-encoded-layout:

Run-Length Encoded Layout
-------------------------

Run-Length is a data representation that represents data as sequences of the
same value, called runs. Each run is represented as a value, and an integer
describing how often this value is repeated.

Any array can be run-length encoded. A run-length encoded array has no buffers
by itself, but has two child arrays. The first one holds a signed integer
called a "run end" for each run. The run ends array can hold either 16, 32, or
64-bit integers. The actual values of each run are held
the second child array.
pitrou marked this conversation as resolved.
Show resolved Hide resolved

The values in the first child array represent the length of each run. They do
not hold the length of the respective run directly, but the accumulated length
of all runs from the first to the current one, i.e. the logical index where the
current run ends. This allows relatively efficient random access from a logical
index using binary search. The length of an individual run can be determined by
subtracting two adjacent values.

A run must have have a length of at least 1. This means the values in the
run ends array all positive and in strictly ascending order. A run end cannot be
null.
alamb marked this conversation as resolved.
Show resolved Hide resolved

As an example, you could have the following data: ::

type: Float32
[1.0, 1.0, 1.0, 1.0, null, null, 2.0]

In Run-length-encoded form, this could appear as:

::

* Length: 7, Null count: 2
Copy link
Contributor

@tustvold tustvold Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formulation of null count is a little surprising, it implies that when computing the parent ArrayData from a set of children, it must iterate the null mask in conjunction with the run ends. This is also inconsistent with how null counts work for other nested types.

I've asked for clarification of this on the mailing list - https://lists.apache.org/thread/4x14b0h3fcfwzk68jpoq3n5xvr241qz5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it should actually say 0, since the array has no validity bitmap

(I tried to reply via the reply button in the ml web ui but that does not seem to have come through)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the array has no validity bitmap

Is this a general restriction, or is it just in this case there is no validity mask?

Copy link
Contributor Author

@zagto zagto Jan 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a general restriction (at least I think so, and that's how the code works). The idea so far is that Null is just one of the possible values for a run.

The is somewhat consistent with how union types work.

(if we were to allow the RLE array parent to have an additional null mask, the null count field would represent that - there seems to be a generall assumption in Arrow code that a non-zero (or array length for the NULL) null count means the presence of the standard null mask)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general restriction

We should probably document that if so, personally it seems a shame as it complicates things like nullif kernels which now have to explicitly handle the case of RLE data, whereas normally they don't need concern themselves with anything but the array data.

union types work

Yeah, union types are a bit of a special snowflake though 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is actually documented by saying there are no buffers, but that's a bit indirect, so guess mentioning it explicitly won't hurt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted a PR for this here, please take a look:
#33831

* Children arrays:

* run ends (Int32):
* Length: 3, Null count: 0
* Validity bitmap buffer: Not required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Validity bitmap buffer: Not required
* Validity bitmap buffer: Not present (not allowed)

See comment above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it would be best to be explicit here that no validity bitmap is allowed to avoid ambiguity

* Values buffer

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 4 | 6 | 7 | unspecified (padding) |

* values (Float32):
* Length: 3, Null count: 1
* Validity bitmap buffer:

| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00000101 | 0 (padding) |

* Values buffer

| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 6-63 |
|-------------|-------------|-------------|-----------------------|
| 1.0 | unspecified | 2.0 | unspecified (padding) |


Buffer Listing for Each Layout
------------------------------

Expand All @@ -784,6 +846,7 @@ of memory buffers for each layout.
"Dense Union",type ids,offsets,
"Null",,,
"Dictionary-encoded",validity,data (indices),
"Run-length encoded",,,

Logical Types
=============
Expand Down