-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial write support #122
Conversation
Writing string/boolean arrays will be done after this since the PR is already quite large |
Sorry for such a large PR 😅 I wanted an end to end example, and felt it wouldn't be proper without at least integer support. I didn't want to waste time implementing RLEv1 writing, so the majority of time was spent on trying to understand the RLEv2 write logic in the Java implementation and translate it to code that I think might be easier to understand (I didn't want to just do a one-to-one port from the C++/Java implementation to here). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this PR; it's really extensive! This review primarily focuses on the API related to IO operations.
writer: W, | ||
schema: SchemaRef, | ||
batch_size: usize, | ||
stripe_byte_size: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think using stripe_size
or stripe_length
for short?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like keeping the unit as part of the name, to be unambiguous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM
* Zigzag and varint encoders * Short repeat and direct RLEv2 writers * Minor refactoring * Signed msb encoder * Refacftor RLE delta+patched base to be more functional * Minor refactoring * Byte RLE writer * Remove unused error types * Initial version of ORC writer, supporting only float * Int8 array write support * Integer RLEv2 Delta writing * Minor optimization * Abstract bits_used functionality to common NInt function * Remove overcomplicated AbsVarint code, replace with i64/u64 in delta encoding * Minor fixes * Initial RLEv2 encoder base * Remove u64 impl of NInt in favour of generic to determine sign * Simplify getting RLE reader * Fix percentile bit calculation encoding/decoding * Patched base writing * Support writing int arrays * Multi stripe write test case * Reduce duplication for primitive ColumnStripeEncoders * Introduce EstimateMemory trait to standardize * Comment * Remove need for seek from writer + minor PR comments * Rename s_type to kind * Deduplicate get_closest_fixed_bits * Fix comments * Switch arrow writer tests to be in-memory instead of writing to disk * Fix writing arrays with nulls * Add license to new files
Support for writing the following Arrow array types to an ORC file:
Broadly speaking, this writer will delegate to internal types to encode the actual primitive values into a run encoded form (RLEv2 for Int, Byte RLE for bytes, simple copy for floats) into an in-memory buffer until the stripe size exceeds a configured limit in which case the stripe is flushed to the actual writer (e.g. to disk) and the internal buffers cleared.
Overall architecture
src/arrow_writer.rs
contains the public interface for this writer, where anArrowWriterBuilder
is used to specify config options to build anArrowWriter
which allows writing a batch at a time, see example usage:datafusion-orc/src/arrow_writer.rs
Lines 275 to 280 in 5afcfe2
This writer will automatically flush a stripe when it is estimated to have exceeded the configured stripe size (default 64mb). Internally, this writer has a
StripeWriter
to handle encoding the actual stripe in-memory:datafusion-orc/src/writer/stripe.rs
Lines 40 to 47 in 5afcfe2
This
StripeWriter
has an array ofColumnStripeEncoder
s which are responsible for the encoding of individual columns within the recordbatch/array. Note the flat structure, this will need to be revisited when accounting for nested types (map, struct, etc.).datafusion-orc/src/writer/column.rs
Lines 27 to 40 in 5afcfe2
Currently only one implementation for
ColumnStripeEncoder
which isPrimitiveStripeEncoder
:datafusion-orc/src/writer/column.rs
Lines 66 to 76 in 5afcfe2
This handles encoding the actual primitive values themselves, which is then handled by the
PrimitiveValueEncoder
trait, and also optionally handles encoding a present stream.datafusion-orc/src/writer/column.rs
Lines 42 to 62 in 5afcfe2
Encoding
There are currently three implementations.
Float has the simplest as ORC specifies no special encoding for it, it simply copies. Refer to
FloatValueEncoder
:datafusion-orc/src/writer/column.rs
Lines 175 to 184 in 5afcfe2
Bytes have some simple logic for run length encoding, see
ByteRleWriter
:datafusion-orc/src/reader/decode/byte_rle.rs
Lines 32 to 44 in 5afcfe2
Integers have the most complicated version due to the complex sub0encoding types (as well as the actual logic to determine which sub0encoding to use), see
RleWriterV2
:datafusion-orc/src/reader/decode/rle_v2/mod.rs
Lines 208 to 215 in 5afcfe2
For both Byte and Int RLEv2 encodings, the logic has been translated from the Java implementation of ORC in order to capture details the spec doesn't specify as well as use the logic it uses to determine which sub-encodings to use. This logic was refactored to make the logic easier to understand (at least to me).
Other changes
div_ceil
)u64
in integer run length encoding code, as this is incorrect since the Java implementation doesn't support unsigned integers (everything is an i64 there)ByteRleIter
toByteRleReader
for consistency with other reader interfaces (such as RLEv2)