Commit
ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries

This patch moves the dictionary member out of DictionaryType to a new member on the internal ArrayData structure. As a result, serializing and deserializing schemas requires only a single IPC message, and schemas have no knowledge of what the dictionary values are.

The objective of this change is to correct a long-standing Arrow C++ design problem with dictionary-encoded arrays where the dictionary values must be known at schema construction time. This has plagued us all over the codebase:

* In reading Parquet files, reading directly to DictionaryArray is not simple because each row group may have a different dictionary
* In IPC streams, delta dictionaries (not yet implemented) would invalidate the pre-existing schema, causing subsequent RecordBatch objects to be incompatible
* In Arrow Flight, schema negotiation requires the dictionaries to be sent, having possibly unbounded size.
* Not possible to have different dictionaries in a ChunkedArray
* In CSV files, converting columns to dictionary in parallel would require an expensive type unification

The summary of what can be learned from this is: do not put data in type objects, only metadata. Dictionaries are data, not metadata.

There are a number of unavoidable API changes (straightforward for library users to fix) but otherwise no functional difference in the library. As you can see, the change is quite complex, as significant parts of IPC read/write, JSON integration testing, and Flight needed to be reworked to alter the control flow around schema resolution and handling the first record batch.

Key APIs changed

* `DictionaryType` constructor requires a `DataType` for the dictionary value type instead of the dictionary itself. The `dictionary` factory method is correspondingly changed. The `dictionary` accessor method on `DictionaryType` is replaced with `value_type`.
* `DictionaryArray` constructor and `DictionaryArray::FromArrays` must be passed the dictionary values as an additional argument.
* `DictionaryMemo` is exposed in the public API as it is now required for granular interactions with IPC messages with such functions as `ipc::ReadSchema` and `ipc::ReadRecordBatch`
* A `DictionaryMemo*` argument is added to several low-level public functions in `ipc/writer.h` and `ipc/reader.h`

Some other incidental changes:

* Because DictionaryType objects could previously be reused in Schemas, such dictionaries would be "deduplicated" in IPC messages in passing. This is no longer possible by the same trick, so dictionary reuse will have to be handled in a different way (I opened ARROW-5340 to investigate)
* As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it
* R is added to allow_failures in Travis CI for now

Author: Wes McKinney
Author: Kouhei Sutou
Author: Antoine Pitrou

Closes #4316 from wesm/ARROW-3144 and squashes the following commits:

9f1ccfb <Kouhei Sutou> Follow DictionaryArray changes
89e274d <Wes McKinney> Do not reuse dictionaries in integration tests for now until more follow on work around this can be done
f62819f <Wes McKinney> Support many fields referencing the same dictionary, fix integration tests
37e82b4 <Antoine Pitrou> Fix CUDA and Duration issues
0370750 <Wes McKinney> Add R to allow_failures for now
bd04774 <Wes McKinney> Code review comments
b1cc52e <Wes McKinney> Fix rest of Python unit tests, fix some incorrect code comments
f1178b2 <Wes McKinney> Fix all but 3 Python unit tests
ab7fc17 <Wes McKinney> Fix up Cython compilation, haven't fixed unit tests yet though
6ce51ef <Wes McKinney> Get everything compiling again
e23c578 <Wes McKinney> Fix Parquet tests
c73b216 <Wes McKinney> arrow-tests all passing again, huzzah!
04d40e8 <Wes McKinney> Flat dictionary IPC test passing now
481f316 <Wes McKinney> Get JSON integration tests passing again
77a43dc <Wes McKinney> Fix pretty_print-test
f4ada66 <Wes McKinney> array-tests compilers again
8276dce <Wes McKinney> libarrow compiles again
8ea0e26 <Wes McKinney> Refactor IPC read path for new paradigm
a1afe87 <Wes McKinney> More refactoring to have correct logic in IPC paths, not yet done
aed0430 <Wes McKinney> More refactoring, regularize some type names
6bd72f9 <Wes McKinney> Start porting changes
24f99f1 <Wes McKinney> Initial boilerplate
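
The design rule stated in the message ("do not put data in type objects, only metadata") can be illustrated with a small toy model. This is a sketch in plain Python, not the Arrow API: `DictType`, `DictArray`, and `encode` are hypothetical names invented for illustration. It shows how a type that records only the index and value types lets two chunks carry different dictionaries while still sharing one type, which is the ChunkedArray and Parquet row-group situation listed above.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class DictType:
    """Hypothetical stand-in for a dictionary type: metadata only, no values."""
    index_type: str
    value_type: str


@dataclass
class DictArray:
    """Hypothetical stand-in for a dictionary-encoded array.

    The dictionary values live next to the indices (the data),
    not inside the type object.
    """
    type: DictType
    indices: List[int]
    dictionary: List[str]

    def decode(self) -> List[str]:
        # Map each index back to its dictionary value.
        return [self.dictionary[i] for i in self.indices]


def encode(values: Sequence[str], dict_type: DictType) -> DictArray:
    """Dictionary-encode values, building a dictionary local to this chunk."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return DictArray(dict_type, indices, dictionary)


# One type object describes every chunk, whatever its dictionary holds.
dict_type = DictType(index_type="int32", value_type="string")

# Two chunks (for example, two row groups) with different dictionaries.
chunk_a = encode(["x", "y", "x"], dict_type)
chunk_b = encode(["z", "x"], dict_type)

# The types compare equal even though the dictionaries differ.
same_type = chunk_a.type == chunk_b.type
```

Had the dictionary values been stored on `DictType`, `chunk_a` and `chunk_b` would have had unequal types and could not be treated as chunks of one column without first unifying their dictionaries.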