ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries #4316
Conversation
cpp/src/arrow/array-dict-test.cc (outdated)

```diff
 std::shared_ptr<Array> int_array;
 ASSERT_OK(int_builder.Finish(&int_array));

-DictionaryArray expected(dtype, int_array);
+DictionaryArray expected(dictionary(int16(), decimal_type), int_array, fsb_array);
 ASSERT_TRUE(expected.Equals(result));
 }

 // ----------------------------------------------------------------------
 // DictionaryArray tests

 TEST(TestDictionary, Basics) {
```
Should this be moved to type-test.cc?
Yes, will do
done
```diff
@@ -765,20 +763,30 @@ static Status TransposeDictIndices(MemoryPool* pool, const ArrayData& in_data,
 }

 Status DictionaryArray::Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
                                   const std::shared_ptr<Array>& dictionary,
                                   const std::vector<int32_t>& transpose_map,
                                   std::shared_ptr<Array>* out) const {
   DCHECK_EQ(type->id(), Type::DICTIONARY);
   const auto& out_dict_type = checked_cast<const DictionaryType&>(*type);
```
This seems equivalent to Cast(array=Take(indices=transpose_map, values=data_), to=out_index_type). Should we add an explicit output type to TakeOptions?
Would we want to use that here?
Possibly, but out of scope for this patch
I opened https://issues.apache.org/jira/browse/ARROW-5343 which would be a pre-requisite for this
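For readers following the thread, here is a minimal conceptual sketch of what the index transposition does (a plain loop with illustrative names, ignoring nulls and output index width; it is not the actual TransposeDictIndices kernel):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Remap dictionary indices through a transpose map: transpose_map[old_index]
// gives the position of the same dictionary value in the target dictionary.
std::vector<int32_t> TransposeIndices(const std::vector<int32_t>& old_indices,
                                      const std::vector<int32_t>& transpose_map) {
  std::vector<int32_t> new_indices(old_indices.size());
  for (std::size_t i = 0; i < old_indices.size(); ++i) {
    new_indices[i] = transpose_map[old_indices[i]];
  }
  return new_indices;
}
```

Seen this way, the loop is roughly a Take over the transpose map at the old indices followed by a Cast of the result to the output index type, which is the equivalence suggested above.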
```cpp
  // The dictionary for this Array, if any. Only used for dictionary
  // type
  std::shared_ptr<Array> dictionary;
```
Why not use child_data[0]?
This was discussed on the mailing list. I agree with the others (Antoine / Micah) that having an explicit dictionary field is clearer. I added a benchmark to assess whether it causes meaningful overhead, which it does not seem to.
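A rough sketch of the data-structure shape being discussed, with the member from the diff above; the surrounding struct is abbreviated and illustrative, not the full ArrayData definition:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

namespace arrow {

class Array;
class Buffer;
class DataType;

// Abbreviated, illustrative view of ArrayData after this patch: the
// dictionary values travel with the array data instead of the type.
struct ArrayData {
  std::shared_ptr<DataType> type;
  int64_t length = 0;
  std::vector<std::shared_ptr<Buffer>> buffers;
  std::vector<std::shared_ptr<ArrayData>> child_data;

  // The dictionary for this Array, if any. Only used for dictionary type.
  // The alternative debated above was to reuse child_data[0]; the explicit
  // member was preferred for clarity.
  std::shared_ptr<Array> dictionary;
};

}  // namespace arrow
```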
```diff
     break;
   default:
     ctx->SetStatus(
-        Status::Invalid("Invalid index type: ", indices.type()->ToString()));
+        Status::Invalid("Invalid index type: ", type.index_type()->ToString()));
```
Use TypeError here.
I prefer to leave the semantics here as unchanged as possible from master
went ahead and changed
```diff
@@ -292,7 +291,16 @@ struct TypeTraits<ExtensionType> {
 //

 template <typename T>
 using is_number = std::is_base_of<Number, T>;
```
If we're renaming this one, should we do the same for is_signed_integer etc.?
Probably. There's some other cleanup to do but not here, I will open a JIRA
```cpp
  return ConcatenateBuffers(Buffers(1, *fixed), pool_, &out_.buffers[1]);

  // Two cases: all the dictionaries are the same, or unification is
  // required
```
👍
```diff
-/// Concrete type class for dictionary data
+/// \brief Dictionary-encoded value type with data-dependent
+/// dictionary
 class ARROW_EXPORT DictionaryType : public FixedWidthType {
```
After this refactor DictionaryType looks more like a nested type.
It's a synthetic construct in C++ since there is no Dictionary type in the protocol metadata...
I have refrained from looking at the IPC implementation details for now. The one thing I'm worried about is that DictionaryMemo now must be handled by all users of the IPC layer (or their system will be incompatible with dict arrays).
```diff
@@ -364,11 +364,30 @@ static void BM_BuildStringDictionaryArray(
   state.SetBytesProcessed(state.iterations() * fodder_size);
 }

 static void BM_ArrayDataConstructDestruct(
```
Is this actually useful?
Not really. I'll remove it. I found that the roundtrip cost of an empty ArrayData is about 60ns.
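For reference, a micro-benchmark of that round trip might look roughly like the following; this is a hypothetical reconstruction, not the removed BM_ArrayDataConstructDestruct code:

```cpp
#include <benchmark/benchmark.h>

#include <arrow/api.h>

// Construct and destroy an empty ArrayData in a loop to measure the
// shared_ptr round-trip overhead mentioned above.
static void BM_ArrayDataConstructDestruct(benchmark::State& state) {
  for (auto _ : state) {
    auto data = arrow::ArrayData::Make(arrow::int32(), /*length=*/0);
    benchmark::DoNotOptimize(data);
  }
}

BENCHMARK(BM_ArrayDataConstructDestruct);
BENCHMARK_MAIN();
```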
```cpp
  using ArrayType = typename TypeTraits<T>::ArrayType;

  template <typename IndexType>
  Status Unpack(FunctionContext* ctx, const ArrayData& indices,
```
The UnpackHelper implementations could use ArrayDataVisitor from visitor_inline.h.
Yes, they could. This patch removes some prior code duplication but otherwise roughly maintains the status quo
```diff
@@ -145,19 +145,21 @@ struct UnpackValues {

   Status Visit(const DictionaryType& t) {
     std::shared_ptr<Array> taken_indices;
     const auto& values = static_cast<const DictionaryArray&>(*params_.values);
```
Should use checked_cast.
done
cpp/src/arrow/flight/server.cc (outdated)
```diff
@@ -53,6 +55,9 @@ using ServerWriter = grpc::ServerWriter<T>;
 namespace pb = arrow::flight::protocol;

 namespace arrow {

 using internal::make_unique;
```
AFAIR, our make_unique implementation is buggy on MSVC. @bkietz
Accurate; MSVC will occasionally get confused between internal::make_unique and std::make_unique when it is using-ed like this. Referring to it with internal::make_unique prevents the issue: https://issues.apache.org/jira/browse/ARROW-5121
Removed using statement.
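A small self-contained illustration of the lookup pattern in question; the Widget type is made up, and only the shape of the calls mirrors the server.cc discussion:

```cpp
#include <memory>
#include <utility>

namespace arrow {
namespace internal {

// Stand-in for Arrow's pre-C++14 make_unique helper.
template <typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
  return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}

}  // namespace internal

struct Widget {
  explicit Widget(int v) : value(v) {}
  int value;
};

void WithUsingDeclaration() {
  using internal::make_unique;
  // On MSVC, this unqualified call can reportedly be confused with
  // std::make_unique (see ARROW-5121 referenced above).
  auto w = make_unique<Widget>(1);
  (void)w;
}

void FullyQualified() {
  // Spelling out internal::make_unique avoids the ambiguity.
  auto w = internal::make_unique<Widget>(2);
  (void)w;
}

}  // namespace arrow
```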
```cpp
 private:
  Status GetNextDictionary(FlightPayload* payload) {
    const auto& it = dictionaries_[dictionary_index_++];
    return ipc::internal::GetDictionaryPayload(it.first, it.second, pool_,
```
I'm worried about the details of the IPC stream format leaking into all IPC users. Can we implement an IpcPayloadWriter instead and rely on OpenRecordBatchWriter?
Agreed. Let's address shortly after this is merged
```cpp
  virtual std::shared_ptr<Schema> schema() = 0;

  /// \brief Compute FlightPayload containing serialized RecordBatch schema
  virtual Status GetSchemaPayload(FlightPayload* payload) = 0;
```
I think the user shouldn't have to implement this and deal with the specifics of dictionary transmission over the wire.
cc @lidavidm for opinions.
(or is it @lihalite ?)
Either will ping me!
I think in this case, it's OK, in that Flight explicitly states we transmit the schema first, then data; also, if we have a set of reasonable implementations of this interface, users should hopefully not feel a need to implement it themselves unless they did actually want control over the specifics.
Right, but if you look at the RecordBatchStream implementation, it's doing non-trivial stuff with dictionaries.
I think we'll have to develop some utility code to assist with these details, but it doesn't seem urgent for the moment. If you are creating a custom stream I am not sure right now how to protect the developer from the details of the stream protocol
I did the necessary work to hide the details in IpcPayloadWriter, and the converse is available for reading in ipc::MessageReader together with RecordBatchStreamReader. There shouldn't be a need to expose this at all.
OK, how would you like to address this, sequence-wise -- is only the implementation here a problem, or is the public API also a problem? I think it would be much easier to fix both after getting this patch merged since we aren't releasing anytime soon
Ok, let's fix this afterwards.
cpp/src/arrow/ipc/read-write-test.cc (outdated)
```diff
   io::BufferReader buf_reader(serialized_schema);
-  return ReadSchema(&buf_reader, result);
+  return ReadSchema(&buf_reader, &in_memo, result);
```
Should we test something about in_memo?
I added an assertion that in_memo and out_memo agree about the number of dictionaries.
```cpp
std::shared_ptr<Schema> PyGeneratorFlightDataStream::schema() { return schema_; }

Status PyGeneratorFlightDataStream::GetSchemaPayload(FlightPayload* payload) {
  return ipc::internal::GetSchemaPayload(*schema_, &dictionary_memo_,
```
I'm not sure I understand. This seems to populate a DictionaryMemo but it's not used afterwards?
We don't have a test for dictionary arrays in test_flight.py...
Last I tried, cross-language Flight with dictionaries still didn't work, or even from C++ to C++, so it wouldn't have worked before in Python. https://issues.apache.org/jira/browse/ARROW-5143
Ah... They work with DoGet but perhaps not with DoPut then?
@lihalite I'm fixing dict transfer with DoPut as part of ARROW-5113. It may produce conflicts for both you and @wesm :-)
OK, you may want to abort the dict transfer work until this is merged
```diff
@@ -64,9 +69,11 @@ class FlightMessageReaderImpl : public FlightMessageReader {
  public:
  FlightMessageReaderImpl(const FlightDescriptor& descriptor,
                          std::shared_ptr<Schema> schema,
                          std::unique_ptr<ipc::DictionaryMemo> dict_memo,
```
See #4319 for a clean, dictionary-compatible re-implementation of FlightMessageReader based on ipc::MessageReader.
that's excellent, thank you
I spoke with @romainfrancois and he may not be able to help fix the R bindings until next week, so if it doesn't offend anyone greatly I would add R to allowed failures in Travis CI until R can be fixed.
Looks like there are still some doxygen issues, and the integration tests are broken (I was hoping I would get lucky there...) so I will address those issues tomorrow (Thursday).
I'll work on this in a few days.
OK.
This may be an out-of-scope topic for this pull request but I want to share. Can we use […]? With this change, can we use […]? If this is out of scope of this pull request, I'll open a JIRA issue.
@wesm I'm currently trying to rebase this, and also fixing CUDA compile failures. Edit: done.
I think the integration failure is because the dictionary integration test reuses the same dictionary array for two different fields (with different index types):

```json
{
  "schema": {
    "fields": [
      {
        "name": "dict1_0",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 8
          },
          "isOrdered": false
        }
      },
      {
        "name": "dict1_1",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 32
          },
          "isOrdered": false
        }
      },
      {
        "name": "dict2_0",
        "type": {
          "name": "int",
          "isSigned": true,
          "bitWidth": 64
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 1,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 16
          },
          "isOrdered": false
        }
      }
    ]
  },
  "dictionaries": [
    {
      ...
```

A complication is with how the dictionary types are serialized into JSON. The "dictionaries" key doesn't allow unserializing the dictionary arrays by themselves, as you have to parse the "schema" key to get the value type...
@kou the troublesome thing with that is the […]
@pitrou I can look at the integration test failure today -- thank you for rebasing!
So the problem is that […]. Either we change the JSON integration tests, or we need to change the IPC layer to accommodate this non-bijection.
Yes, indeed. I'm on it, I will fix.
OK, I have the integration tests fixed locally. Now multiple fields can reference the same dictionary with no problem. I'll add a unit test for the change I made to […]
Done. I also added rudimentary arguments to toggle the JS and Java integration testers off; it might be worth looking a bit more holistically at integration test CLI options per ARROW-5066.
Integration tests are failing with this error: […]
With these changes it is now more difficult to refer to dictionaries multiple times in IPC streams because ids are assigned to fields prior to becoming aware of the dictionaries themselves. I opened ARROW-5340 to spend some time on this -- I'm inclined to remove the multiply-referenced dictionary from the integration tests and leave it for follow-up work. Another option is to change Java to not perform assertions on the dictionary ids when comparing schemas.
@wesm I've created a new issue for #4316 (comment): https://issues.apache.org/jira/browse/ARROW-5355
I removed the multiply-referenced dictionary from the integration tests. I think the dictionary-encoding stuff in Java will need a little bit of work -- it isn't clear to me, for example, why […]
I'll work on this.
Done.
Squashed commits:
- More refactoring
- Continued refactoring
- Begin removing fixed dictionaries from codebase
- Fix up Unify implementation and tests
- More refactoring, consolidation
- Revert changes to builder_dict.*
- …low on work around this can be done
@xhochy if you are available to peek at this or approve, that would be helpful. I'm happy to address post-merge feedback as well.
Hi @wesm, looks like an issue with a dependency. I'll investigate (https://travis-ci.org/apache/arrow/jobs/533676609#L536).
Corresponding issue in the rustyline repo: kkawakam/rustyline#217. I'm checking what's changed with the latest nightly. CC @andygrove
I'm inclined to merge this with the Rust build broken since there are a lot of PRs that need to be rebased... if anyone has any objection or wants to look more at the changes please let me know in the next hour or two.
I've logged https://issues.apache.org/jira/browse/ARROW-5360
@wesm I'm happy that we don't stop the train for other languages because of the Rust issue. It's only occurring on nightly, and we at least know what the issue is.
Merging so we can begin rebasing other PRs.
…ROW-3144

At the moment however, all the `DictionaryMemo` use is internal; it should probably be promoted to arguments (with defaults) to the R functions. I'll do this here or on another PR if this one is merged first so that `r/` builds again on travis. This now needs the C++ lib up to date, e.g. on my setup I get it through `brew install apache-arrow --HEAD`, and there is no conditional compiling so that it still works with previous versions. Let me know if that's ok.

Follow up from #4316.

Author: Romain Francois <[email protected]>

Closes #4413 from romainfrancois/ARROW-5361/dictionary and squashes the following commits:

b0de1a8 <Romain Francois> R should pass now
2556c16 <Romain Francois> document()
fa0440f <Romain Francois> update R to changes from ARROW-3144 #4316
…DictionaryBuilder

Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to #4316. Forked from #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842

Author: ptaylor <[email protected]>

Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits:

6a70a25 <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries
This patch moves the dictionary member out of DictionaryType to a new
member on the internal ArrayData structure. As a result, serializing
and deserializing schemas requires only a single IPC message, and
schemas have no knowledge of what the dictionary values are.
The objective of this change is to correct a long-standing Arrow C++
design problem with dictionary-encoded arrays where the dictionary
values must be known at schema construction time. This has plagued us
all over the codebase:

- Reading dictionary-encoded data in pieces (for example, Parquet row groups) is not simple because each row group may have a different dictionary
- A dictionary that is not known up front, or that changes, would invalidate the pre-existing schema, causing subsequent RecordBatch objects to be incompatible
- The complete dictionary must be available before a schema can be sent, having possibly unbounded size
- Combining data with different dictionaries can require an expensive type unification
The summary of what can be learned from this is: do not put data in
type objects, only metadata. Dictionaries are data, not metadata.
There are a number of unavoidable API changes (straightforward for
library users to fix) but otherwise no functional difference in the
library.
As you can see the change is quite complex as significant parts of IPC
read/write, JSON integration testing, and Flight needed to be reworked
to alter the control flow around schema resolution and handling the
first record batch.
Key APIs changed:

- The `DictionaryType` constructor requires a `DataType` for the dictionary value type instead of the dictionary itself. The `dictionary` factory method is correspondingly changed. The `dictionary` accessor method on `DictionaryType` is replaced with `value_type`.
- The `DictionaryArray` constructor and `DictionaryArray::FromArrays` must be passed the dictionary values as an additional argument.
- `DictionaryMemo` is exposed in the public API as it is now required for granular interactions with IPC messages with such functions as `ipc::ReadSchema` and `ipc::ReadRecordBatch`.
- A `DictionaryMemo*` argument is added to several low-level public functions in `ipc/writer.h` and `ipc/reader.h`.
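To make the list above concrete, here is a minimal sketch of building a dictionary-encoded array with the post-change API. The builder calls and the Status-returning FromArrays overload are assumptions based on the bullets above rather than code lifted from the patch:

```cpp
#include <memory>

#include <arrow/api.h>

// Build indices and dictionary values separately, then combine them.
arrow::Status BuildDictExample(std::shared_ptr<arrow::Array>* out) {
  // The type now records only the index type and value type, not the values.
  auto dict_type = arrow::dictionary(arrow::int16(), arrow::utf8());

  // Dictionary values ("the data") live in an ordinary array...
  arrow::StringBuilder values_builder;
  ARROW_RETURN_NOT_OK(values_builder.AppendValues({"foo", "bar", "baz"}));
  std::shared_ptr<arrow::Array> dict_values;
  ARROW_RETURN_NOT_OK(values_builder.Finish(&dict_values));

  // ...as do the indices into that dictionary.
  arrow::Int16Builder indices_builder;
  ARROW_RETURN_NOT_OK(indices_builder.AppendValues({0, 1, 2, 0, 1}));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(indices_builder.Finish(&indices));

  // FromArrays now takes the dictionary values as an extra argument.
  return arrow::DictionaryArray::FromArrays(dict_type, indices, dict_values, out);
}
```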
Some other incidental changes:
- Because DictionaryType objects could previously be reused in Schemas, such dictionaries would be "deduplicated" in IPC messages in passing. This is no longer possible by the same trick, so dictionary reuse will have to be handled in a different way (I opened ARROW-5340 to investigate)
- As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it
- R is added to allow_failures in Travis CI for now