
ARROW-3144: [C++/Python] Move "dictionary" member from DictionaryType to ArrayData to allow for variable dictionaries #4316

Closed · 22 commits

Conversation

wesm (Member) commented May 15, 2019

This patch moves the dictionary member out of DictionaryType to a new
member on the internal ArrayData structure. As a result, serializing
and deserializing schemas requires only a single IPC message, and
schemas have no knowledge of what the dictionary values are.

The objective of this change is to correct a long-standing Arrow C++
design problem with dictionary-encoded arrays where the dictionary
values must be known at schema construction time. This has plagued us
all over the codebase:

  • In reading Parquet files, reading directly to DictionaryArray is not
    simple because each row group may have a different dictionary
  • In IPC streams, delta dictionaries (not yet implemented) would
    invalidate the pre-existing schema, causing subsequent RecordBatch
    objects to be incompatible
  • In Arrow Flight, schema negotiation requires the dictionaries to be
    sent, having possibly unbounded size.
  • It is not possible to have different dictionaries across the chunks
    of a ChunkedArray
  • In CSV files, converting columns to dictionary in parallel would
    require an expensive type unification

The summary of what can be learned from this is: do not put data in
type objects, only metadata. Dictionaries are data, not metadata.
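The principle can be illustrated with a minimal Python sketch (the class names here are illustrative, not the actual Arrow API): the type object records only metadata (index type and value type), while the dictionary values travel with the array data, so two arrays of the same type can carry different dictionaries.

```python
# Minimal sketch of the design principle; these classes are illustrative,
# not the actual Arrow API.

class DictionaryType:
    """Metadata only: which index type and which value type."""
    def __init__(self, index_type, value_type):
        self.index_type = index_type
        self.value_type = value_type

    def __eq__(self, other):
        return (self.index_type == other.index_type and
                self.value_type == other.value_type)

class DictionaryArray:
    """Data: indices plus the dictionary values themselves."""
    def __init__(self, type_, indices, dictionary):
        self.type = type_
        self.indices = indices
        self.dictionary = dictionary

# Two chunks with the SAME type but DIFFERENT dictionaries -- impossible
# when the dictionary was a member of the type object.
t = DictionaryType("int8", "utf8")
chunk1 = DictionaryArray(t, [0, 1, 0], ["foo", "bar"])
chunk2 = DictionaryArray(t, [1, 0], ["baz", "qux"])
assert chunk1.type == chunk2.type
assert chunk1.dictionary != chunk2.dictionary
```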

There are a number of unavoidable API changes (straightforward for
library users to fix) but otherwise no functional difference in the
library.

As you can see, the change is quite complex: significant parts of IPC
read/write, JSON integration testing, and Flight needed to be reworked
to alter the control flow around schema resolution and handling of the
first record batch.

Key APIs changed

  • DictionaryType constructor requires a DataType for the
    dictionary value type instead of the dictionary itself. The
    dictionary factory method is correspondingly changed. The
    dictionary accessor method on DictionaryType is replaced with
    value_type.
  • DictionaryArray constructor and DictionaryArray::FromArrays must
    be passed the dictionary values as an additional argument.
  • DictionaryMemo is exposed in the public API, as it is now required
    for granular interactions with IPC messages via functions such as
    ipc::ReadSchema and ipc::ReadRecordBatch
  • A DictionaryMemo* argument is added to several low-level public
    functions in ipc/writer.h and ipc/reader.h

Some other incidental changes:

  • Because DictionaryType objects could previously be reused in Schemas, such dictionaries were "deduplicated" in IPC messages in passing. The same trick is no longer possible, so dictionary reuse will have to be handled in a different way (I opened ARROW-5340 to investigate)

  • As a result of this, an integration test that featured dictionary reuse has been changed to not reuse dictionaries. Technically this is a regression, but I didn't want to block the patch over it

  • R is added to allow_failures in Travis CI for now

std::shared_ptr<Array> int_array;
ASSERT_OK(int_builder.Finish(&int_array));

DictionaryArray expected(dtype, int_array);
DictionaryArray expected(dictionary(int16(), decimal_type), int_array, fsb_array);
ASSERT_TRUE(expected.Equals(result));
}

// ----------------------------------------------------------------------
// DictionaryArray tests

TEST(TestDictionary, Basics) {
Member:

Should this be moved to type-test.cc?

Member Author:

Yes, will do

Member Author:

done

@@ -765,20 +763,30 @@ static Status TransposeDictIndices(MemoryPool* pool, const ArrayData& in_data,
}

Status DictionaryArray::Transpose(MemoryPool* pool, const std::shared_ptr<DataType>& type,
const std::shared_ptr<Array>& dictionary,
const std::vector<int32_t>& transpose_map,
std::shared_ptr<Array>* out) const {
DCHECK_EQ(type->id(), Type::DICTIONARY);
const auto& out_dict_type = checked_cast<const DictionaryType&>(*type);

Member:

This seems equivalent to Cast(array=Take(indices=transpose_map, values=data_), to=out_index_type). Should we add an explicit output type to TakeOptions?

Member:

Would we want to use that here?

Member Author:

Possibly, but out of scope for this patch

Member Author:

I opened https://issues.apache.org/jira/browse/ARROW-5343 which would be a pre-requisite for this
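For context, the index transposition under discussion can be sketched in Python (illustrative only, not Arrow's implementation): `transpose_map[i]` gives the position of old dictionary entry `i` in the new dictionary, so remapping is exactly a take of the map by the indices.

```python
# Sketch of dictionary index transposition (not Arrow's implementation).
# transpose_map[i] is the slot of old dictionary entry i in the new
# dictionary, so remapping indices is a take of transpose_map by indices.
def transpose_indices(indices, transpose_map):
    return [transpose_map[i] for i in indices]

old_dict = ["b", "a"]
new_dict = ["a", "b"]
transpose_map = [1, 0]   # "b" moves to slot 1, "a" moves to slot 0
indices = [0, 0, 1]      # decodes to b, b, a against old_dict

new_indices = transpose_indices(indices, transpose_map)
# The decoded values are unchanged after transposition.
assert [new_dict[i] for i in new_indices] == [old_dict[i] for i in indices]
```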


// The dictionary for this Array, if any. Only used for dictionary
// type
std::shared_ptr<Array> dictionary;
Member:

Why not use child_data[0]?

Member Author:

This was discussed on the mailing list. I agree with the others (Antoine / Micah) that having an explicit dictionary field is more clear. I added a benchmark to assess whether it causes meaningful overhead, which it does not seem to.

break;
default:
ctx->SetStatus(
Status::Invalid("Invalid index type: ", indices.type()->ToString()));
Status::Invalid("Invalid index type: ", type.index_type()->ToString()));
Member:

Use TypeError here

Member Author:

I prefer to leave the semantics here as unchanged as possible from master

Member Author:

went ahead and changed

@@ -292,7 +291,16 @@ struct TypeTraits<ExtensionType> {
//

template <typename T>
using is_number = std::is_base_of<Number, T>;
Member:

If we're renaming this one, should we do the same for is_signed_integer etc?

Member Author:

Probably. There's some other cleanup to do but not here, I will open a JIRA

return ConcatenateBuffers(Buffers(1, *fixed), pool_, &out_.buffers[1]);

// Two cases: all the dictionaries are the same, or unification is
// required
Member:

👍
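The unification case mentioned in that comment can be sketched as follows (illustrative Python, not Arrow's implementation): build one unified dictionary and, for each input dictionary, a transpose map from its slots into the unified one.

```python
# Sketch of dictionary unification (illustrative, not Arrow's
# implementation): produce one unified dictionary plus a per-input
# transpose map usable to remap that input's indices.
def unify(dictionaries):
    unified, positions = [], {}
    transpose_maps = []
    for d in dictionaries:
        tmap = []
        for value in d:
            if value not in positions:
                positions[value] = len(unified)
                unified.append(value)
            tmap.append(positions[value])
        transpose_maps.append(tmap)
    return unified, transpose_maps

unified, maps = unify([["a", "b"], ["b", "c"]])
assert unified == ["a", "b", "c"]
assert maps == [[0, 1], [1, 2]]
```

If all the input dictionaries are identical, every transpose map is the identity and the remapping step can be skipped, which is the cheap case the comment distinguishes.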


/// Concrete type class for dictionary data
/// \brief Dictionary-encoded value type with data-dependent
/// dictionary
class ARROW_EXPORT DictionaryType : public FixedWidthType {
Member:

After this refactor DictionaryType looks more like a nested type.

Member Author:

It's a synthetic construct in C++ since there is no Dictionary type in the protocol metadata...

pitrou (Member) left a comment:

I have refrained from looking at the IPC implementation details for now.

The one thing I'm worried about is that DictionaryMemo now must be handled by all users of the IPC layer (or their system will be incompatible with dict arrays).
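As a rough mental model of that concern (class and method names here are hypothetical, not the actual arrow::ipc API), a DictionaryMemo tracks which dictionary id each field was assigned and which dictionary values have been received, so schema messages and dictionary batches can be matched up when reading:

```python
# Hypothetical sketch of the bookkeeping a DictionaryMemo performs;
# names are illustrative, not the actual arrow::ipc API.
class DictionaryMemo:
    def __init__(self):
        self.field_to_id = {}       # field name -> dictionary id
        self.id_to_dictionary = {}  # dictionary id -> values

    def assign_id(self, field_name):
        # Ids are handed out in field order when reading the schema.
        return self.field_to_id.setdefault(field_name, len(self.field_to_id))

    def add_dictionary(self, dict_id, values):
        # Later dictionary batches fill in the actual values.
        self.id_to_dictionary[dict_id] = values

memo = DictionaryMemo()
memo.assign_id("dict1_0")            # reading the schema message
memo.add_dictionary(0, ["foo", "bar"])  # reading a dictionary batch
assert memo.id_to_dictionary[memo.field_to_id["dict1_0"]] == ["foo", "bar"]
```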

@@ -364,11 +364,30 @@ static void BM_BuildStringDictionaryArray(
state.SetBytesProcessed(state.iterations() * fodder_size);
}

static void BM_ArrayDataConstructDestruct(
Member:

Is this actually useful?

Member Author:

Not really. I'll remove it. I found that the roundtrip cost of an empty ArrayData is about 60ns

using ArrayType = typename TypeTraits<T>::ArrayType;

template <typename IndexType>
Status Unpack(FunctionContext* ctx, const ArrayData& indices,
Member:

The UnpackHelper implementations could use ArrayDataVisitor from visitor_inline.h.

Member Author:

Yes, they could. This patch removes some prior code duplication but otherwise roughly maintains the status quo

https://issues.apache.org/jira/browse/ARROW-5344

@@ -145,19 +145,21 @@ struct UnpackValues {

Status Visit(const DictionaryType& t) {
std::shared_ptr<Array> taken_indices;
const auto& values = static_cast<const DictionaryArray&>(*params_.values);
Member:

Should use checked_cast.

Member Author:

done

@@ -53,6 +55,9 @@ using ServerWriter = grpc::ServerWriter<T>;
namespace pb = arrow::flight::protocol;

namespace arrow {

using internal::make_unique;
Member:

AFAIR, our make_unique implementation is buggy on MSVC. @bkietz

Member:

accurate; MSVC will occasionally get confused between internal::make_unique and std::make_unique when it is brought in with a using-declaration like this. Referring to it explicitly as internal::make_unique prevents the issue: https://issues.apache.org/jira/browse/ARROW-5121

Member Author:

Removed using statement

private:
Status GetNextDictionary(FlightPayload* payload) {
const auto& it = dictionaries_[dictionary_index_++];
return ipc::internal::GetDictionaryPayload(it.first, it.second, pool_,
Member:

I'm worried about the details of the IPC stream format leaking into all IPC users.
Can we implement an IpcPayloadWriter instead and rely on OpenRecordBatchWriter?

Member Author:

Agreed. Let's address shortly after this is merged

virtual std::shared_ptr<Schema> schema() = 0;

/// \brief Compute FlightPayload containing serialized RecordBatch schema
virtual Status GetSchemaPayload(FlightPayload* payload) = 0;
Member:

I think the user shouldn't have to implement this and deal with the specifics of dictionary transmission over the wire.
cc @lidavidm for opinions.

Member:

(or is it @lihalite ?)

lidavidm:

Either will ping me!

I think in this case, it's OK, in that Flight explicitly states we transmit the schema first, then data; also, if we have a set of reasonable implementations of this interface, users should hopefully not feel a need to implement it themselves unless they did actually want control over the specifics.

Member:

Right, but if you look at the RecordBatchStream implementation, it's doing non-trivial stuff with dictionaries.

Member Author:

I think we'll have to develop some utility code to assist with these details, but it doesn't seem urgent for the moment. If you are creating a custom stream I am not sure right now how to protect the developer from the details of the stream protocol

Member:

I did the necessary work to hide the details in IpcPayloadWriter, and the converse is available for reading in ipc::MessageReader together with RecordBatchStreamReader. There shouldn't be a need to expose this at all.

Member Author:

OK, how would you like to address this, sequence-wise -- is only the implementation here a problem, or is the public API also a problem? I think it would be much easier to fix both after getting this patch merged since we aren't releasing anytime soon

Member:

Ok, let's fix this afterwards.

io::BufferReader buf_reader(serialized_schema);
return ReadSchema(&buf_reader, result);
return ReadSchema(&buf_reader, &in_memo, result);
Member:

Should we test something about in_memo?

Member Author:

I added an assertion that in_memo and out_memo agree about the number of dictionaries

std::shared_ptr<Schema> PyGeneratorFlightDataStream::schema() { return schema_; }

Status PyGeneratorFlightDataStream::GetSchemaPayload(FlightPayload* payload) {
return ipc::internal::GetSchemaPayload(*schema_, &dictionary_memo_,
Member:

I'm not sure I understand. This seems to populate a DictionaryMemo but it's not used afterwards?

Member:

We don't have a test for dictionary arrays in test_flight.py...

lidavidm:

Last I tried, cross-language Flight with dictionaries still didn't work, or even from C++ to C++, so it wouldn't have worked before in Python. https://issues.apache.org/jira/browse/ARROW-5143

Member:

Ah... They work with DoGet but perhaps not with DoPut then?

Member:

@lihalite I'm fixing dict transfer with DoPut as part of ARROW-5113. It may produce conflicts for both you and @wesm :-)

Member Author:

OK, you may want to abort the dict transfer work until this is merged

@@ -64,9 +69,11 @@ class FlightMessageReaderImpl : public FlightMessageReader {
public:
FlightMessageReaderImpl(const FlightDescriptor& descriptor,
std::shared_ptr<Schema> schema,
std::unique_ptr<ipc::DictionaryMemo> dict_memo,
Member:

See #4319 for a clean, dictionary-compatible, re-implementation of FlightMessageReader based on ipc::MessageReader.

Member Author:

that's excellent, thank you

wesm (Member Author) commented May 15, 2019

I spoke with @romainfrancois and he may not be able to help fix the R bindings until next week, so if it doesn't offend anyone greatly I would like to add R to allowed failures in Travis CI until the R bindings can be fixed

wesm (Member Author) commented May 15, 2019

I addressed the comments so far and added R to allow_failures in .travis.yml. I think the only thing left for this to be mergeable is to fix GLib and Ruby -- I will wait for advice from @kou. I also need to merge #3644 and then rebase this on top of it

wesm (Member Author) commented May 16, 2019

Looks like there are still some doxygen issues, and the integration tests are broken (I was hoping I would get lucky there...) so I will address those issues tomorrow (Thursday)

kou (Member) commented May 16, 2019

I'll work on this in a few days.
(I'll push some commits to this branch.)

wesm (Member Author) commented May 16, 2019

thanks @kou! I might take a look at the GLib stuff quickly tomorrow to see how involved the changes are, let me know if you start working on it. I need to rebase after ARROW-835 went in so that will take me a little time

kou (Member) commented May 16, 2019

OK.
I'll leave a comment when I start committing.

kou (Member) commented May 16, 2019

This may be out of scope for this pull request, but I want to share it.

Can we use DictionaryType instead of the dictionary's value type in DictionaryBuilder?
In Arrow GLib, we want to detect the builder type from ArrayBuilder to map it to the corresponding Arrow GLib class: https://github.com/apache/arrow/blob/master/c_glib/arrow-glib/array-builder.cpp#L3970
The current DictionaryBuilder only keeps the value type, so we can't use ArrayBuilder::type() for this.

With this change, can we use DictionaryType in DictionaryBuilder, since DictionaryType no longer needs the dictionary values?

If this is out of scope for this pull request, I'll open a JIRA issue.

pitrou (Member) commented May 16, 2019

@wesm I'm currently trying to rebase this, and also fixing CUDA compile failures.

Edit: done.

pitrou (Member) commented May 16, 2019

I think the integration failure is because the dictionary integration test reuses the same dictionary array for two different fields (with different index types):

{
  "schema": {
    "fields": [
      {
        "name": "dict1_0",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 8
          },
          "isOrdered": false
        }
      },
      {
        "name": "dict1_1",
        "type": {
          "name": "utf8"
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 0,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 32
          },
          "isOrdered": false
        }
      },
      {
        "name": "dict2_0",
        "type": {
          "name": "int",
          "isSigned": true,
          "bitWidth": 64
        },
        "nullable": true,
        "children": [],
        "dictionary": {
          "id": 1,
          "indexType": {
            "name": "int",
            "isSigned": true,
            "bitWidth": 16
          },
          "isOrdered": false
        }
      }
    ]
  },
  "dictionaries": [
    {
...

A complication is with how the dictionary types are serialized into JSON. The "dictionaries" key doesn't allow deserializing the dictionary arrays by themselves, as you have to parse the "schema" key to get the value type...

wesm (Member Author) commented May 16, 2019

@kou the troublesome thing with that is that DictionaryBuilder will automatically promote the index type depending on the size of the dictionary, so you might have initialized the builder with dictionary(int8(), utf8()) but then the result might be dictionary(int32(), utf8()). Maybe I'm wrong -- either way I think we can address this with a new JIRA
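The promotion being described can be sketched with a small hypothetical helper (not an Arrow API): pick the smallest signed index width that can still address every dictionary entry, so the index type is only known once the dictionary has stopped growing.

```python
# Hypothetical helper (not an Arrow API) mirroring the index-type
# promotion described above: the smallest signed integer width able to
# address a dictionary of the given size.
def smallest_index_type(dictionary_size):
    for bits in (8, 16, 32, 64):
        if dictionary_size <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
    raise ValueError("dictionary too large")

# A builder started as dictionary(int8(), utf8()) would end up producing
# dictionary(int16(), utf8()) once the dictionary outgrows 127 entries.
assert smallest_index_type(100) == "int8"
assert smallest_index_type(1000) == "int16"
assert smallest_index_type(100_000) == "int32"
```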

wesm (Member Author) commented May 16, 2019

@pitrou I can look at the integration test failure today -- thank you for rebasing!

pitrou (Member) commented May 16, 2019

So the problem is that DictionaryMemo expects a <dictionary field> <-> <dictionary id> bijection, but that's not true with the JSON integration tests (see example above).

Either we change the JSON integration tests, or we need to change the IPC layer to accommodate this non-bijection.

wesm (Member Author) commented May 16, 2019

Yes, indeed. I'm on it, I will fix.

wesm (Member Author) commented May 16, 2019

OK, I have the integration tests fixed locally. Now multiple fields can reference the same dictionary with no problem. I'll add a unit test for the change I made to DictionaryMemo (since the case of multiple fields referencing the same dictionary only occurs in the integration tests)

wesm (Member Author) commented May 16, 2019

Done. I also added rudimentary arguments to toggle the JS and Java integration testers off, it might be worth looking a bit more holistically at integration test CLI options per ARROW-5066

wesm (Member Author) commented May 16, 2019

Integration tests are failing with this error:

[dictionary: 0], dict1_1: Int(32, true)[dictionary: 0], dict2_0: Int(16, true)[dictionary: 1]>
Incompatible files
Different schemas:
Schema<dict1_0: Int(8, true)[dictionary: 0], dict1_1: Int(32, true)[dictionary: 1], dict2_0: Int(16, true)[dictionary: 2]>
Schema<dict1_0: Int(8, true)[dictionary: 0], dict1_1: Int(32, true)[dictionary: 0], dict2_0: Int(16, true)[dictionary: 1]>
20:23:31.048 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files
java.lang.IllegalArgumentException: Different schemas:
Schema<dict1_0: Int(8, true)[dictionary: 0], dict1_1: Int(32, true)[dictionary: 1], dict2_0: Int(16, true)[dictionary: 2]>
Schema<dict1_0: Int(8, true)[dictionary: 0], dict1_1: Int(32, true)[dictionary: 0], dict2_0: Int(16, true)[dictionary: 1]>
	at org.apache.arrow.vector.util.Validator.compareSchemas(Validator.java:47)
	at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:194)
	at org.apache.arrow.tools.Integration.run(Integration.java:118)
	at org.apache.arrow.tools.Integration.main(Integration.java:69)

With these changes it is now more difficult to refer to dictionaries multiple times in IPC streams because ids are assigned to fields prior to becoming aware of the dictionaries themselves. I opened ARROW-5340 to spend some time on this -- I'm inclined to remove the multiply-referenced dictionary from the integration tests and leave it for follow up work

Another option is to change Java to not perform assertions on the dictionary ids when comparing schemas

kou (Member) commented May 16, 2019

@wesm I've created a new issue for #4316 (comment) : https://issues.apache.org/jira/browse/ARROW-5355
So we can ignore this topic in this pull request.

wesm (Member Author) commented May 16, 2019

I removed the multiply-referenced dictionary from the integration tests. I think the dictionary-encoding stuff in Java will need a little bit of work -- it isn't clear to me, for example, why Field objects in Schema in Java have the dictionary id. In the meantime this will give me some time to sort out how to handle dictionary reuse (which is an optimization basically) in C++

kou (Member) commented May 17, 2019

I'll work on this.

kou (Member) commented May 17, 2019

Done.

wesm (Member Author) commented May 17, 2019

Thanks @kou! I will rebase this to try to get a green build. @pitrou or @kou can you approve the PR?

Rust build is failing, @sunchao @nevi-me @andygrove do you know what is wrong?

wesm added commits May 17, 2019 07:16:

  • More refactoring
  • Continued refactoring
  • Begin removing fixed dictionaries from codebase
  • Fix up Unify implementation and tests
  • More refactoring, consolidation
  • Revert changes to builder_dict.*
wesm (Member Author) commented May 17, 2019

@xhochy if you are available to peek at this or approve, that would be helpful. I'm happy to address post-merge feedback as well

nevi-me (Contributor) commented May 17, 2019

Thanks @kou! I will rebase this to try to get a green build. @pitrou or @kou can you approve the PR?

Rust build is failing, @sunchao @nevi-me @andygrove do you know what is wrong?

Hi @wesm, looks like an issue with a dependency. I'll investigate (https://travis-ci.org/apache/arrow/jobs/533676609#L536)


Corresponding issue in the rustyline repo: kkawakam/rustyline#217.

I'm checking what's changed with the latest nightly. CC @andygrove

wesm (Member Author) commented May 17, 2019

I'm inclined to merge this with the Rust build broken since there are a lot of PRs that need to be rebased... if anyone has any objection or wants to look more at the changes please let me know in the next hour or two

nevi-me (Contributor) commented May 17, 2019

I've logged https://issues.apache.org/jira/browse/ARROW-5360


I'm inclined to merge this with the Rust build broken since there are a lot of PRs that need to be rebased... if anyone has any objection or wants to look more at the changes please let me know in the next hour or two

@wesm I'm happy that we don't stop the train for other languages because of the Rust issue. It's only occurring on nightly, and we at least know what the issue is.

wesm (Member Author) commented May 17, 2019

Merging so we can begin rebasing other PRs

@wesm wesm closed this in e68ca7f May 17, 2019
wesm (Member Author) commented May 17, 2019

thanks @pitrou and @kou for your help getting this done!

@wesm wesm deleted the ARROW-3144 branch May 17, 2019 16:41
romainfrancois added commits to romainfrancois/arrow that referenced this pull request (May 29 – Jun 1, 2019)
romainfrancois added a commit that referenced this pull request Jun 3, 2019
…ROW-3144

At the moment, however, all the `DictionaryMemo` use is internal; it should probably be promoted to arguments (with defaults) of the R functions.

I'll do this here or on another PR if this one is merged first so that `r/` builds again on travis.

This now needs the C++ lib to be up to date (e.g. on my setup I get it through `brew install apache-arrow --HEAD`), and there is no conditional compilation to keep it working with previous versions. Let me know if that's ok.

follow up from #4316

Author: Romain Francois <[email protected]>

Closes #4413 from romainfrancois/ARROW-5361/dictionary and squashes the following commits:

b0de1a8 <Romain Francois> R should pass now
2556c16 <Romain Francois> document()
fa0440f <Romain Francois> update R to changes from ARROW-3144 #4316
TheNeuralBit pushed a commit that referenced this pull request Jun 22, 2019
…DictionaryBuilder

Adds support for building and writing delta dictionaries. Moves the `dictionary` Vector pointer to the Data class, similar to #4316.

Forked from  #4476 since this adds support for delta dictionaries to the DictionaryBuilder. Will rebase this PR after that's merged. All the work is in the last commit, here: b12d842

Author: ptaylor <[email protected]>

Closes #4502 from trxcllnt/js/delta-dictionaries and squashes the following commits:

6a70a25 <ptaylor> make dictionarybuilder and recordbatchwriter support delta dictionaries