Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Initial work for decimal type in Java/JNI [skip ci] #6514

Merged
merged 32 commits into from
Oct 27, 2020

Conversation

nartal1
Copy link
Member

@nartal1 nartal1 commented Oct 13, 2020

This partially addresses tasks from issue #6515.
The goal of this PR is to add DType class and make the necessary changes in Java/JNI code to support scale in DType.

@nartal1 nartal1 requested a review from a team as a code owner October 13, 2020 08:35
@nartal1 nartal1 marked this pull request as draft October 13, 2020 08:36
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is still WIP, but I wanted to call out a few things.

java/src/main/java/ai/rapids/cudf/DataType.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/DataType.java Outdated Show resolved Hide resolved
@@ -38,10 +38,12 @@

protected final OffHeapState offHeap;
protected final DType type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we contain DataType instead of DType + scale in here? And we need implement getDecimal. I think we can change it in subsequent PR.

@@ -3729,6 +3767,14 @@ public static ColumnVector fromDoubles(double... values) {
return build(DType.FLOAT64, values.length, (b) -> b.appendArray(values));
}

/* public static ColumnVector fromDecimal32(int... values) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

involve this commented codes later till we are fully ready?

@nartal1
Copy link
Member Author

nartal1 commented Oct 14, 2020

I think the code in Builder and ColumnBuilder also needs to be changed from DType to DataType.

@revans2
Copy link
Contributor

revans2 commented Oct 14, 2020

I think the code in Builder and ColumnBuilder also needs to be changed from DType to DataType.

If it is simple to do yes, but if it is not simple we should look at only supporting the newer one that supports nested types and deprecating the older one. No point in maintaing two APIs that do the same thing.

@nartal1
Copy link
Member Author

nartal1 commented Oct 14, 2020

@revans2 @jlowe It would be great if you could take a look and let me know if this is something in the right direction. As of now I am making changes and making sure we don't break any current tests. The refactor is in different places so would be helpful if I could get early feedback on whether to continue or if we need to pivot.

java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/native/src/ColumnVectorJni.cpp Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/DType.java Outdated Show resolved Hide resolved
@razajafri razajafri added the Java Affects Java cuDF API. label Oct 15, 2020
@@ -3088,7 +3114,7 @@ public long getColumnViewAddress() {

@Override
public ColumnViewAccess getChildColumnViewAccess(int childIndex) {
if (!type.isNestedType()) {
if (!type.typeId.isNestedType()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would parrot @jlowe 's comment here to add the isNestedType to DataType to reduce code changes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
@@ -1239,48 +1281,48 @@ public Builder appendUTF8String(byte[] value, int offset, int length) {

public Builder appendArray(byte... values) {
assert (values.length + currentIndex) <= rows;
assert type == DType.INT8 || type == DType.UINT8 || type == DType.BOOL8;
data.setBytes(currentIndex * type.sizeInBytes, values, 0, values.length);
assert type.typeId == DType.INT8 || type.typeId == DType.UINT8 || type.typeId == DType.BOOL8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we are making a change this would be a good time to clean this up and have a helper method in DataType like the one we have e.g. 'isTimestamp` etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

currentIndex += values.length;
return this;
}

public Builder appendArray(short... values) {
assert type == DType.INT16 || type == DType.UINT16;
assert type.typeId == DType.INT16 || type.typeId == DType.UINT16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here. can we have a helper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

/**
* Get the value at index.
*/
public final short getShort(long index) {
assert type == DType.INT16 || type == DType.UINT16 : type + " is not stored as a short.";
assert type.typeId == DType.INT16 || type.typeId == DType.UINT16 : type + " is not stored as a short.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here. have a helper

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -273,28 +286,28 @@ private void assertsForGet(long index) {
* Get the value at index.
*/
public byte getByte(long index) {
assert type == DType.INT8 || type == DType.UINT8 || type == DType.BOOL8 : type +
assert type.typeId == DType.INT8 || type.typeId == DType.UINT8 || type.typeId == DType.BOOL8 : type +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here. have a helper

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

try {
cudf::jni::auto_set_device(env);
cudf::column_view *column = reinterpret_cast<cudf::column_view *>(handle);
return static_cast<jint>(column->type().scale());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is type checking needed before calling scale() on type_id? will it blow up in case its not a decimal type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need the cast and I would prefer to not have it because it shows that we don't have issues mapping from the native scale to the JNI representation.

java/src/main/java/ai/rapids/cudf/DType.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/DataType.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/DataType.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnVector.java Outdated Show resolved Hide resolved
@revans2
Copy link
Contributor

revans2 commented Oct 15, 2020

I had an idea and I really would like some feedback on it. I think we can make a change that is mostly source compatible, but not binary compatible.

What if we renamed DType to something like DTypeEnum and made it an inner class of DataType, which we then rename DataType to DType. Inside that new DType we add in APIs that match the current DType (except for nativeId), and also public static final instances for each non DECIMAL type. I know it is not perfect, but it would let us make very few code changes. It would reduce the load on GC because most of the time there would not be new DType objects created. It would also be obvious in all places where the APIs got the nativeId that they would need to also be updated to look for a scale too.

i.e.

public final class DType {
  public enum DTypeEnum {
    EMPTY(0, 0, "NOT SUPPORTED"),
    INT8(1, 1, "byte"),
    ...
  }

  public static final DType INT8 = new DType(DTypeEnum.INT8);
  ...
}

We could also verify that if you try to create an object from just a nativeId and it is a DECIMAL* we can blow up.

@jlowe
Copy link
Member

jlowe commented Oct 15, 2020

and also public static final instances for each non DECIMAL type

I was already thinking along these lines, but I hadn't considered taking over DType itself. I'm generally +1 on the idea of commandeering DType, although we will have a "knife switch" event when this gets checked in. The snapshot build published afterwards is going to break anything that hasn't recompiled.

I'm still worried about all the places code assumes checking for equality on a data type can be performed with == which is no longer the case, but we need to worry about that whether it's called DataType or DType. Not calling it DataType nicely side-steps the problem identified in HostColumnVector and also the upcoming name confusion with Spark's DataType.

@nartal1
Copy link
Member Author

nartal1 commented Oct 15, 2020

Thanks @revans2 and @jlowe for reviewing and your inputs on new design. I will rework on this and update it here. Liked this idea where it solves confusion on naming DataType.

@sperlingxx
Copy link
Contributor

Hi @revans2 @jlowe , @nartal1 and I are trying to implement the idea DType with DTypeEnum. Both of us want to ensure whether we fully understand this idea. Here are some code snippets to elaborate our understanding:

  public static final DType EMPTY = new DType(DTypeEnum.EMPTY);
  public static final DType INT8 = new DType(DTypeEnum.INT8);
  public static final DType INT16 = new DType(DTypeEnum.INT16);
  public static final DType INT32 = new DType(DTypeEnum.INT32);
  public static final DType INT64 = new DType(DTypeEnum.INT64);
  public static final DType UINT8 = new DType(DTypeEnum.UINT8);
  public static final DType UINT16 = new DType(DTypeEnum.UINT16);
  public static final DType UINT32 = new DType(DTypeEnum.UINT32);
  public static final DType UINT64 = new DType(DTypeEnum.UINT64);
  public static final DType FLOAT32 = new DType(DTypeEnum.FLOAT32);
  public static final DType FLOAT64 = new DType(DTypeEnum.FLOAT64);
  public static final DType BOOL8 = new DType(DTypeEnum.BOOL8);
  public static final DType TIMESTAMP_DAYS = new DType(DTypeEnum.TIMESTAMP_DAYS);
  public static final DType TIMESTAMP_SECONDS = new DType(DTypeEnum.TIMESTAMP_SECONDS);
  public static final DType TIMESTAMP_MILLISECONDS = new DType(DTypeEnum.TIMESTAMP_MILLISECONDS);
  public static final DType TIMESTAMP_MICROSECONDS = new DType(DTypeEnum.TIMESTAMP_MICROSECONDS);
  public static final DType TIMESTAMP_NANOSECONDS = new DType(DTypeEnum.TIMESTAMP_NANOSECONDS);
  public static final DType DURATION_DAYS = new DType(DTypeEnum.DURATION_DAYS);
  public static final DType DURATION_SECONDS = new DType(DTypeEnum.DURATION_SECONDS);
  public static final DType DURATION_MILLISECONDS = new DType(DTypeEnum.DURATION_MILLISECONDS);
  public static final DType DURATION_MICROSECONDS = new DType(DTypeEnum.DURATION_MICROSECONDS);
  public static final DType DURATION_NANOSECONDS = new DType(DTypeEnum.DURATION_NANOSECONDS);
  public static final DType STRING = new DType(DTypeEnum.STRING);
  public static final DType LIST = new DType(DTypeEnum.LIST);
  public static final DType STRUCT = new DType(DTypeEnum.STRUCT);

  private static final DType[] staticDTypes = new DType[] {
      EMPTY,
      INT8,
      INT16,
      INT32,
      INT64,
      UINT8,
      UINT16,
      UINT32,
      UINT64,
      FLOAT32,
      FLOAT64,
      BOOL8,
      TIMESTAMP_DAYS,
      TIMESTAMP_SECONDS,
      TIMESTAMP_MILLISECONDS,
      TIMESTAMP_NANOSECONDS,
      DURATION_DAYS,
      DURATION_SECONDS,
      DURATION_MILLISECONDS,
      DURATION_MICROSECONDS,
      DURATION_NANOSECONDS,
      null, // DICTIONARY32
      STRING,
      LIST,
      null, // DECIMAL32
      null, // DECIMAL64
      STRUCT
  };

  public static DType fromNative(int nativeId, int scale) {
    if (nativeId >=0 && nativeId < staticDTypes.length) {
      if (staticDTypes[nativeId] != null) {
        return staticDTypes[nativeId];
      }
      if (nativeId == DTypeEnum.DECIMAL32.nativeId) {
        // TODO: maybe we can verify scale at here?
        return new DType(DTypeEnum.DECIMAL32, scale);
      }
      if (nativeId == DTypeEnum.DECIMAL64.nativeId) {
        // TODO: maybe we can verify scale at here?
        return new DType(DTypeEnum.DECIMAL64, scale);
      }
    }
    throw new IllegalArgumentException("Could not translate " + nativeId + " into a DType");
  }

  public static DType fromNative(int nativeId) {
    // TODO: replace Integer.MIN_VALUE with a constant as noop placeholder ?
    return fromNative(nativeId, Integer.MIN_VALUE);
  }

@nartal1 nartal1 changed the title [WIP] Initial work for decimal type in Java/JNI [skip ci] [REVIEW] Initial work for decimal type in Java/JNI [skip ci] Oct 24, 2020
@nartal1
Copy link
Member Author

nartal1 commented Oct 24, 2020

Thanks to @sperlingxx for adding scalar tests. I have merged it in this branch. Apologies for bringing up this late. Thought of adding this here to have some tests.

Signed-off-by: Niranjan Artal <[email protected]>
*/
#pragma once

namespace dtype_utils{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is really minor because all of the functions are inline and the header is private to the JNI code, but I would prefer to have the namespace have something with cudf and java/jni in it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nartal1, here is an additional (non-actionable) note on @revans2's namespace comment:

These functions used to be in the "unnamed" namespace in ColumnVectorJni.cpp. Moving it to a separate header and out of the unnamed namespace forced external linkage, and led to the multiple definitions error you saw. This forced your hand with making them inline. You could have alternatively chosen to keep them in the "unnamed" namespace (nested within another cudf::jni namespace).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mythrocks , I have made the changes as per suggestion.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor changes suggested.

*/
#pragma once

namespace dtype_utils{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nartal1, here is an additional (non-actionable) note on @revans2's namespace comment:

These functions used to be in the "unnamed" namespace in ColumnVectorJni.cpp. Moving it to a separate header and out of the unnamed namespace forced external linkage, and led to the multiple definitions error you saw. This forced your hand with making them inline. You could have alternatively chosen to keep them in the "unnamed" namespace (nested within another cudf::jni namespace).

java/src/main/native/src/ScalarJni.cpp Show resolved Hide resolved
@nartal1
Copy link
Member Author

nartal1 commented Oct 26, 2020

@jlowe @mythrocks I think I have addressed review comments. Please take another look.
@razajafri Tagging you as well since you had few comments initially which were resolved :)

Signed-off-by: Niranjan Artal <[email protected]>
Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuDF (Java) Reviewer labels Oct 27, 2020
@nartal1
Copy link
Member Author

nartal1 commented Oct 27, 2020

Thanks for the approvals!

@kkraus14 kkraus14 merged commit 4b4d962 into rapidsai:branch-0.17 Oct 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Java Affects Java cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants