Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder #10025

Merged
merged 11 commits into from
Jan 31, 2022

Conversation

sperlingxx
Copy link
Contributor

@sperlingxx sperlingxx commented Jan 12, 2022

According to NVIDIA/spark-rapids#4393, current PR takes several measures to speed up the buffer growing during the build of HostColumnVector:

  1. Introduce rowCapacity to cache the maximum number of rows/bytes
  2. Introduce pura Java method byteSizeOfNullMask to get the size of the validity buffer
  3. Reorganize the code structure to reduce the number of method calls

I have tested this PR with the spark-rapids tests locally.
BTW, shall we clean up the HostColumnVector.Builder and replace all the usages of Builder with ColumnBuilder?

@sperlingxx sperlingxx requested a review from a team as a code owner January 12, 2022 10:15
@github-actions github-actions bot added the Java Affects Java cuDF API. label Jan 12, 2022
@sperlingxx sperlingxx added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 12, 2022
@codecov
Copy link

codecov bot commented Jan 12, 2022

Codecov Report

Merging #10025 (1f67210) into branch-22.04 (e24fa8f) will increase coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10025      +/-   ##
================================================
+ Coverage         10.37%   10.42%   +0.04%     
================================================
  Files               119      119              
  Lines             20149    20607     +458     
================================================
+ Hits               2091     2148      +57     
- Misses            18058    18459     +401     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/orc.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/parquet.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/decimal.py 0.00% <0.00%> (ø)
... and 83 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd968f3...1f67210. Read the comment docs.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this for correctness is great, but this is a performance change and I really would like to see the results of some benchmarks too so we can see if it is getting faster or not. A lot of my comments are things that I think will make the new code faster, but I do not know for sure, and if you benchmark it and find that I am wrong or it makes no difference, then we should keep code that is more readable, than worrying about the performance.

java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
* The Java substitution of native method `ColumnView.getNativeValidPointerSize`.
* Ideally, this method can speed up growValidBuffer by eliminating the JNI call.
*/
private static long byteSizeOfNullMask(int numRows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to look at replacing the JNI call entirely with this. I don't see a lot of reason to have this hidden here when we could make it common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
@sperlingxx
Copy link
Contributor Author

Hi @revans2, I ran a local benchmark to compare the performance before and after this change. I ran the benchmark with maven-surefire-plugin. And I disabled the all assertions with tag enableAssertions. I ran each condition for 5 times to compute the average cost.

Here is the result:

Original Implementation Current Implementation
Long without Null 2266.2 ms 1292.6 ms
Long 4794.0 ms 1647.4 ms
String without Null 18498.6 ms 14863.4 ms
String 16647.8 ms 15922.6 ms
List[Int] 12566.4 ms 8268.0 ms
Struct[(Int, Long, String)] 37077.0 ms 23791.0 ms
List[Struct[(Int, Long, String)]} 50902.8 ms 32077.0 ms

Here is the code snippet of the benchmark:

  @Test
  public void testBenchmarkHostColumnBuilder() {
    int runs = 5;
    int estimatedRows = 1024;
    int totalRows = Integer.MAX_VALUE / 10;
    long duration;
    HostColumnVector.DataType type = new HostColumnVector.BasicType(true, DType.INT64);

    System.out.println("append Long without NULL:");
    duration = 0L;
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          cb.append((long) i);
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);

    System.out.println("append Long:");
    duration = 0L;
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          if (i % 10 == 0) {
            cb.appendNull();
          } else {
            cb.append((long) i);
          }
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);

    System.out.println("append String without NULL:");
    duration = 0L;
    type = new HostColumnVector.BasicType(true, DType.STRING);
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          if (i % 10 == 0) {
            cb.appendNull();
          } else {
            cb.append(String.valueOf(i));
          }
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);

    System.out.println("append String:");
    duration = 0L;
    type = new HostColumnVector.BasicType(true, DType.STRING);
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          cb.append(String.valueOf(i));
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);

    System.out.println("append List[Int]:");
    duration = 0L;
    type = new HostColumnVector.ListType(true,
        new HostColumnVector.BasicType(true, DType.INT32));
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          if (i % 10 == 0) {
            cb.appendNull();
            continue;
          }
          cb.appendLists(Lists.newArrayList(i - 1, i));
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);

    System.out.println("append Struct[(Int, Long, String)]:");
    duration = 0L;
    type = new HostColumnVector.StructType(true,
        new HostColumnVector.BasicType(true, DType.INT32),
        new HostColumnVector.BasicType(true, DType.INT64),
        new HostColumnVector.BasicType(true, DType.STRING));
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          if (i % 10 == 0) {
            cb.appendNull();
            continue;
          }
          cb.appendStructValues(
              new HostColumnVector.StructData(i, (long) i, String.valueOf(i)));
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);

    System.out.println("append List[Struct[(Int, Long, String)]]:");
    duration = 0L;
    type = new HostColumnVector.ListType(true,
        new HostColumnVector.StructType(true,
            new HostColumnVector.BasicType(true, DType.INT32),
            new HostColumnVector.BasicType(true, DType.INT64),
            new HostColumnVector.BasicType(true, DType.STRING)));
    for (int x = 0; x < runs; x++) {
      long start = System.currentTimeMillis();
      try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
          type, estimatedRows)) {
        for (int i = 0; i < totalRows; i++) {
          if (i % 20 == 0) {
            cb.appendNull();
          } else if (i % 10 == 0) {
            cb.appendLists(Lists.newArrayList((HostColumnVector.StructData) null));
          } else {
            cb.appendLists(Lists.newArrayList(
                new HostColumnVector.StructData(i, (long) i, String.valueOf(i))));
          }
        }
      }
      duration += System.currentTimeMillis() - start;
    }
    System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
  }

@sperlingxx sperlingxx requested a review from revans2 January 14, 2022 10:26
@revans2
Copy link
Contributor

revans2 commented Jan 14, 2022

For fixed width columns we typically can estimate the total number of rows accurately. So having it start with 1024 rows and grow regularly is not expected.

For Strings we guess 100 bytes per String, so with your example we are likely to over estimate there too. For lists we assume one item per list, and in your example will likely match the number of rows.

So if we want to be more complete I would like to see one case where we are correct in providing the estimatedRows, and then use this one for the other case where we are likely wrong. Just so we can see the overhead of growing a buffer as separate from the rest of the computation as possible.

@kuhushukla
Copy link
Contributor

This is looking pretty good to me.

@revans2
Copy link
Contributor

revans2 commented Jan 14, 2022

  Original Implementation Current Implementation Orig Speed (estimated) Current Speed (estimated)
Long without Null 2266.2 ms 1292.6 ms 0.71 GiB/s 1.24 GiB/s
Long 4794.0 ms 1647.4 ms 0.33 GiB/s 0.97 GiB/s
String without Null 18498.6 ms 14863.4 ms 0.09 GiB/s 0.11 GiB/s
String 16647.8 ms 15922.6 ms 0.10 GiB/s 0.11 GiB/s
List[Int] 12566.4 ms 8268.0 ms 0.13 GiB/s 0.19 GiB/s
Struct[(Int, Long, String)] 37077.0 ms 23791.0 ms 0.11 GiB/s 0.17 GiB/s
List[Struct[(Int, Long, String)]} 50902.8 ms 32077.0 ms 0.08 GiB/s 0.13 GiB/s

The numbers are definitely better, but I am not thrilled by them. That said you must have a much better CPU than me because when I run these they are not nearly as good.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I am happy with the change. I am a little disappointed that we couldn't make it go even faster, but this is a really good step forward. Just some nits and a bit of cleanup for the API.

* Get the number of bytes needed to allocate a validity buffer for the given number of rows.
* According to cudf::bitmask_allocation_size_bytes, the padding boundary for null mask is 64 bytes.
*/
public static long getValidityBufferSize(int numRows) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not public. The old API was package private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/HostColumnVector.java Outdated Show resolved Hide resolved
@sperlingxx
Copy link
Contributor Author

rerun tests

@sperlingxx sperlingxx requested a review from revans2 January 25, 2022 02:04
@sperlingxx
Copy link
Contributor Author

sperlingxx commented Jan 25, 2022

I think this PR is ready. @revans2

@revans2
Copy link
Contributor

revans2 commented Jan 31, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b217d7e into rapidsai:branch-22.04 Jan 31, 2022
@sperlingxx sperlingxx deleted the opt_col_builder branch February 14, 2022 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants