-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder #10025
JNI: Rewrite growBuffersAndRows to accelerate the HostColumnBuilder #10025
Conversation
Signed-off-by: sperlingxx <[email protected]>
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10025 +/- ##
================================================
+ Coverage 10.37% 10.42% +0.04%
================================================
Files 119 119
Lines 20149 20607 +458
================================================
+ Hits 2091 2148 +57
- Misses 18058 18459 +401
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing this for correctness is great, but this is a performance change and I really would like to see the results of some benchmarks too so we can see if it is getting faster or not. A lot of my comments are things that I think will make the new code faster, but I do not know for sure, and if you benchmark it and find that I am wrong or it makes no difference, then we should keep code that is more readable, than worrying about the performance.
* The Java substitution of native method `ColumnView.getNativeValidPointerSize`. | ||
* Ideally, this method can speed up growValidBuffer by eliminating the JNI call. | ||
*/ | ||
private static long byteSizeOfNullMask(int numRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to look at replacing the JNI call entirely with this. I don't see a lot of reason to have this hidden here when we could make it common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Hi @revans2, I ran a local benchmark to compare the performance before and after this change. I ran the benchmark with Here is the result:
Here is the code snippet of the benchmark: @Test
public void testBenchmarkHostColumnBuilder() {
int runs = 5;
int estimatedRows = 1024;
int totalRows = Integer.MAX_VALUE / 10;
long duration;
HostColumnVector.DataType type = new HostColumnVector.BasicType(true, DType.INT64);
System.out.println("append Long without NULL:");
duration = 0L;
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
cb.append((long) i);
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
System.out.println("append Long:");
duration = 0L;
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
if (i % 10 == 0) {
cb.appendNull();
} else {
cb.append((long) i);
}
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
System.out.println("append String without NULL:");
duration = 0L;
type = new HostColumnVector.BasicType(true, DType.STRING);
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
if (i % 10 == 0) {
cb.appendNull();
} else {
cb.append(String.valueOf(i));
}
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
System.out.println("append String:");
duration = 0L;
type = new HostColumnVector.BasicType(true, DType.STRING);
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
cb.append(String.valueOf(i));
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
System.out.println("append List[Int]:");
duration = 0L;
type = new HostColumnVector.ListType(true,
new HostColumnVector.BasicType(true, DType.INT32));
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
if (i % 10 == 0) {
cb.appendNull();
continue;
}
cb.appendLists(Lists.newArrayList(i - 1, i));
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
System.out.println("append Struct[(Int, Long, String)]:");
duration = 0L;
type = new HostColumnVector.StructType(true,
new HostColumnVector.BasicType(true, DType.INT32),
new HostColumnVector.BasicType(true, DType.INT64),
new HostColumnVector.BasicType(true, DType.STRING));
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
if (i % 10 == 0) {
cb.appendNull();
continue;
}
cb.appendStructValues(
new HostColumnVector.StructData(i, (long) i, String.valueOf(i)));
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
System.out.println("append List[Struct[(Int, Long, String)]]:");
duration = 0L;
type = new HostColumnVector.ListType(true,
new HostColumnVector.StructType(true,
new HostColumnVector.BasicType(true, DType.INT32),
new HostColumnVector.BasicType(true, DType.INT64),
new HostColumnVector.BasicType(true, DType.STRING)));
for (int x = 0; x < runs; x++) {
long start = System.currentTimeMillis();
try (HostColumnVector.ColumnBuilder cb = new HostColumnVector.ColumnBuilder(
type, estimatedRows)) {
for (int i = 0; i < totalRows; i++) {
if (i % 20 == 0) {
cb.appendNull();
} else if (i % 10 == 0) {
cb.appendLists(Lists.newArrayList((HostColumnVector.StructData) null));
} else {
cb.appendLists(Lists.newArrayList(
new HostColumnVector.StructData(i, (long) i, String.valueOf(i))));
}
}
}
duration += System.currentTimeMillis() - start;
}
System.out.println("average cost of " + runs + " runs: " + (float) duration / runs);
} |
For fixed width columns we typically can estimate the total number of rows accurately. So having it start with 1024 rows and grow regularly is not expected. For Strings we guess 100 bytes per String, so with your example we are likely to over estimate there too. For lists we assume one item per list, and in your example will likely match the number of rows. So if we want to be more complete I would like to see one case where we are correct in providing the estimatedRows, and then use this one for the other case where we are likely wrong. Just so we can see the overhead of growing a buffer as separate from the rest of the computation as possible. |
This is looking pretty good to me. |
The numbers are definitely better, but I am not thrilled by them. That said you must have a much better CPU than me because when I run these they are not nearly as good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I am happy with the change. I am a little disappointed that we couldn't make it go even faster, but this is a really good step forward. Just some nits and a bit of cleanup for the API.
* Get the number of bytes needed to allocate a validity buffer for the given number of rows. | ||
* According to cudf::bitmask_allocation_size_bytes, the padding boundary for null mask is 64 bytes. | ||
*/ | ||
public static long getValidityBufferSize(int numRows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: not public. The old API was package private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Signed-off-by: sperlingxx <[email protected]>
rerun tests |
I think this PR is ready. @revans2 |
@gpucibot merge |
According to NVIDIA/spark-rapids#4393, current PR takes several measures to speed up the buffer growing during the build of
HostColumnVector
:rowCapacity
to cache the maximum number of rows/bytesbyteSizeOfNullMask
to get the size of the validity bufferI have tested this PR with the spark-rapids tests locally.
BTW, shall we clean up the
HostColumnVector.Builder
and replace all the usages ofBuilder
withColumnBuilder
?