Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add JNI methods for detecting and purging non-empty nulls from LIST and STRUCT #12742

Merged
merged 19 commits into from
Feb 27, 2023

Conversation

razajafri
Copy link
Contributor

@razajafri razajafri commented Feb 9, 2023

Description

This PR adds methods for detecting and purging non-empty nulls.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@razajafri razajafri added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 9, 2023
@razajafri razajafri self-assigned this Feb 9, 2023
@razajafri razajafri requested a review from a team as a code owner February 9, 2023 04:47
@github-actions github-actions bot added the Java Affects Java cuDF API. label Feb 9, 2023
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for this? We are working on the cudf side, trying to totally eliminate the non-empty null issue.

@razajafri
Copy link
Contributor Author

razajafri commented Feb 9, 2023

What is the use case for this? We are working on the cudf side, trying to totally eliminate the non-empty null issue.

This is needed because we need to support HashAggregate[List] which cannot be done unless we support sorting with non-empty nulls, so as a work around cudf has these functions which we can use in the interim. For reference checkout NVIDIA/spark-rapids#6679, NVIDIA/spark-rapids#5430 and perhaps follow the trail from NVIDIA/spark-rapids#6680.

How far is the team in figuring out the solution for this and when is the fix in cudf expected to go in? An alternative, and my preference, would be to let this go in for now and revert it when cudf is ready.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 9, 2023

I see the issue with ColumnView. Ideally, this (purge non-empty nulls) should only be called in cudf or JNI C++, not in Java. Unfortunately in Java we also have ColumnView constructor (and similar) that allow non-empty nulls to exist.

In addition, the PR description doesn't have any context/meaningful description. Please update it.

@razajafri
Copy link
Contributor Author

I see the issue with ColumnView. Ideally, this (purge non-empty nulls) should only be called in cudf or JNI C++, not in Java. Unfortunately in Java we also have ColumnView constructor (and similar) that allow non-empty nulls to exist.

In addition, the PR description doesn't have any context/meaningful description. Please update it.

ColumnView in Java just calls into the cpp code, if cpp guarantees there will be no non-empty nulls then we won't need to call this from Java. As you said if we are working on a proper fix this wouldn't be needed hopefully.

@razajafri
Copy link
Contributor Author

I am having difficulty running pre-commit hooks and thus having to push multiple updates due to formatting errors

@ttnghia
Copy link
Contributor

ttnghia commented Feb 10, 2023

I am having difficulty running pre-commit hooks and thus having to push multiple updates due to formatting errors

You should run clang-format by yourself, not pre-commit hooks.

@razajafri razajafri requested a review from ttnghia February 14, 2023 00:20
@razajafri razajafri requested a review from mythrocks February 14, 2023 00:20
@razajafri
Copy link
Contributor Author

/test

@razajafri
Copy link
Contributor Author

re-run tests

@ttnghia
Copy link
Contributor

ttnghia commented Feb 17, 2023

CI Java failed due to:

Error:  Tests run: 311, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 8.111 s <<< FAILURE! - in ai.rapids.cudf.ColumnVectorTest
Error:  ai.rapids.cudf.ColumnVectorTest  Time elapsed: 6.67 s  <<< ERROR!
ai.rapids.cudf.RmmException: Could not shut down RMM there appear to be outstanding allocations

@razajafri
Copy link
Contributor Author

CI Java failed due to:

Error:  Tests run: 311, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 8.111 s <<< FAILURE! - in ai.rapids.cudf.ColumnVectorTest
Error:  ai.rapids.cudf.ColumnVectorTest  Time elapsed: 6.67 s  <<< ERROR!
ai.rapids.cudf.RmmException: Could not shut down RMM there appear to be outstanding allocations

Yes, I can see that thanks. The reason why I was confused about the CI failure was that the tests are passing locally for me, so I was wondering if the failure was because of something going haywire in the CI because the "update branch" was enabled.

@razajafri
Copy link
Contributor Author

The CI job is failing but when I try to run the exact CI jobs using @ajschmidt8 help and by following steps listed here. The tests pass for me locally and inside rapidsai ci docker.

We are still looking into the cause of this

List<Integer> list1 = Arrays.asList(4, 5, null);
List<Integer> list2 = Arrays.asList(7, 8, 9);
List<Integer> list3 = null;
ColumnVector input = makeListsColumn(DType.INT32, list0, list1, list2, list3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI failed because there is memory leak in Java tests. Sorry I overlooked the Java tests. Please fix them.

Copy link
Contributor

@ttnghia ttnghia Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnVector as well as HostColumnVector must be wrapped in try blocks. Probably the buffer vars need to be wrapped too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... makes me want to have this failure on my local so we can avoid this. It will still be good to get to the bottom of this discrepancy.

In the meanwhile, I will push a fix. Thanks for pointing us to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the discrepancy isn't ideal.

Could it be caused by the types of GPUs that are used in CI vs. on our local machines?

I'm wondering if that or the amount of memory on a given GPU would contribute to this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is a variable that controls the test behavior, and that var is set by the local environment. @revans2 Do you know about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a memory leak happens and java's GC is fast, then the leak is cleared up before RMM is shut down and there is no error. If GC is slow, then the leak is not cleared and you get the error. We have a flag to print out messages to help debug leaks, but it is off by default because it makes the tests take a lot longer. Just set -Dai.rapids.refcount.debug=true on the maven commend line when running the tests. Even if they pass you can then look at the logs to see if any leaks were detected and use the stack traces to track down what is leaked and why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in this case aren't the vectors off the heap? Did you mean the MemoryCleaner instead of GC?

@@ -6693,6 +6693,9 @@ void testApplyBooleanMaskFromListOfStructure() {
}
}

/**
* The caller needs to make sure to close the returned ColumnView
Copy link
Contributor

@ttnghia ttnghia Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the returned ColumnView is not enough. There are more local variables/columns need to be closed.

Suggested change
* The caller needs to make sure to close the returned ColumnView
* The caller needs to make sure to call this function in a `try` block as local variables need to be closed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, I am rushing through this and context-switching isn't helping either. I think it will make sense for me to close all the local allocations and leave only the returned vector to be closed by the caller.

Comment on lines 6720 to 6723
ColumnView colWithNonEmptyNulls = new ColumnView(input.type, input.rows, Optional.of(2L), dmb,
input.getDeviceBufferFor(BufferType.OFFSET), input.getChildColumnViews());
assertEquals(2, colWithNonEmptyNulls.nullCount);
return colWithNonEmptyNulls;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this returned vector should be in this try block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not the problem, the problem is that the input column will be closed while the view won't. Unfortunately my build env is completely ruined and I will have to spend time to fix it before I can proceed. Thanks for your quick reviews

@razajafri
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit ac1cac6 into rapidsai:branch-23.04 Feb 27, 2023
@razajafri razajafri deleted the SP-5430-non-empty branch February 27, 2023 04:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants