Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex_program java APIs and unit tests #12548

Merged
merged 28 commits into from
Jan 26, 2023

Conversation

cindyyuanjiang
Copy link
Contributor

@cindyyuanjiang cindyyuanjiang commented Jan 14, 2023

Description

Adds a set of java regex APIs that take in a regex_program as parameter and java unit tests. This is part of the solution for NVIDIA/spark-rapids#7295.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rapids-bot
Copy link

rapids-bot bot commented Jan 14, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Jan 14, 2023
@mythrocks mythrocks self-requested a review January 18, 2023 19:27
@mythrocks mythrocks added feature request New feature or request non-breaking Non-breaking change labels Jan 18, 2023
@mythrocks
Copy link
Contributor

@cindyyuanjiang: I've added Non-breaking and Feature labels to this PR, so that CI might run.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor nitpicks to start.
I have yet to go over the tests.

java/src/main/java/ai/rapids/cudf/CaptureGroups.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/RegexProgram.java Outdated Show resolved Hide resolved
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple more minor nitpicks.

java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/RegexFlags.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/RegexFlags.java Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
java/src/test/java/ai/rapids/cudf/ColumnVectorTest.java Outdated Show resolved Hide resolved
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, LGTM. A couple of very minor nitpicks remain.

@cindyyuanjiang cindyyuanjiang marked this pull request as ready for review January 19, 2023 01:14
@cindyyuanjiang cindyyuanjiang requested a review from a team as a code owner January 19, 2023 01:14
@cindyyuanjiang cindyyuanjiang changed the title [WIP] Add regex_program java APIs and unit tests Add regex_program java APIs and unit tests Jan 19, 2023
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/ColumnView.java Outdated Show resolved Hide resolved
java/src/main/java/ai/rapids/cudf/RegexFlags.java Outdated Show resolved Hide resolved
public CaptureGroups capture() {
return capture;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we add in native APIs to allow us to call the following APIs.

/**
* @brief Return the number of instructions in this instance
*
* @return Number of instructions
*/
int32_t instructions_count() const;
/**
* @brief Return the number of capture groups in this instance
*
* @return Number of groups
*/
int32_t groups_count() const;
/**
* @brief Return the pattern used to create this instance
*
* @param num_strings Number of strings for computation
* @return Size of the working memory in bytes
*/
std::size_t compute_working_memory_size(int32_t num_strings) const;

I am fine if it is a follow on issue that we do later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely add them in a follow up issue. I find the current PR a bit large to track things easily. Thank you for the comment!

@codecov
Copy link

codecov bot commented Jan 25, 2023

Codecov Report

Base: 86.58% // Head: 85.73% // Decreases project coverage by -0.85% ⚠️

Coverage data is based on head (2554907) compared to base (b6dccb3).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.02   #12548      +/-   ##
================================================
- Coverage         86.58%   85.73%   -0.85%     
================================================
  Files               155      155              
  Lines             24368    24889     +521     
================================================
+ Hits              21098    21339     +241     
- Misses             3270     3550     +280     
Impacted Files Coverage Δ
python/cudf/cudf/_version.py 1.41% <0.00%> (-98.59%) ⬇️
python/cudf/cudf/core/buffer/spill_manager.py 72.50% <0.00%> (-7.50%) ⬇️
python/cudf/cudf/utils/dtypes.py 77.85% <0.00%> (-1.61%) ⬇️
python/cudf/cudf/options.py 86.11% <0.00%> (-1.59%) ⬇️
python/cudf/cudf/core/single_column_frame.py 94.30% <0.00%> (-1.27%) ⬇️
python/cudf/cudf/io/json.py 91.04% <0.00%> (-1.02%) ⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py 98.38% <0.00%> (-1.01%) ⬇️
python/dask_cudf/dask_cudf/io/csv.py 96.34% <0.00%> (-1.00%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 91.81% <0.00%> (-0.59%) ⬇️
python/cudf/cudf/core/multiindex.py 91.66% <0.00%> (-0.51%) ⬇️
... and 47 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ajschmidt8
Copy link
Member

/ok to test

@revans2
Copy link
Contributor

revans2 commented Jan 25, 2023

The test failures appear to be totally unrelated. It failed in a boolean segmented sort test. Not sure if you just need to upmerge or what.

@revans2
Copy link
Contributor

revans2 commented Jan 25, 2023

rerun tests

@revans2
Copy link
Contributor

revans2 commented Jan 25, 2023

hmm the same tests failed again. I wonder if others are seeing these same failures or not.

@revans2
Copy link
Contributor

revans2 commented Jan 25, 2023

@davidwendt it looks like SegmentedSortInt.Bool is failing in this PR only on v100 and this PR didn't touch any code that should be impacting this path. Is the patch just out of date or is there some other issue going on here? Calling you out because I think you added this test after there was a bug in CUB with segmented sort.

@ajschmidt8
Copy link
Member

rerun tests

@revans2, the rerun tests command does not work with the new GitHub Actions CI setup.

You can check out the docs below for information on how to rerun tests in GH Actions

@davidwendt
Copy link
Contributor

@davidwendt it looks like SegmentedSortInt.Bool is failing in this PR only on v100 and this PR didn't touch any code that should be impacting this path. Is the patch just out of date or is there some other issue going on here? Calling you out because I think you added this test after there was a bug in CUB with segmented sort.

Yes, this was a cmake error that is fixed by rapidsai/rapids-cmake#353
You will need to do a Rerun All Jobs as referenced by AJ in his comment.

@revans2
Copy link
Contributor

revans2 commented Jan 26, 2023

@cindyyuanjiang your branch is still out of date. Please upmerge again. Also because this is deprecating some APIs that the spark plugin uses we are going to need to put up a PR in the plugin that adds the warnings to an allow list so we don't break the build.

You will need to add an annotation similar to

@scala.annotation.nowarn("msg=method XXX in class YYY is deprecated")

You can put this on each off the methods that calls one of the deprecated methods. Sorry I know this is a pain.

@ajschmidt8
Copy link
Member

@cindyyuanjiang your branch is still out of date. Please upmerge again

@revans2, a fully up-to-date PR branch is not required for a PR to merge.

If a PR has all of the required reviews and all of the required GitHub checks are passing, then it can be merged by someone with write permissions or greater posting /merge as a comment on the PR.

The Update Branch button is a GitHub feature that we enabled (src) to compliment the new Recently Updated check shown in the screenshot below.

You can read about the Recently Updated check here: https://docs.rapids.ai/resources/recently-updated/

The Update Branch button and the Recently Updated check are separate but complimentary features.

image

At this time, the Recently Updated check is not required, but that may change in the future.

@revans2
Copy link
Contributor

revans2 commented Jan 26, 2023

@ajschmidt8 you are right. I just got confused when I saw

Screenshot from 2023-01-26 09-23-05

But it is not a blocker....

@revans2
Copy link
Contributor

revans2 commented Jan 26, 2023

/merge

@rapids-bot rapids-bot bot merged commit 20c945b into rapidsai:branch-23.02 Jan 26, 2023
@cindyyuanjiang
Copy link
Contributor Author

cindyyuanjiang commented Jan 26, 2023

@cindyyuanjiang your branch is still out of date. Please upmerge again. Also because this is deprecating some APIs that the spark plugin uses we are going to need to put up a PR in the plugin that adds the warnings to an allow list so we don't break the build.

@revans2 Thank you very much! I will follow up with the spark plugin side.

@revans2
Copy link
Contributor

revans2 commented Jan 26, 2023

@cindyyuanjiang it's OK I have a patch for the warnings almost ready to go. I realized I hit merge a bit too soon on this so I figured I would take the hit and make the patch

@cindyyuanjiang cindyyuanjiang deleted the regex-program-cudf branch January 26, 2023 19:00
@jlowe jlowe mentioned this pull request Jan 26, 2023
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Jan 27, 2023
#12548 introduced a number of GPU resource leaks in ColumnVectorTest.  This cleans them up by wrapping them in `try` blocks.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jim Brennan (https://github.com/jbrennan333)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12625
cindyyuanjiang added a commit to cindyyuanjiang/cudf that referenced this pull request Jan 28, 2023
cindyyuanjiang added a commit to cindyyuanjiang/cudf that referenced this pull request Jan 28, 2023
msadang pushed a commit that referenced this pull request Jan 31, 2023
* Revert "Fix leaks in ColumnVectorTest (#12625)"

This reverts commit fb17ac7.

* Revert "Add `regex_program` java APIs and unit tests (#12548)"

This reverts commit 20c945b.

---------

Co-authored-by: Cindy Jiang <[email protected]>
Co-authored-by: GALI PREM SAGAR <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants