Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement per-list sequence #9839

Merged
merged 25 commits into from
Jan 4, 2022
Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Dec 3, 2021

This PR adds lists::sequences API, allowing to generate per-list sequence. In particular, it allows generating a lists column in which each list is a sequence of numbers/durations. These sequences are generated individually from separate sets of (start, step, size) input values.

Closes #9424.

Note: lists::sequences supports only numeric types (integer types + floating-point types) and duration types.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Dec 3, 2021
@ttnghia ttnghia requested review from vyasr and nvdbaranec December 3, 2021 16:37
@ttnghia ttnghia self-assigned this Dec 3, 2021
@github-actions github-actions bot added the CMake CMake build issue label Dec 3, 2021
@ttnghia ttnghia marked this pull request as ready for review December 3, 2021 20:39
@ttnghia ttnghia requested review from a team as code owners December 3, 2021 20:39
@rapidsai rapidsai deleted a comment from codecov bot Dec 3, 2021
@ttnghia ttnghia requested a review from robertmaynard December 3, 2021 20:40
@rapidsai rapidsai deleted a comment from codecov bot Dec 3, 2021
@vuule
Copy link
Contributor

vuule commented Dec 4, 2021

rerun tests

@rapidsai rapidsai deleted a comment from codecov bot Dec 7, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 7, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 7, 2021
@rapidsai rapidsai deleted a comment from codecov bot Dec 14, 2021
@wbo4958
Copy link
Contributor

wbo4958 commented Dec 14, 2021

Hi @ttnghia

Looks like when size has some negative values, the sequences result will have some issues.

ColumnVector start = ColumnVector.fromBoxedInts(1, 2, 3, 4, 5, 6);
ColumnVector size = ColumnVector.fromBoxedInts(2, 7, -4, 2, 5, 2);

sequences(start, size) will get below result, the second row is not correct

1 2 
2 3 4 4 5 5 6 
null
4 5 
5 6 7 8 9 
6 7 

@wbo4958
Copy link
Contributor

wbo4958 commented Dec 14, 2021

BTW, When I tested the sequences with float start and float-step. The result returned from CUDF seems to be INT32 type and the result seems not correct.

float[] x = {1.2f, 2.2f, 3.3f};
Integer[] sizeV = new Integer[]{1, 2, 3};
ColumnVector start = ColumnVector.fromFloats(x);
ColumnVector size = ColumnVector.fromBoxedInts(sizeV);
ColumnVector step = ColumnVector.fromFloats(x);

The output is like that

1 2 
2 3 4 5 6 7 8 
3 4 5 6 
4 5 
5 6 7 8 9 
6 7 

@rapidsai rapidsai deleted a comment from codecov bot Dec 14, 2021
@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 14, 2021

Hi @ttnghia

Looks like when size has some negative values, the sequences result will have some issues.

Yes, the API explicitly says that "if the input size is negative then output is undefined". Thus, the caller must pass in size column containing values that are at least 0.

@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 14, 2021

BTW, When I tested the sequences with float start and float-step. The result returned from CUDF seems to be INT32 type and the result seems not correct.

Sorry I can't reproduce your issue. I ran your example and get this:

List<float>:
Length : 3
Offsets : 0, 1, 3, 6
   1.20000005, 2.20000005, 4.4000001, 3.29999995, 6.5999999, 9.89999962

So maybe something else is wrong with your test?

@wbo4958
Copy link
Contributor

wbo4958 commented Dec 15, 2021

Okay, Got it. Thx for the explanation

@wbo4958
Copy link
Contributor

wbo4958 commented Dec 15, 2021

BTW, When I tested the sequences with float start and float-step. The result returned from CUDF seems to be INT32 type and the result seems not correct.

Sorry I can't reproduce your issue. I ran your example and get this:

List<float>:
Length : 3
Offsets : 0, 1, 3, 6
   1.20000005, 2.20000005, 4.4000001, 3.29999995, 6.5999999, 9.89999962

So maybe something else is wrong with your test?

Sorry, it turned out my test was wrong.

@codecov

This comment has been minimized.

@ttnghia ttnghia requested a review from a team December 16, 2021 18:41
@ttnghia
Copy link
Contributor Author

ttnghia commented Jan 4, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 36fa5f3 into rapidsai:branch-22.02 Jan 4, 2022
@ttnghia ttnghia deleted the list_sequences branch January 6, 2022 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add API for generating per-list sequences
7 participants