Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add is_valid_integer format check API #7094

Closed
wants to merge 3 commits into from

Conversation

chenrui17
Copy link
Contributor

@revans2 this is my first commit , in order to close #7080 , i just finished the allow_decimal part , there are still overflow checks that have not been completed , please give me some suggestions about this.
In addition , this feature helps me improve my spark query performance 30%~50%, so about how to checks overflow, please give me some idea , Thank you very much.

@chenrui17 chenrui17 requested a review from a team as a code owner January 7, 2021 12:39
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

1 similar comment
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@chenrui17 chenrui17 requested review from vuule and davidwendt January 7, 2021 12:39
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 8 types of integers.

INT8, ///< 1 byte signed integer
INT16, ///< 2 byte signed integer
INT32, ///< 4 byte signed integer
INT64, ///< 8 byte signed integer
UINT8, ///< 1 byte unsigned integer
UINT16, ///< 2 byte unsigned integer
UINT32, ///< 4 byte unsigned integer
UINT64, ///< 8 byte unsigned integer

Both the to_integers() and the to_floats() allow you to specify the output data type so it seems we should take the data type as an input parameter here and use that when checking for overflow.

strings_column_view const& strings,
bool allow_decimal,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not go in this header file. I would suggest creating a new header file for this function.
/cpp/include/cudf/strings/convert/is_valid_integer.hpp

* @param allow_decimal identification whether the format is decimal or not
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New column of boolean results for each string.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doxygen does not match this function. The doxygen in the string.cuh file looks correct and should be moved here instead. You can use the @copydoc in the .cu to refer to the doxygen here.
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/docs/DOCUMENTATION.md#copydoc

@@ -213,6 +213,35 @@ std::unique_ptr<column> is_integer(
return results;
}

std::unique_ptr<column> is_valid_integer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be moved into a new source file instead
/cpp/src/strings/convert/is_valid_integer.cu

This source file should also contain all the decimal integer character checking as well as the overflow checking.

* @param allow_decimal Decimal format or not
* @return true if string has valid integer characters
*/
__device__ bool is_valid_integer(string_view const& d_str ,bool allow_decimal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be move into the is_valid_integer.cu source file as mentioned in another review comment. It should also be updated to check for overflow as well.

*/
std::unique_ptr<column> is_valid_integer(
strings_column_view const& strings,
bool allow_decimal,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter doesn't seem like it belongs. Fixed-point decimal numbers aren't integers. This seems like it should be is_valid_fixed_point or something similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I've been looking up is_numeric(), is_decimal(), etc type APIs in other languages and cannot find one that matches the requirements for #7080 . Perhaps is_valid_numeric() with allow-decimal and allow-exponent flags. Perhaps this is too Spark specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name does not matter too much to me. is_valid_numeric is OK, but I think I like is_valid_fixed_point best because it does appear to match with a fixed point number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of all of these type specific APIs, couldn't we just have:

unique_ptr<column> is_valid_element(strings_column_view strings, data_type type);

i.e., verifies that the string elements in strings can be parsed as the specified type?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_valid_element is fine with me. The main issue would be decimal vs not decimal for an integer. I think we could address that with some flags passed to is_valid_element. I also assume that this PR would only be for integer types and other types would be added later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue would be decimal vs not decimal for an integer

I don't understand the problem. This is what I was thinking:

If you want to know if "12345" is a valid integer you'd do:

is_valid_element( "12345", data_type{INT32})`

If you want to know if it's a valid fixed-point decimal, you do:

is_valid_element("12345", data_type{decimal64, scale, etc.})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, is_valid_element can be a different PR.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenrui17 thanks for jumping on this.

*/
std::unique_ptr<column> is_valid_integer(
strings_column_view const& strings,
bool allow_decimal,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name does not matter too much to me. is_valid_numeric is OK, but I think I like is_valid_fixed_point best because it does appear to match with a fixed point number.

__device__ bool is_valid_integer(string_view const& d_str ,bool allow_decimal)
{
bool decimal_found = false;
if (!allow_decimal) return is_integer(d_str);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we are not going to be able to use is_integer here in the long term. Especially if we want to support range checking. I think the spark code that does this is a decent starting point.

https://github.com/apache/spark/blob/7b06acc28b5c37da6c48bc44c3d921309d4ad3a8/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1120-L1194

You would want to pass in a min_value that corresponds to the type you are checking. You also would never return the resulting value in any way, just the true/false.

@revans2 revans2 added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) labels Jan 7, 2021
@chenrui17 chenrui17 force-pushed the branch-0.18 branch 4 times, most recently from 53cf088 to 537f92c Compare January 8, 2021 10:24
@@ -0,0 +1,47 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyright needs to be 2021, and I am not sure you want to give the copyright to NVIDIA for the code that you have written. You should probably include the copyright for the company you work for instead (NOTE: I am not a lawyer)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed here to my company copyright , i don't know if this is appropriate , Are all the other contributors from NVIDIA ?What did they do

@@ -0,0 +1,173 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same copyright issue. Should be 2021 and probably want to not have it be for NVIDIA unless you work for NVIDIA (again not a lawyer)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is a new file so it does not need the 2019- part.

namespace strings {

/**
* Parses this UTF8String(trimmed if needed) to INT8/16/32/64...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit indentation in the doxegen comments appears to be off.

int offset = 0;
size_type bytes = d_str.size_bytes();
const char* data = d_str.data();
while (offset < bytes && data[offset] == ' ') ++offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support stripping leading white space? I understand why we might want to do it, but if we do then we need to have it documented for this function and documented for the public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cudf::strings::to_integers() does not ignore the leading whitespace.
If this is used to validate a strings column before calling to_integers() then this should not ignore whitespace either.

if (offset == bytes) return false;

int end = bytes - 1;
while (end > offset && data[end] == ' ') --end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Do we want to strip trailing white space (Specifically only the space character?) If so that is fine, but we need to be sure that it is documented clearly.

}

// We are going to process the new digit and accumulate the result. However, before doing
// this, if the result is already smaller than the stopValue(Long.MIN_VALUE / radix), then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Long.MIN_VALUE is no longer correct for this comment.


result = result * radix - digit;

// Since the previous result is less than or equal to stopValue(Long.MIN_VALUE / radix), we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Same here Long.MIN_VALUE should probably be updated in the comment.

*/
std::unique_ptr<column> is_valid_fixed_point(
strings_column_view const& strings,
bool allow_decimal,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the purpose of this parameter anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is [0-9] vs [0-9](\.[0-9])?. if allow_decimal is true then we allow the number to be followed by a decimal point followed and more numbers. If it is false we do not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that much, I just don't understand why this is a desired option. In my mind this would be like having an allow_decimal flag to the is_floating_point API---both strike me as odd.

bool flags are always a code smell for me as they are warning sign of a single responsibility violation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to match what Spark and Hive do when casting strings to int values. Floating point allows for NaN, -Inf, etc that I guess we could special case yet again, but then the overflow wold not be properly checks, which is another feature that we are looking for.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to match what Spark and Hive do when casting strings to int values.

What behavior are you looking for? From my perspective, if allow_decimal is False, wouldn't that effectively make this function is_valid_integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so , here we need 2 functions instead of 1 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think is_valid_integer should be distinct from is_valid_fixed_point. And ultimately we should just have is_valid_element.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what you have discussed, I think we should do this :
@jrhemstad
this pr is for

unique_ptr<column> is_valid_integer(strings_column_view strings, data_type type);

and , i will file a new pr for

unique_ptr<column> is_valid_fixed_point(strings_column_view strings, data_type type);

@revans2
above api contains parameter data_type , that will help us to check overflow.

while, for parameter allow_decimal requirements you mentioned in #7080 , we can judge allow_decimal in spark-rapids, if allow_decimal is true, through JNI layer , we finally call is_valid_fixed_point in cudf cpp module, if allow_decimal is false , we finally call is_valid_integer in cudf cpp module, is it right ? in this way , allow_decimal parameter will not enter cudf.

please give me some advice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CUDF JNI I would rather mirror the underlying C++ API as much as possible. The Spark plugin can decide which CUDF API to call instead of pushing that into cudf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad I'm not sure is_valid_element will do what we want. We are asking for two separate things, and perhaps it should be two separate API calls then. First we want to check if the format of the string matches what we expect, next we want to check if when we parse the string as an integer if the number will overflow. In the current set of APIs is_valid_fixed_point and is_valid_integer the name of the function indicates the format of the string while the type we pass in, indicates the overflow value. If we just have is_valid_element then we would need that same split in roles. Does the format match fixed point and we don't overflow parsing to a byte? If that is too much for a single API to do, then perhaps what we want is to add is_fixed_point along with is_integer and is_float, and also add an API to check if a strings column would overflow when parsing to an integer of type X.


// ready a min_value corresponds to the input type in order to check overflow
long d_min_value = 0;
switch (input_type.id()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be coded using the type_dispatcher instead of a switch statement. Also this should use the std::numeric_limits class instead of hardcoded values.

// This is the case when we've encountered a decimal separator. The fractional
// part will not change the number, but we will verify that the fractional part
// is well formed.
while (offset <= end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this?

Copy link
Contributor Author

@chenrui17 chenrui17 Jan 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to verify the fractional part whether is well formed. if allow_decimal is true and has character '.' , it will break from last while loop , so here we need to verify the fractional part after the decimal point.

++offset;
// We allow decimals and will return a truncated integral in that case.
// Therefore we won't throw an exception here (checking the fractional
// part happens below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// part happens below.
// part happens below).

@@ -0,0 +1,173 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is a new file so it does not need the 2019- part.

/**
* @brief Returns a boolean column identifying strings in which all
* characters are valid for conversion to integers.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description should include how the values are the decimal (if allowed) must all be valid decimal characters.
Also, the description should mention how the overflow depends on the input_type.

int offset = 0;
size_type bytes = d_str.size_bytes();
const char* data = d_str.data();
while (offset < bytes && data[offset] == ' ') ++offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cudf::strings::to_integers() does not ignore the leading whitespace.
If this is used to validate a strings column before calling to_integers() then this should not ignore whitespace either.

@chenrui17 chenrui17 requested a review from revans2 January 12, 2021 10:59
@@ -14,10 +14,14 @@
* limitations under the License.
*/

#include <cudf/strings/convert/is_valid_element.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest moving these to a new test source file /cpp/tests/strings/valid_element.cpp

namespace strings {

/**
* Check whether the UTF8String is valid when convert data from string to all kinds of integers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Check whether the UTF8String is valid when convert data from string to all kinds of integers,
* Check whether the string is valid when convert string to signed integers

Comment on lines 37 to 39
* like INT8/16/32/64. for example, if allow_decimal is true, this will return `true, true` when input string
* is `1.23, 123`, or this function will return `false, true`, it means firt element is invalid,
* while second data is valid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* like INT8/16/32/64. for example, if allow_decimal is true, this will return `true, true` when input string
* is `1.23, 123`, or this function will return `false, true`, it means firt element is invalid,
* while second data is valid.
* like INT8/16/32/64. For example, if `allow_decimal` is true, then strings `['1.23', '123']` will return `[true, true]`.
* If `allow_decimal` is false, then this function will return `[false, true]`.

Comment on lines 26 to 28
#include <strings/utilities.cuh>
#include <strings/utilities.hpp>
#include <strings/utilities.cuh>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <strings/utilities.cuh>
#include <strings/utilities.hpp>
#include <strings/utilities.cuh>

These are not needed.

#include <cudf/column/column_factories.hpp>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/strings/detail/utilities.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <cudf/strings/detail/utilities.hpp>

This header is not required.

}

// We are going to process the new digit and accumulate the result. However,
// before doing this, if the result is already smaller than the stopValue which is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// before doing this, if the result is already smaller than the stopValue which is
// before doing this, if the result is already smaller than the stop_value which is

// We are going to process the new digit and accumulate the result. However,
// before doing this, if the result is already smaller than the stopValue which is
// (std::numeric_limits<data_type>::min() / radix), then result * 10 will definitely
// be smaller than minValue, and we can stop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// be smaller than minValue, and we can stop.
// be smaller than the min value, and we can stop.

Comment on lines 127 to 128
* @brief The dispatch functions for calculate the min value of input data type
* to check overflow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief The dispatch functions for calculate the min value of input data type
* to check overflow.
* @brief The dispatch functions returns the min value of the input data type used
* for checking overflow.

* @brief The dispatch functions for calculate the min value of input data type
* to check overflow.
*
* The output is the min value of spicified type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The output is the min value of spicified type.
* The output is the min value of specified type.

Comment on lines 123 to 124
namespace detail {
namespace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move these two lines up so that it includes the is_valid_element internal device function.

@harrism
Copy link
Member

harrism commented Jan 13, 2021

Reviewers, please make sure the title and description of this PR is made more descriptive before merging.

{
// allow_decimal = true
cudf::test::strings_column_wrapper strings1(
{"+175", "-34", "9.8", "17+2", "+-14", "1234567890", "67de", "", "1e10", "-", "++", "", "21474836482222"});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check for too large negative number too?

namespace detail {
namespace {
/**
* Check whether the string is valid when convert string to signed integers,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing an @brief tag.

Comment on lines +36 to +38
* input_type is used to check whether the data overflows, for example, if input_type is
* `int8_t` and input string data is `128`, then it will return false ,because it out of ranges
* [-128, 127] and overflows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* input_type is used to check whether the data overflows, for example, if input_type is
* `int8_t` and input string data is `128`, then it will return false ,because it out of ranges
* [-128, 127] and overflows.
* `input_type` is used to check whether the data causes an integer overflow. For example, if `input_type` is
* `INT8` and the input `strings[i]` is "128", then the `output[i]` will be `false` since resulting integer
* would be out of range `[-128, 127]` for `int8_t`.

@harrism
Copy link
Member

harrism commented Feb 2, 2021

Let's move this to 0.19 since it needs significant work.

@harrism
Copy link
Member

harrism commented Feb 2, 2021

@chenrui17 you will need to click "edit" at the top and switch the target branch to branch-0.19, and then merge branch-0.19 into your local PR branch before proceeding. Thanks!

@ttnghia
Copy link
Contributor

ttnghia commented Mar 3, 2021

How this is going? I was suggested to take over this if necessary.

@davidwendt
Copy link
Contributor

So is_fixed_point() is available now if that helps this
https://docs.rapids.ai/api/libcudf/nightly/group__strings__convert.html#ga492da4125dd774bf90c458840779b746
Overflow is checked but the decimal point is always allowed.

@ttnghia
Copy link
Contributor

ttnghia commented Mar 10, 2021

So is_fixed_point() is available now if that helps this
https://docs.rapids.ai/api/libcudf/nightly/group__strings__convert.html#ga492da4125dd774bf90c458840779b746
Overflow is checked but the decimal point is always allowed.

@davidwendt is_fixed_point is fantastic. Since that function does range check, I think we should not add the new function is_valid_element, instead we should modifystrings::is_integer to make it doing the same things: checking for both pattern and range. By doing so, we can maintain consistency between those similar functionalities.

@davidwendt
Copy link
Contributor

So is_fixed_point() is available now if that helps this
https://docs.rapids.ai/api/libcudf/nightly/group__strings__convert.html#ga492da4125dd774bf90c458840779b746
Overflow is checked but the decimal point is always allowed.

@davidwendt is_fixed_point is fantastic. Since that function does range check, I think we should not add the new function is_valid_element, instead we should modifystrings::is_integer to make it doing the same things: checking for both pattern and range. By doing so, we can maintain consistency between those similar functionalities.

Sorry, I don't understand why we need to change cudf::strings::is_integer().
Can you just call cudf::strings::is_fixed_point() and then we don't need this PR?

@ttnghia
Copy link
Contributor

ttnghia commented Mar 10, 2021

Sorry, I don't understand why we need to change cudf::strings::is_integer().
Can you just call cudf::strings::is_fixed_point() and then we don't need this PR?

As you said, decimal point is always allowed. is_integer should not allow decimal point.

@davidwendt
Copy link
Contributor

Sorry, I don't understand why we need to change cudf::strings::is_integer().
Can you just call cudf::strings::is_fixed_point() and then we don't need this PR?

As you said, decimal point is always allowed. is_integer should not allow decimal point.

cudf::strings::is_integer() does not allow a decimal point.
I think you are asking for cudf::strings::is_integer() to check for overflow?

@ttnghia
Copy link
Contributor

ttnghia commented Mar 10, 2021

Sorry, I don't understand why we need to change cudf::strings::is_integer().
Can you just call cudf::strings::is_fixed_point() and then we don't need this PR?

As you said, decimal point is always allowed. is_integer should not allow decimal point.

cudf::strings::is_integer() does not allow a decimal point.
I think you are asking for cudf::strings::is_integer() to check for overflow?

Correct. Please also see here: #7557

@ttnghia
Copy link
Contributor

ttnghia commented Mar 18, 2021

Hi @chenrui17.
Sorry that I will not adopt this PR as it is very outdated. I have pushed a new PR addressing the issue of converting string to integer with bounds check (#7642). My code only addresses integer types without decimal. For decimal types, there has been another work for this (is_fixed_point()).

rapids-bot bot pushed a commit that referenced this pull request Mar 24, 2021
…eger conversion (#7642)

This PR addresses #5110, #7080, and rework #7094. It adds the function `cudf::strings::is_integer` that can check if strings can be correctly converted into integer values. Underflow and overflow are also taken into account.

Note that this `cudf::strings::is_integer` is different from the existing `cudf::strings::string::is_integer`, which only checks for pattern and does not care about under/overflow.

Examples:
```
s = { "eee", "-200", "-100", "127", "128", "1.5", NULL}

is_integer(s, INT8) = { 0, 0, 1, 1, 0, 0, NULL}
is_integer(s, INT32) = { 0, 1, 1, 1, 1, 0, NULL}
```

Authors:
  - Nghia Truong (@ttnghia)

Approvers:
  - David (@davidwendt)
  - Jake Hemstad (@jrhemstad)
  - Mark Harris (@harrism)

URL: #7642
@ttnghia ttnghia closed this Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] add something like an is_valid_integer function
8 participants