-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add is_valid_integer format check API #7094
Conversation
Can one of the admins verify this patch? |
1 similar comment
Can one of the admins verify this patch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 8 types of integers.
cudf/cpp/include/cudf/types.hpp
Lines 198 to 205 in f768da7
INT8, ///< 1 byte signed integer | |
INT16, ///< 2 byte signed integer | |
INT32, ///< 4 byte signed integer | |
INT64, ///< 8 byte signed integer | |
UINT8, ///< 1 byte unsigned integer | |
UINT16, ///< 2 byte unsigned integer | |
UINT32, ///< 4 byte unsigned integer | |
UINT64, ///< 8 byte unsigned integer |
Both the to_integers()
and the to_floats()
allow you to specify the output data type so it seems we should take the data type as an input parameter here and use that when checking for overflow.
strings_column_view const& strings, | ||
bool allow_decimal, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not go in this header file. I would suggest creating a new header file for this function.
/cpp/include/cudf/strings/convert/is_valid_integer.hpp
* @param allow_decimal identification whether the format is decimal or not | ||
* @param mr Device memory resource used to allocate the returned column's device memory. | ||
* @return New column of boolean results for each string. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doxygen does not match this function. The doxygen in the string.cuh
file looks correct and should be moved here instead. You can use the @copydoc
in the .cu
to refer to the doxygen here.
https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/docs/DOCUMENTATION.md#copydoc
@@ -213,6 +213,35 @@ std::unique_ptr<column> is_integer( | |||
return results; | |||
} | |||
|
|||
std::unique_ptr<column> is_valid_integer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should be moved into a new source file instead
/cpp/src/strings/convert/is_valid_integer.cu
This source file should also contain all the decimal integer character checking as well as the overflow checking.
cpp/include/cudf/strings/string.cuh
Outdated
* @param allow_decimal Decimal format or not | ||
* @return true if string has valid integer characters | ||
*/ | ||
__device__ bool is_valid_integer(string_view const& d_str ,bool allow_decimal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should be move into the is_valid_integer.cu
source file as mentioned in another review comment. It should also be updated to check for overflow as well.
*/ | ||
std::unique_ptr<column> is_valid_integer( | ||
strings_column_view const& strings, | ||
bool allow_decimal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter doesn't seem like it belongs. Fixed-point decimal numbers aren't integers. This seems like it should be is_valid_fixed_point
or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I've been looking up is_numeric(), is_decimal(), etc
type APIs in other languages and cannot find one that matches the requirements for #7080 . Perhaps is_valid_numeric()
with allow-decimal and allow-exponent flags. Perhaps this is too Spark specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name does not matter too much to me. is_valid_numeric
is OK, but I think I like is_valid_fixed_point
best because it does appear to match with a fixed point number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of all of these type specific APIs, couldn't we just have:
unique_ptr<column> is_valid_element(strings_column_view strings, data_type type);
i.e., verifies that the string elements in strings
can be parsed as the specified type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_valid_element
is fine with me. The main issue would be decimal vs not decimal for an integer. I think we could address that with some flags passed to is_valid_element
. I also assume that this PR would only be for integer types and other types would be added later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main issue would be decimal vs not decimal for an integer
I don't understand the problem. This is what I was thinking:
If you want to know if "12345"
is a valid integer you'd do:
is_valid_element( "12345", data_type{INT32})`
If you want to know if it's a valid fixed-point decimal, you do:
is_valid_element("12345", data_type{decimal64, scale, etc.})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, is_valid_element
can be a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenrui17 thanks for jumping on this.
*/ | ||
std::unique_ptr<column> is_valid_integer( | ||
strings_column_view const& strings, | ||
bool allow_decimal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name does not matter too much to me. is_valid_numeric
is OK, but I think I like is_valid_fixed_point
best because it does appear to match with a fixed point number.
cpp/include/cudf/strings/string.cuh
Outdated
__device__ bool is_valid_integer(string_view const& d_str ,bool allow_decimal) | ||
{ | ||
bool decimal_found = false; | ||
if (!allow_decimal) return is_integer(d_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we are not going to be able to use is_integer
here in the long term. Especially if we want to support range checking. I think the spark code that does this is a decent starting point.
You would want to pass in a min_value
that corresponds to the type you are checking. You also would never return the resulting value in any way, just the true/false.
53cf088
to
537f92c
Compare
@@ -0,0 +1,47 @@ | |||
/* | |||
* Copyright (c) 2019, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copyright needs to be 2021, and I am not sure you want to give the copyright to NVIDIA for the code that you have written. You should probably include the copyright for the company you work for instead (NOTE: I am not a lawyer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed here to my company copyright , i don't know if this is appropriate , Are all the other contributors from NVIDIA ?What did they do
@@ -0,0 +1,173 @@ | |||
/* | |||
* Copyright (c) 2019-2020, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same copyright issue. Should be 2021 and probably want to not have it be for NVIDIA unless you work for NVIDIA (again not a lawyer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this is a new file so it does not need the 2019-
part.
namespace strings { | ||
|
||
/** | ||
* Parses this UTF8String(trimmed if needed) to INT8/16/32/64... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit indentation in the doxegen comments appears to be off.
int offset = 0; | ||
size_type bytes = d_str.size_bytes(); | ||
const char* data = d_str.data(); | ||
while (offset < bytes && data[offset] == ' ') ++offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to support stripping leading white space? I understand why we might want to do it, but if we do then we need to have it documented for this function and documented for the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cudf::strings::to_integers()
does not ignore the leading whitespace.
If this is used to validate a strings column before calling to_integers()
then this should not ignore whitespace either.
if (offset == bytes) return false; | ||
|
||
int end = bytes - 1; | ||
while (end > offset && data[end] == ' ') --end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above. Do we want to strip trailing white space (Specifically only the space character?) If so that is fine, but we need to be sure that it is documented clearly.
} | ||
|
||
// We are going to process the new digit and accumulate the result. However, before doing | ||
// this, if the result is already smaller than the stopValue(Long.MIN_VALUE / radix), then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit Long.MIN_VALUE
is no longer correct for this comment.
|
||
result = result * radix - digit; | ||
|
||
// Since the previous result is less than or equal to stopValue(Long.MIN_VALUE / radix), we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Same here Long.MIN_VALUE
should probably be updated in the comment.
*/ | ||
std::unique_ptr<column> is_valid_fixed_point( | ||
strings_column_view const& strings, | ||
bool allow_decimal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the purpose of this parameter anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is [0-9]
vs [0-9](\.[0-9])?
. if allow_decimal
is true then we allow the number to be followed by a decimal point followed and more numbers. If it is false we do not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood that much, I just don't understand why this is a desired option. In my mind this would be like having an allow_decimal
flag to the is_floating_point
API---both strike me as odd.
bool
flags are always a code smell for me as they are warning sign of a single responsibility violation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to match what Spark and Hive do when casting strings to int values. Floating point allows for NaN
, -Inf
, etc that I guess we could special case yet again, but then the overflow wold not be properly checks, which is another feature that we are looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to match what Spark and Hive do when casting strings to int values.
What behavior are you looking for? From my perspective, if allow_decimal
is False
, wouldn't that effectively make this function is_valid_integer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so , here we need 2 functions instead of 1 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think is_valid_integer
should be distinct from is_valid_fixed_point
. And ultimately we should just have is_valid_element
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on what you have discussed, I think we should do this :
@jrhemstad
this pr is for
unique_ptr<column> is_valid_integer(strings_column_view strings, data_type type);
and , i will file a new pr for
unique_ptr<column> is_valid_fixed_point(strings_column_view strings, data_type type);
@revans2
above api contains parameter data_type
, that will help us to check overflow.
while, for parameter allow_decimal
requirements you mentioned in #7080 , we can judge allow_decimal
in spark-rapids, if allow_decimal
is true, through JNI layer , we finally call is_valid_fixed_point
in cudf cpp module, if allow_decimal
is false , we finally call is_valid_integer
in cudf cpp module, is it right ? in this way , allow_decimal
parameter will not enter cudf.
please give me some advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For CUDF JNI I would rather mirror the underlying C++ API as much as possible. The Spark plugin can decide which CUDF API to call instead of pushing that into cudf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrhemstad I'm not sure is_valid_element
will do what we want. We are asking for two separate things, and perhaps it should be two separate API calls then. First we want to check if the format of the string matches what we expect, next we want to check if when we parse the string as an integer if the number will overflow. In the current set of APIs is_valid_fixed_point
and is_valid_integer
the name of the function indicates the format of the string while the type we pass in, indicates the overflow value. If we just have is_valid_element
then we would need that same split in roles. Does the format match fixed point and we don't overflow parsing to a byte? If that is too much for a single API to do, then perhaps what we want is to add is_fixed_point
along with is_integer
and is_float
, and also add an API to check if a strings column would overflow when parsing to an integer of type X.
|
||
// ready a min_value corresponds to the input type in order to check overflow | ||
long d_min_value = 0; | ||
switch (input_type.id()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be coded using the type_dispatcher
instead of a switch statement. Also this should use the std::numeric_limits
class instead of hardcoded values.
// This is the case when we've encountered a decimal separator. The fractional | ||
// part will not change the number, but we will verify that the fractional part | ||
// is well formed. | ||
while (offset <= end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in order to verify the fractional part whether is well formed. if allow_decimal is true and has character '.' , it will break from last while loop , so here we need to verify the fractional part after the decimal point.
++offset; | ||
// We allow decimals and will return a truncated integral in that case. | ||
// Therefore we won't throw an exception here (checking the fractional | ||
// part happens below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// part happens below. | |
// part happens below). |
@@ -0,0 +1,173 @@ | |||
/* | |||
* Copyright (c) 2019-2020, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this is a new file so it does not need the 2019-
part.
/** | ||
* @brief Returns a boolean column identifying strings in which all | ||
* characters are valid for conversion to integers. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description should include how the values are the decimal (if allowed) must all be valid decimal characters.
Also, the description should mention how the overflow depends on the input_type
.
int offset = 0; | ||
size_type bytes = d_str.size_bytes(); | ||
const char* data = d_str.data(); | ||
while (offset < bytes && data[offset] == ' ') ++offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cudf::strings::to_integers()
does not ignore the leading whitespace.
If this is used to validate a strings column before calling to_integers()
then this should not ignore whitespace either.
@@ -14,10 +14,14 @@ | |||
* limitations under the License. | |||
*/ | |||
|
|||
#include <cudf/strings/convert/is_valid_element.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest moving these to a new test source file /cpp/tests/strings/valid_element.cpp
namespace strings { | ||
|
||
/** | ||
* Check whether the UTF8String is valid when convert data from string to all kinds of integers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Check whether the UTF8String is valid when convert data from string to all kinds of integers, | |
* Check whether the string is valid when convert string to signed integers |
* like INT8/16/32/64. for example, if allow_decimal is true, this will return `true, true` when input string | ||
* is `1.23, 123`, or this function will return `false, true`, it means firt element is invalid, | ||
* while second data is valid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* like INT8/16/32/64. for example, if allow_decimal is true, this will return `true, true` when input string | |
* is `1.23, 123`, or this function will return `false, true`, it means firt element is invalid, | |
* while second data is valid. | |
* like INT8/16/32/64. For example, if `allow_decimal` is true, then strings `['1.23', '123']` will return `[true, true]`. | |
* If `allow_decimal` is false, then this function will return `[false, true]`. |
#include <strings/utilities.cuh> | ||
#include <strings/utilities.hpp> | ||
#include <strings/utilities.cuh> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <strings/utilities.cuh> | |
#include <strings/utilities.hpp> | |
#include <strings/utilities.cuh> |
These are not needed.
#include <cudf/column/column_factories.hpp> | ||
#include <cudf/detail/null_mask.hpp> | ||
#include <cudf/detail/nvtx/ranges.hpp> | ||
#include <cudf/strings/detail/utilities.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <cudf/strings/detail/utilities.hpp> |
This header is not required.
} | ||
|
||
// We are going to process the new digit and accumulate the result. However, | ||
// before doing this, if the result is already smaller than the stopValue which is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// before doing this, if the result is already smaller than the stopValue which is | |
// before doing this, if the result is already smaller than the stop_value which is |
// We are going to process the new digit and accumulate the result. However, | ||
// before doing this, if the result is already smaller than the stopValue which is | ||
// (std::numeric_limits<data_type>::min() / radix), then result * 10 will definitely | ||
// be smaller than minValue, and we can stop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// be smaller than minValue, and we can stop. | |
// be smaller than the min value, and we can stop. |
* @brief The dispatch functions for calculate the min value of input data type | ||
* to check overflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief The dispatch functions for calculate the min value of input data type | |
* to check overflow. | |
* @brief The dispatch functions returns the min value of the input data type used | |
* for checking overflow. |
* @brief The dispatch functions for calculate the min value of input data type | ||
* to check overflow. | ||
* | ||
* The output is the min value of spicified type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The output is the min value of spicified type. | |
* The output is the min value of specified type. |
namespace detail { | ||
namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move these two lines up so that it includes the is_valid_element
internal device function.
Reviewers, please make sure the title and description of this PR is made more descriptive before merging. |
{ | ||
// allow_decimal = true | ||
cudf::test::strings_column_wrapper strings1( | ||
{"+175", "-34", "9.8", "17+2", "+-14", "1234567890", "67de", "", "1e10", "-", "++", "", "21474836482222"}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check for too large negative number too?
namespace detail { | ||
namespace { | ||
/** | ||
* Check whether the string is valid when convert string to signed integers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing an @brief
tag.
* input_type is used to check whether the data overflows, for example, if input_type is | ||
* `int8_t` and input string data is `128`, then it will return false ,because it out of ranges | ||
* [-128, 127] and overflows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* input_type is used to check whether the data overflows, for example, if input_type is | |
* `int8_t` and input string data is `128`, then it will return false ,because it out of ranges | |
* [-128, 127] and overflows. | |
* `input_type` is used to check whether the data causes an integer overflow. For example, if `input_type` is | |
* `INT8` and the input `strings[i]` is "128", then the `output[i]` will be `false` since resulting integer | |
* would be out of range `[-128, 127]` for `int8_t`. |
Let's move this to 0.19 since it needs significant work. |
@chenrui17 you will need to click "edit" at the top and switch the target branch to branch-0.19, and then merge branch-0.19 into your local PR branch before proceeding. Thanks! |
How this is going? I was suggested to take over this if necessary. |
So |
@davidwendt |
Sorry, I don't understand why we need to change |
As you said, decimal point is always allowed. |
|
Correct. Please also see here: #7557 |
Hi @chenrui17. |
…eger conversion (#7642) This PR addresses #5110, #7080, and rework #7094. It adds the function `cudf::strings::is_integer` that can check if strings can be correctly converted into integer values. Underflow and overflow are also taken into account. Note that this `cudf::strings::is_integer` is different from the existing `cudf::strings::string::is_integer`, which only checks for pattern and does not care about under/overflow. Examples: ``` s = { "eee", "-200", "-100", "127", "128", "1.5", NULL} is_integer(s, INT8) = { 0, 0, 1, 1, 0, 0, NULL} is_integer(s, INT32) = { 0, 1, 1, 1, 1, 0, NULL} ``` Authors: - Nghia Truong (@ttnghia) Approvers: - David (@davidwendt) - Jake Hemstad (@jrhemstad) - Mark Harris (@harrism) URL: #7642
@revans2 this is my first commit , in order to close #7080 , i just finished the allow_decimal part , there are still overflow checks that have not been completed , please give me some suggestions about this.
In addition , this feature helps me improve my spark query performance 30%~50%, so about how to checks overflow, please give me some idea , Thank you very much.