Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add is_valid_integer format check API #7094

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 47 additions & 0 deletions cpp/include/cudf/strings/convert/is_valid_fixed_point.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyright needs to be 2021, and I am not sure you want to give the copyright to NVIDIA for the code that you have written. You should probably include the copyright for the company you work for instead (NOTE: I am not a lawyer)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed here to my company copyright , i don't know if this is appropriate , Are all the other contributors from NVIDIA ?What did they do

*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/column/column.hpp>

namespace cudf {
namespace strings {
/**
* @addtogroup strings_convert
* @{
* @file
*/

/**
* @brief Returns a boolean column identifying strings in which all
* characters are valid for conversion to integers.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description should include how the values are the decimal (if allowed) must all be valid decimal characters.
Also, the description should mention how the overflow depends on the input_type.

* @param strings Strings instance for this operation.
* @param allow_decimal identification whether the format is decimal or not
* @param input_type input data type for check overflow
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New column of boolean results for each string.
*/
std::unique_ptr<column> is_valid_fixed_point(
strings_column_view const& strings,
bool allow_decimal,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the purpose of this parameter anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is [0-9] vs [0-9](\.[0-9])?. if allow_decimal is true then we allow the number to be followed by a decimal point followed and more numbers. If it is false we do not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that much, I just don't understand why this is a desired option. In my mind this would be like having an allow_decimal flag to the is_floating_point API---both strike me as odd.

bool flags are always a code smell for me as they are warning sign of a single responsibility violation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to match what Spark and Hive do when casting strings to int values. Floating point allows for NaN, -Inf, etc that I guess we could special case yet again, but then the overflow wold not be properly checks, which is another feature that we are looking for.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to match what Spark and Hive do when casting strings to int values.

What behavior are you looking for? From my perspective, if allow_decimal is False, wouldn't that effectively make this function is_valid_integer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so , here we need 2 functions instead of 1 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think is_valid_integer should be distinct from is_valid_fixed_point. And ultimately we should just have is_valid_element.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what you have discussed, I think we should do this :
@jrhemstad
this pr is for

unique_ptr<column> is_valid_integer(strings_column_view strings, data_type type);

and , i will file a new pr for

unique_ptr<column> is_valid_fixed_point(strings_column_view strings, data_type type);

@revans2
above api contains parameter data_type , that will help us to check overflow.

while, for parameter allow_decimal requirements you mentioned in #7080 , we can judge allow_decimal in spark-rapids, if allow_decimal is true, through JNI layer , we finally call is_valid_fixed_point in cudf cpp module, if allow_decimal is false , we finally call is_valid_integer in cudf cpp module, is it right ? in this way , allow_decimal parameter will not enter cudf.

please give me some advice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CUDF JNI I would rather mirror the underlying C++ API as much as possible. The Spark plugin can decide which CUDF API to call instead of pushing that into cudf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad I'm not sure is_valid_element will do what we want. We are asking for two separate things, and perhaps it should be two separate API calls then. First we want to check if the format of the string matches what we expect, next we want to check if when we parse the string as an integer if the number will overflow. In the current set of APIs is_valid_fixed_point and is_valid_integer the name of the function indicates the format of the string while the type we pass in, indicates the overflow value. If we just have is_valid_element then we would need that same split in roles. Does the format match fixed point and we don't overflow parsing to a byte? If that is too much for a single API to do, then perhaps what we want is to add is_fixed_point along with is_integer and is_float, and also add an API to check if a strings column would overflow when parsing to an integer of type X.

data_type input_type,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf

173 changes: 173 additions & 0 deletions cpp/src/strings/convert/is_valid_fixed_point.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same copyright issue. Should be 2021 and probably want to not have it be for NVIDIA unless you work for NVIDIA (again not a lawyer)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is a new file so it does not need the 2019- part.

*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <cudf/column/column.hpp>
#include <cudf/column/column_device_view.cuh>
#include <cudf/column/column_factories.hpp>
#include <cudf/detail/null_mask.hpp>
#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/strings/detail/utilities.hpp>
#include <cudf/strings/string_view.cuh>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>
#include <strings/utilities.cuh>
#include <strings/utilities.hpp>

#include <thrust/logical.h>

namespace cudf {
namespace strings {

/**
* Parses this UTF8String(trimmed if needed) to INT8/16/32/64...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit indentation in the doxegen comments appears to be off.

*
* Note that, in this method we accumulate the result in negative format, and convert it to
* positive format at the end, if this string is not started with '-'. This is because min value
* is bigger than max value in digits, e.g. Long.MAX_VALUE is '9223372036854775807' and
* Long.MIN_VALUE is '-9223372036854775808'.
*
* This code is mostly copied from LazyLong.parseLong in Hive.
*
* @param d_str String to check.
* @param allow_decimal Decimal format or not
* @param min_value min_value that corresponds to the type that is checking
* @return true if string has valid integer characters
*/
__device__ bool is_valid_fixed_point(string_view const& d_str, bool allow_decimal, long min_value)
{
int offset = 0;
size_type bytes = d_str.size_bytes();
const char* data = d_str.data();
while (offset < bytes && data[offset] == ' ') ++offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support stripping leading white space? I understand why we might want to do it, but if we do then we need to have it documented for this function and documented for the public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cudf::strings::to_integers() does not ignore the leading whitespace.
If this is used to validate a strings column before calling to_integers() then this should not ignore whitespace either.

if (offset == bytes) return false;

int end = bytes - 1;
while (end > offset && data[end] == ' ') --end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Do we want to strip trailing white space (Specifically only the space character?) If so that is fine, but we need to be sure that it is documented clearly.


char c_sign = data[offset];
const bool negative = c_sign == '-';
if (negative || c_sign == '+'){
if (end - offset == 0) return false;
++offset;
}

const char separator = '.';
const int radix = 10;
const long stop_value = min_value / radix;
long result = 0;

while (offset <= end) {
const char c = data[offset];
++offset;
// We allow decimals and will return a truncated integral in that case.
// Therefore we won't throw an exception here (checking the fractional
// part happens below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// part happens below.
// part happens below).

if (c == separator && allow_decimal) break;

int digit;
if (c >= '0' && c <= '9'){
digit = c - '0';
} else {
return false;
}

// We are going to process the new digit and accumulate the result. However, before doing
// this, if the result is already smaller than the stopValue(Long.MIN_VALUE / radix), then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit Long.MIN_VALUE is no longer correct for this comment.

// result * 10 will definitely be smaller than minValue, and we can stop.
if (result < stop_value) return false;

result = result * radix - digit;

// Since the previous result is less than or equal to stopValue(Long.MIN_VALUE / radix), we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Same here Long.MIN_VALUE should probably be updated in the comment.

// can just use `result > 0` to check overflow. If result overflows, we should stop.
if (result > 0) return false;
}
// This is the case when we've encountered a decimal separator. The fractional
// part will not change the number, but we will verify that the fractional part
// is well formed.
while (offset <= end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this?

Copy link
Contributor Author

@chenrui17 chenrui17 Jan 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to verify the fractional part whether is well formed. if allow_decimal is true and has character '.' , it will break from last while loop , so here we need to verify the fractional part after the decimal point.

char currentByte = data[offset];
if (currentByte < '0' || currentByte > '9') return false;
++offset;
}

if (!negative) {
result = -result;
if (result < 0) return false;
}

return true;
}

namespace detail {

std::unique_ptr<column> is_valid_fixed_point(
strings_column_view const& strings,
bool allow_decimal,
data_type input_type,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
{
auto strings_column = column_device_view::create(strings.parent(), stream);
auto d_column = *strings_column;
auto d_allow_decimal = allow_decimal;

// ready a min_value corresponds to the input type in order to check overflow
long d_min_value = 0;
switch (input_type.id()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be coded using the type_dispatcher instead of a switch statement. Also this should use the std::numeric_limits class instead of hardcoded values.

case type_id::INT8: d_min_value = -128;
case type_id::INT16: d_min_value = -32768;
case type_id::INT32: d_min_value = -2147483648;
case type_id::INT64: d_min_value = -9223372036854775808;
default: CUDF_FAIL("Unsupported current data type check when convert string type");
}

// create output column
auto results = make_numeric_column(data_type{type_id::BOOL8},
strings.size(),
cudf::detail::copy_bitmask(strings.parent(), stream, mr),
strings.null_count(),
stream,
mr);
auto d_results = results->mutable_view().data<bool>();
thrust::transform(rmm::exec_policy(stream),
thrust::make_counting_iterator<size_type>(0),
thrust::make_counting_iterator<size_type>(strings.size()),
d_results,
[d_column,d_allow_decimal,d_min_value] __device__(size_type idx) {
if (d_column.is_null(idx)) return false;
return strings::is_valid_fixed_point(d_column.element<string_view>(idx), d_allow_decimal, d_min_value);
});
results->set_null_count(strings.null_count());
return results;
}

} // namespace detail

// external API

std::unique_ptr<column> is_valid_fixed_point(strings_column_view const& strings,
bool allow_decimal,
data_type input_type,
rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
return detail::is_valid_fixed_point(strings, allow_decimal, input_type, rmm::cuda_stream_default, mr);
}

} // namespace strings
} // namespace cudf

16 changes: 16 additions & 0 deletions cpp/tests/strings/chars_types_tests.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -243,6 +243,21 @@ TEST_F(StringsCharsTest, Integers)
EXPECT_TRUE(cudf::strings::all_integer(cudf::strings_column_view(strings2)));
}

TEST_F(StringsCharsTest, ValidIntegers)
{
cudf::test::strings_column_wrapper strings1(
{"+175", "-34", "9.8", "17+2", "+-14", "1234567890", "67de", "", "1e10", "-", "++", ""});
auto results = cudf::strings::is_valid_fixed_point(cudf::strings_column_view(strings1), true, data_type::INT64);
cudf::test::fixed_width_column_wrapper<bool> expected1({1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(*results, expected1);

cudf::test::strings_column_wrapper strings2(
{"0", "+0", "-0", "1234567890", "-27341132", "+012", "023", "-045", "-1.1", "+1000.1"});
results = cudf::strings::is_valid_fixed_point(cudf::strings_column_view(strings2), true, data_type::INT64);
cudf::test::fixed_width_column_wrapper<bool> expected2({1, 1, 1, 1, 1, 1, 1, 1, 1, 1});
CUDF_TEST_EXPECT_COLUMNS_EQUAL(*results, expected2);
}

TEST_F(StringsCharsTest, Floats)
{
cudf::test::strings_column_wrapper strings1({"+175",
Expand Down Expand Up @@ -390,3 +405,4 @@ TEST_F(StringsCharsTest, EmptyStringsColumn)
EXPECT_EQ(cudf::type_id::STRING, results->view().type().id());
EXPECT_EQ(0, results->view().size());
}