-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add decimal support to integration tester #361
Conversation
77f9263
to
8693900
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #361 +/- ##
==========================================
- Coverage 88.19% 87.67% -0.52%
==========================================
Files 74 75 +1
Lines 12212 12940 +728
==========================================
+ Hits 10770 11345 +575
- Misses 1442 1595 +153 ☔ View full report in Codecov by Sentry. |
Very cool - this would be very helpful for arrow-adbc. Is this expected to work for Decimal256 as well? |
Yes! It would probably be more useful for ADBC if it handled the decimal point and a trailing exponent, but that should be fairly straightforward to add after the integer version. |
The postgres implementation in ADBC currently sends the string and decimal placement separately, so this would be helpful for that without going all the way |
src/nanoarrow/utils.c
Outdated
100000ULL, 1000000ULL, 10000000ULL, 100000000ULL, 1000000000ULL}; | ||
|
||
// Adapted from Arrow C++ to use 32-bit words for better C portability | ||
// https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/decimal.cc#L524-L544 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a git hash in place of main
if you'd like. (Press 'y' after opening the above link)
It also might be nice to copy the comment over as well if applicable.
src/nanoarrow/nanoarrow.h
Outdated
/// \brief Sets the integer value of an ArrowDecimal from a string | ||
ArrowErrorCode ArrowDecimalSetIntString(struct ArrowDecimal* decimal, | ||
struct ArrowStringView value); | ||
|
||
/// \brief Get the integer value of an ArrowDecimal as string | ||
ArrowErrorCode ArrowDecimalAppendIntStringToBuffer(const struct ArrowDecimal* decimal, | ||
struct ArrowBuffer* buffer); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the function names are a bit confusing. Maybe something like ArrowDecimalSetFromString
? I think the IntString
terminology is the part that is hard for me to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's not great! I think the reason I didn't pick ArrowDecimalSetFromString()
is because to me that would parse the decimal point and this function won't do that. It will also error very quickly if anybody tries to put a decimal point there and I'm game to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It floated in last night that Digits
is probably a better term. I pushed that change but feel free to suggest something else!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better, I love this change!
if (c < '0' || c > '9') { | ||
return EINVAL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be an unhelpful guess at micro-optimization.. but if you are trying to avoid branching, you can set a bitmask here and check for failure outside of the loop. e.g. is_invalid &= c < '0' || c > '9';
. Maybe the compiler would optimize this for us? I wouldn't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow...I wasn't intending to optimize anything, just to make sure we had a valid integer string before proceeding. The other option is to pass end_ptr
instead of NULL
into strtoll()
and check that it parsed the entire chunk of characters. Maybe that would be less confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore me then! This code is already very readable IMO. Other parts of the diff look like they are specifically written to avoid branching in loops (e.g. if
statements), so I wasn't sure if that was an active decision.
|
||
// Use 32-bit words for portability | ||
uint32_t words32[8]; | ||
int n_words32 = decimal->n_words * 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert that this is <= 8?
memset(segments, 0, sizeof(segments)); | ||
uint64_t* most_significant_elem = words_little_endian + most_significant_elem_idx; | ||
|
||
do { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should comment on what this is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
src/nanoarrow/utils.c
Outdated
|
||
// We know our output has no more than 9 digits per segment, plus a negative sign, | ||
// plus any further digits between our output of 9 digits and the maximum theoretical | ||
// number of digits in an unsigned long, including the null terminator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what the maximum theoretical number of digits has to do with this, can you explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to make sure that the 21
in snprintf(, 21, "%lu")
is always accurate (I added this to the comment).
int n_chars = snprintf((char*)buffer->data + buffer->size_bytes, 21, "%09lu", | ||
(unsigned long)segments[i]); | ||
buffer->size_bytes += n_chars; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you assert that buffer->size_bytes
is not longer than the size allocated above?
|
||
for (auto bitwidth : {128, 256}) { | ||
ArrowDecimalInit(&decimal, bitwidth, 10, 3); | ||
ArrowDecimalSetInt(&decimal, 12345); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test with larger numbers?
src/nanoarrow/utils_test.cc
Outdated
EXPECT_EQ(std::string(reinterpret_cast<char*>(buffer.data), buffer.size_bytes), | ||
"18446744073709551615"); | ||
|
||
// Check with the maximum value of a 128-bit integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also have a similar test for 256 bits?
Overall LGTM! |
This PR adds decimal support to the integration test utility. Because decimal buffers are implemented in the integration test JSON format as strings containing the integer representation of the decimal, it meant that nanoarrow needed an implementation of arbitrarily large integer to/from string. I modified this from Arrow C++ (links in comments next to the implementation) with a few differences to avoid porting the complete int128 implementation and the C++ standard library.
The gaps in test coverage are from big-endian parts, which I are tested as part of weekly verification and that I tested locally with:
export NANOARROW_ARCH=s390x docker compose run --rm verify
With
archery integration --with-cpp=true --with-nanoarrow=true --run-c-data
, the decimal tests now pass: