This repository has been archived by the owner on May 6, 2024. It is now read-only.

[POAE7-2544] String LOWER/UPPER op support for Arrow format #147

Merged
merged 8 commits into from
Nov 11, 2022

Contributor

@YBRua YBRua commented Nov 8, 2022

What changes were proposed in this pull request?

Updated support for String LOWER() and UPPER() functions.

  1. Enabled existing code for string upper and lower ops. Also added some minor patches to Analyzer and RelAlg EU.
  2. Isthmus does not support translating LOWER or UPPER to Substrait plans. So plan JSON files were created manually according to observations and Substrait docs. JSON query plans are generated with a modified local fork of substrait-java that supports LOWER and UPPER.
  3. Increased the number of generated rows in the CiderStringTest classes from 30 to 50 so that the test data covers lowercase letters.

; IR SNIPPET of SQL Query: SELECT LOWER(col_2) FROM test;
groupby_nullcheck_true:                           ; preds = %filter_true
  ; extract input strings from arrow array
  %4 = call i8* @cider_ColDecoder_extractArrowBuffersAt(i8* %col_buf1, i64 0)  ; null buffer
  %5 = call i8* @cider_ColDecoder_extractArrowBuffersAt(i8* %col_buf1, i64 1)  ; offset buffer
  %6 = call i8* @cider_ColDecoder_extractArrowBuffersAt(i8* %col_buf1, i64 2)  ; data buffer
  %7 = call i8* @extract_str_ptr_arrow(i8* %6, i8* %5, i64 %pos)
  %8 = call i32 @extract_str_len_arrow(i8* %5, i64 %pos)
  %9 = call i1 @check_bit_vector_clear(i8* %4, i64 %pos)

  ; apply string op function && decode results
  %10 = call i64 @apply_string_ops_and_encode_cider_nullable(
                   i8* %7, i32 %8, i64 93825094087552, i64 93825052641744, i1 %9)
  %11 = call i8* @cider_hasher_decode_str_ptr(i64 %10, i64 93825052641744)
  %12 = call i32 @cider_hasher_decode_str_len(i64 %10, i64 93825052641744)

  ; store results to output buffers
  %13 = getelementptr i64, i64* %group_by_buff, i32 0
  %14 = load i64, i64* %13
  %15 = inttoptr i64 %14 to i8*
  call void @reallocate_string_buffer_if_need(i8* %15, i64 %1)
  %16 = call i8* @cider_ColDecoder_extractArrowBuffersAt(i8* %15, i64 2)
  %17 = call i8* @cider_ColDecoder_extractArrowBuffersAt(i8* %15, i64 1)
  %18 = call i8* @cider_ColDecoder_extractArrowBuffersAt(i8* %15, i64 0)
  call void @cider_agg_id_proj_string_nullable(
              i8* %16, i8* %17, i64 %1, i8* %11, i32 %12, i8* %18, i1 %9)
  br label %filter_false
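
For readers unfamiliar with the Arrow string layout that the IR above walks through (buffer 0 = validity bitmap, buffer 1 = offsets, buffer 2 = character data), here is a minimal C++ sketch of the same per-row steps: null check, pointer/length extraction, then the string op. `ArrowStringColumn`, `isNull`, `stringAt`, and `lowerAt` are hypothetical names for illustration only and are not the Cider runtime functions called in the IR. The sketch assumes 32-bit offsets and ASCII-only case conversion.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical read-only view over the three buffers of an Arrow utf8 array.
struct ArrowStringColumn {
  const uint8_t* validity;  // buffer 0: validity bitmap (bit set = valid); nullptr means no nulls
  const int32_t* offsets;   // buffer 1: length + 1 offsets into the data buffer
  const char* data;         // buffer 2: concatenated string bytes
  int64_t length;           // number of rows
};

// A row is null when its validity bit is clear (Arrow uses LSB-first bit order).
inline bool isNull(const ArrowStringColumn& col, int64_t pos) {
  if (col.validity == nullptr) {
    return false;
  }
  return ((col.validity[pos / 8] >> (pos % 8)) & 1) == 0;
}

// Row `pos` spans data[offsets[pos], offsets[pos + 1]).
inline std::string_view stringAt(const ArrowStringColumn& col, int64_t pos) {
  assert(pos >= 0 && pos < col.length);
  const int32_t begin = col.offsets[pos];
  const int32_t end = col.offsets[pos + 1];
  return std::string_view(col.data + begin, static_cast<std::size_t>(end - begin));
}

// ASCII-only LOWER for one row; a null input yields a null output.
inline std::optional<std::string> lowerAt(const ArrowStringColumn& col, int64_t pos) {
  if (isNull(col, pos)) {
    return std::nullopt;
  }
  std::string out(stringAt(col, pos));
  for (char& c : out) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return out;
}
```

The generated code differs in that it routes the result through an encode/decode step and writes into an output Arrow buffer; the sketch only mirrors how a single input row is read and transformed.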

Why are the changes needed?

Supporting string ops for Arrow format.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UTs.

Which label does this PR belong to?

spevenhe
spevenhe previously approved these changes Nov 9, 2022
Contributor

@spevenhe spevenhe left a comment

LGTM

@yma11
Contributor

yma11 commented Nov 9, 2022

For unsupported functions such as lower, upper, etc., can you check whether they are supported in the latest substrait-java? If not, we can submit a PR.

@YBRua
Contributor Author

YBRua commented Nov 9, 2022

For unsupported functions such as lower, upper, etc., can you check whether they are supported in the latest substrait-java? If not, we can submit a PR.

Yes. @jikunshang also mentioned this yesterday. I'll take a look.

"stringop_lower_literal_null.json");

// SELECT UPPER(column) FROM table
assertQueryArrow("SELECT col_2, UPPER(col_2) FROM test;", "stringop_upper_null.json");
Contributor

I think we can add a case like SELECT * FROM test WHERE UPPER(col_2) = 'AAA' once #144 is merged.

Contributor Author

added

@@ -37,7 +37,7 @@ bool getExprUpdatable(std::unordered_map<std::shared_ptr<Analyzer::Expr>, bool>
 }

 bool isStringFunction(std::string function_name) {
-  std::unordered_set<std::string> supportedStrFunctionSet{"substring"};
+  std::unordered_set<std::string> supportedStrFunctionSet{"substring", "lower", "upper"};
Contributor

Please declare the input argument as const std::string& name.
Also, since supportedStrFunctionSet holds only three names, the function could be made faster by writing:

std::string_view names[] = {"substring", "lower", "upper"};
return std::find(std::cbegin(names), std::cend(names), name) != std::cend(names);

or

std::string_view names[] = {"substring", "lower", "upper"};
// std::size gives the element count; sizeof(names) would give the size in bytes.
for (size_t i = 0; i < std::size(names); ++i) {
  if (names[i] == name) {
    return true;
  }
}
return false;

Contributor Author

@YBRua YBRua Nov 10, 2022

Thanks for the optimization advice!

  1. pass-by-const-ref has been added.

  2. Currently there are only 3 names, but more names are to be added in the future as more string ops are supported. I think leaving this part as is would be fine?
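
For reference, the two options discussed in this thread can be sketched side by side as below. This is an illustration only, not the merged code: the function names `isStringFunctionLinear` and `isStringFunctionSet` are hypothetical, and the sketch assumes C++17. The linear scan avoids hashing and heap allocation for a handful of names, while the set keeps lookups constant-time on average as more string ops are added. Making the set `static` avoids rebuilding it on every call.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <string_view>
#include <unordered_set>

// Option A (reviewer's suggestion): linear scan over a small constant array.
inline bool isStringFunctionLinear(const std::string& name) {
  static constexpr std::array<std::string_view, 3> kNames{"substring", "lower", "upper"};
  return std::find(kNames.begin(), kNames.end(), std::string_view(name)) != kNames.end();
}

// Option B (close to the PR's version): hash set, built once on first call.
inline bool isStringFunctionSet(const std::string& name) {
  static const std::unordered_set<std::string> kNames{"substring", "lower", "upper"};
  return kNames.count(name) > 0;
}
```

Both variants are case-sensitive, so callers would need to pass the function name in lowercase.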

@yma11 yma11 merged commit 715317b into intel:main Nov 11, 2022