Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement initial support for avro logical types (#6482) #12788

Merged

Conversation

tpn
Copy link
Contributor

@tpn tpn commented Feb 16, 2023

Description

This includes the date type. Scaffolding has been put in place to handle the other logical types. This closes #6482.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@tpn tpn requested review from a team as code owners February 16, 2023 04:35
@rapids-bot
Copy link

rapids-bot bot commented Feb 16, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Feb 16, 2023
@tpn
Copy link
Contributor Author

tpn commented Feb 16, 2023

Looks like I don't have permission to edit the labels for this PR, maybe because I created it off an issue that I didn't raise?

gh pr edit 12788 --add-label "feature request,non-breaking"
Message: tpn does not have the correct permissions to execute `AddLabelsToLabelable`, Locations: [{Line:1 Column:54}]

@kkraus14 kkraus14 added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Feb 16, 2023
@vyasr
Copy link
Contributor

vyasr commented Feb 17, 2023

/ok to test

@vyasr
Copy link
Contributor

vyasr commented Feb 17, 2023

Thanks for the contribution @tpn! @vuule and @galipremsagar are probably the best people to answer the open questions here.

@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch from 71de769 to 80f28de Compare February 17, 2023 04:16
@tpn
Copy link
Contributor Author

tpn commented Feb 17, 2023

Ignore the 80f28de force-push, I'm about to squash and remove the time-millis/micros stuff.

@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch 2 times, most recently from cb82e82 to 41790ff Compare February 21, 2023 16:06
@tpn
Copy link
Contributor Author

tpn commented Feb 21, 2023

@vyasr can I get another /ok to test? Thanks! (I can't do this, right?)

And friendly ping to @vuule & @galipremsagar for review. I'd like to get this PR merged so I can tackle a couple of follow-up avro items. Thanks!

@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch from 41790ff to 397ba8b Compare February 21, 2023 16:13
@vuule
Copy link
Contributor

vuule commented Feb 21, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Feb 21, 2023

@tpn I'm OOTO this week, will review as soon as I can next week.
Apologies for the delay.

@tpn
Copy link
Contributor Author

tpn commented Feb 21, 2023

@tpn I'm OOTO this week, will review as soon as I can next week.
Apologies for the delay.

No problem, thanks for the update.

Comment on lines 316 to 343
dates = [
datetime.date(1970, 1, 1),
datetime.date(1970, 1, 2),
datetime.date(1981, 10, 25),
datetime.date(2012, 5, 18),
datetime.date(2019, 9, 3),
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would maybe be good to have some nulls mixed into a variant of this test as well to ensure that we properly handle the situations where we get a UNION type with nulls and dates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed a change that rewrites the test to test nulls where applicable. Can you take a look? Thanks!

@tpn
Copy link
Contributor Author

tpn commented Feb 21, 2023 via email

@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch from 397ba8b to e83f52e Compare March 1, 2023 22:53
@tpn
Copy link
Contributor Author

tpn commented Mar 1, 2023

Friendly ping to @vuule for another /ok to test and PR review.

I just pushed a rebase + some test tweaks that @kkraus14 suggested.

@vuule
Copy link
Contributor

vuule commented Mar 10, 2023

@tpn looks like the Java failure is related to this PR, not a random issue

@davidwendt
Copy link
Contributor

davidwendt commented Mar 10, 2023

@tpn
Copy link
Contributor Author

tpn commented Mar 10, 2023

@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch from 326b171 to 1731f4f Compare March 10, 2023 22:24
@tpn
Copy link
Contributor Author

tpn commented Mar 10, 2023

Yeah it looks like either the Java tests or the Java avro test file will need some tweaking. Still investigating, but finishing up for the week shortly. I'll pick it up first thing next week.

@tpn
Copy link
Contributor Author

tpn commented Mar 14, 2023

I think I need some help with the Java test failures. If I run df = cudf.read_avro('alltypes_plain.avro') on a fresh cudf env vs my branch, I get identical dataframes, dtypes, everything. The only thing that stands out about the schema of the test avro file is the logicalType: 'timestamp-micros', but that type isn't supported in either the existing main or my branch:

[ins] In [22]: pprint(schema)
{'fields': [{'name': 'id', 'type': ['int', 'null']},
            {'name': 'bool_col', 'type': ['boolean', 'null']},
            {'name': 'tinyint_col', 'type': ['int', 'null']},
            {'name': 'smallint_col', 'type': ['int', 'null']},
            {'name': 'int_col', 'type': ['int', 'null']},
            {'name': 'bigint_col', 'type': ['long', 'null']},
            {'name': 'float_col', 'type': ['float', 'null']},
            {'name': 'double_col', 'type': ['double', 'null']},
            {'name': 'date_string_col', 'type': ['bytes', 'null']},
            {'name': 'string_col', 'type': ['bytes', 'null']},
            {'name': 'timestamp_col',
             'type': [{'logicalType': 'timestamp-micros', 'type': 'long'},
                      'null']}],
 'name': 'topLevelRecord',
 'type': 'record'}

I'm still trying to get a local Java env set up -- the cudfjar helper in build.sh doesn't appear to work (appears to be using obsolete nvidia-docker invocations?), and an explicit maven build appears to be failing with some bizarre codehaus error:

org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.gmaven:gmaven-plugin:1.5:execute (setproperty) on project cudf: Execution setproperty of goal org.codehaus.gmaven:gmaven-plugin:1.5:execute failed: A required class was missing while executing org.codehaus.gmaven:gmaven-plugin:1.5:execute: Could not initialize class org.codehaus.groovy.runtime.InvokerHelper

@davidwendt
Copy link
Contributor

Try running mvn test -B -DCUDF_JNI_ARROW_STATIC=OFF -DCUDF_JNI_ENABLE_PROFILING=OFF from your cpp/java directory perhaps. I have a local conda environment where I do builds on my machine and this worked for me.

Reference:

mvn test -B -DCUDF_JNI_ARROW_STATIC=OFF -DCUDF_JNI_ENABLE_PROFILING=OFF

@tpn
Copy link
Contributor Author

tpn commented Mar 14, 2023

Try running mvn test -B -DCUDF_JNI_ARROW_STATIC=OFF -DCUDF_JNI_ENABLE_PROFILING=OFF from your cpp/java directory perhaps. I have a local conda environment where I do builds on my machine and this worked for me.

Reference:

mvn test -B -DCUDF_JNI_ARROW_STATIC=OFF -DCUDF_JNI_ENABLE_PROFILING=OFF

Hey @davidwendt, yeah, that didn't seem to work either. Fails with the exact same codehaus error. I've just done a build.sh clean and am trying: export EXTRA_CMAKE_ARGS="-DCUDF_USE_ARROW_STATIC=ON -DCUDF_ENABLE_ARROW_S3=OFF"; bash build.sh -g -v libcudf to see if that yields something that maven will consume.

Don't suppose you're able to share your local conda env + build.sh invocation? Feel free to email if that's easier: trent at trent dot me.

@vuule
Copy link
Contributor

vuule commented Mar 14, 2023

/ok to test

@davidwendt
Copy link
Contributor

I don't think I will be able to help you with the Java debugging. However, I was able to recreate the error with a C++ test. Here is the source:

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_utilities.hpp>
#include <cudf_test/cudf_gtest.hpp>
#include <cudf/io/avro.hpp>
#include <cudf/io/datasource.hpp>
#include <string>
#include <vector>

struct AvroTest : public cudf::test::BaseFixture {};
TEST_F(AvroTest, TestFile)
{
  std::string filepath = "/cudf/java/src/test/resources/alltypes_plain.avro";
  std::vector<std::string> names({"bool_col", "int_col", "timestamp_col"});
  auto in_opts = cudf::io::avro_reader_options::builder(cudf::io::source_info{filepath})
                   .columns(names).build();
  auto result     = cudf::io::read_avro(in_opts);
  const auto view = result.tbl->view();
  for (auto col : view) {
    cudf::test::print(col);
  }
}
CUDF_TEST_PROGRAM_MAIN()

It is odd we do not have any avro gtests for this.
To build it, add a line to the cpp/tests/CMakeList.txt:

ConfigureTest(AVRO_TEST io/avro_test.cpp)

(Here the test file is cpp/tests/io/avro_test.cpp)
Once it is built you can run it from the build directory using gtests/AVRO_TEST
This should crash on the cudf::io::read_avro() line with a cudaErrorIllegalAddress error.
Running with compute-sanitizer will show it fails in the gpuDecodeAvroColumnData kernel:

========= COMPUTE-SANITIZER
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AvroTest
[ RUN      ] AvroTest.TestFile
========= Invalid __global__ write of size 4 bytes
=========     at 0x1ca0 in cudf::io::avro::gpu::gpuDecodeAvroColumnData(cudf::device_span<const cudf::io::avro::block_desc_s, (unsigned long)18446744073709551615>, cudf::io::avro::gpu::schemadesc_s *, cudf::device_span<const thrust::pair<const char *, int>, (unsigned long)18446744073709551615>, const unsigned char *, unsigned int, unsigned int, unsigned long, unsigned long)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x0 is out of bounds
=========     and is 140,629,891,874,816 bytes before the nearest allocation at 0x7fe6f2c00000 of size 816 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
...

I'm not sure why it fails only with your new code. One thing I did notice that was different with the Python tests is that this one passes in column names. If you comment out passing in the column names in the avro_test.cpp then it runs without error

  auto in_opts = cudf::io::avro_reader_options::builder(cudf::io::source_info{filepath})
                   .build();  //.columns(names).build();

One last thing I noticed is that the compute-sanitizer error indicates it is trying to write 4 bytes to address 0x0 which would be a null-pointer. The only null-pointer I see here is the global_dictionary. So something is going wrong around that.

I'm not sure if that helps. I certainly can provide better help with the C++ test/debug. I'm confident that if this is fixed that the Java test will work as well.

@tpn
Copy link
Contributor Author

tpn commented Mar 14, 2023

I don't think I will be able to help you with the Java debugging. However, I was able to recreate the error with a C++ test. Here is the source:

Oh this is incredibly helpful, thanks! Investigating now.

@tpn
Copy link
Contributor Author

tpn commented Mar 16, 2023

Found the issue, heh, @davidwendt, it was an artifact of that switch statement refactoring I did. I also noticed a subtle logic error whilst traipsing through the code in cuda-gdb related to kind/logical_kind and type_union. Here's the patch. I'm going to pick it up tomorrow and rebase etc.

diff --git a/cpp/src/io/avro/avro_gpu.cu b/cpp/src/io/avro/avro_gpu.cu
index b6f56cbb5c..4d33d3f0e7 100644
--- a/cpp/src/io/avro/avro_gpu.cu
+++ b/cpp/src/io/avro/avro_gpu.cu
@@ -97,6 +97,8 @@ avro_decode_row(schemadesc_s const* schema,
       }
       if (i >= schema_len || skip_after < 0) break;
       kind = schema[i].kind;
+      logical_kind = schema[i].logical_kind;
+      if (is_supported_logical_type(logical_kind)) { kind = static_cast<type_kind_e>(logical_kind); }
       skip = skip_after;
     }
 
@@ -110,13 +112,17 @@ avro_decode_row(schemadesc_s const* schema,
         break;
 
       case type_int: {
-        int64_t v                           = avro_decode_zigzag_varint(cur, end);
-        static_cast<int32_t*>(dataptr)[row] = static_cast<int32_t>(v);
+        int64_t v = avro_decode_zigzag_varint(cur, end);
+        if (dataptr != nullptr && row < max_rows) {
+          static_cast<int32_t*>(dataptr)[row] = static_cast<int32_t>(v);
+        }
       } break;
 
       case type_long: {
-        int64_t v                           = avro_decode_zigzag_varint(cur, end);
-        static_cast<int64_t*>(dataptr)[row] = v;
+        int64_t v = avro_decode_zigzag_varint(cur, end);
+        if (dataptr != nullptr && row < max_rows) {
+          static_cast<int64_t*>(dataptr)[row] = v;
+        }
       } break;
 
       case type_bytes: [[fallthrough]];

Semi-related question: I'd like to add in the C++ unit tests for the alltypes_plain.avro file. It doesn't seem like there's a precedent for e.g. resolving the .avro file name relative to the unit test .cpp file name, so I was going to use something like this:

#include <string>
#include <experimental/filesystem>
namespace fs = std::experimental::filesystem;
...
TEST_F(AvroTest, TestFile)
{
  fs::path basedir = fs::path(__FILE__).parent_path();
  fs::path avro_path = basedir / "../../../java/src/test/resources/alltypes_plain.avro";
  auto srcinfo = cudf::io::source_info{avro_path.string()};
  ...

However, depending on std::experimental::filesystem seems like it might be undesirable, if not banned outright. Thoughts? (I suspect one of the reasons there aren't any C++ avro unit tests currently is that we don't have any C++ avro writer scaffolding, so we can't write .avro files to the test temp dir like we do with other file formats (which would avoid this issue of needing to reference a source-file-relative test file).)

@davidwendt
Copy link
Contributor

I'm glad you found the issue. I'll let @vuule answer the question about testing against an existing test file in our repo.

@vuule
Copy link
Contributor

vuule commented Mar 16, 2023

Great news @tpn !
Could this be a Python test instead? We already have a precedent there both for Avro tests and for tests that use test files (not generated at run time).

As for std::filesystem, we already rely on it in a few places. Not sure about C++14, but it's not experimental as of C++17, which we currently use.

@tpn
Copy link
Contributor Author

tpn commented Mar 16, 2023

Great news @tpn ! Could this be a Python test instead? We already have a precedent there both for Avro tests and for tests that use test files (not generated at run time).

Actually yeah I need to figure out why the existing Python unit tests didn't hit this. There shouldn't be any reason preventing this logic from being exercised via Python avenue. I'll investigate.

As for std::filesystem, we already rely on it in a few places. Not sure about C++14, but it's not experimental as of C++17, which we currently use.

Ah maybe it's `std::filesystem::path`` I'm thinking of that's still experimental. Nevertheless, probably moot if I can get the Python tests to exercise this stuff.

tpn and others added 7 commits March 16, 2023 14:15
Co-authored-by: Vukasin Milovanovic <[email protected]>
 - Use `Optional` instead of `Union`.
 - In `test_can_parse_avro_date_logical_type()`:
    - Tweak `avro_type` initialization.
    - Improve clarity of nullable data initialization.
Only specify our test dates once.
 - Remove `using` directive.
 - Fix typo in `CUDF_UNREACHABLE()` message.
Comment-out dead/unreachable avro device code for now.
Tweak the switch statement in avro_decode_row().
@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch from b0da10f to 2141345 Compare March 16, 2023 21:17
@tpn
Copy link
Contributor Author

tpn commented Mar 16, 2023

Just rebased and pushed a fix for the CUDA kernel crashes. Figured out how to replicate the crash via a Python unit test (and verified it actually did crash prior to this fix being applied).

Friendly request for (hopefully the last!) /ok to test.

Edit: pushed again to fix a small typo.

@tpn tpn force-pushed the issue6482-avro-logical-type-awareness branch from 2141345 to 4d2a317 Compare March 16, 2023 21:23
@wence-
Copy link
Contributor

wence- commented Mar 16, 2023

/ok to test

@rapids-bot rapids-bot bot merged commit 1e377fc into rapidsai:branch-23.04 Mar 16, 2023
@tpn
Copy link
Contributor Author

tpn commented Mar 16, 2023

Hurrah! Thanks folks for all your assistance in getting this merged. Much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] datetime dtype not being inferred by cudf avro reader
8 participants