
[DO NOT MERGE] FST benchmark #1


karthikeyann
Collaborator

DO NOT MERGE. This PR exists only to compare the diff.
It adds a benchmark for the Finite State Transducer, which parses and identifies JSON symbols.

vyasr and others added 19 commits July 20, 2022 23:30
This resolves rapidsai#8104. Note that that issue also requests an update to requirements.txt files, but those no longer exist.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - https://github.com/brandon-b-miller
  - H. Thomson Comer (https://github.com/thomcom)

URL: rapidsai#11306
In my endless wandering through parquet code, I found this unused code. Removing it.

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: rapidsai#11305
This just adds in a simple JNI binding for the join_strings cudf function.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Raza Jafri (https://github.com/razajafri)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#11309
Closes rapidsai#10994.

This PR removes the Arrow CUDA-IPC related code we have, which has two benefits:

1. It deletes code (I have confirmed that no one uses this code today)
2. It removes our dependency on Arrow CUDA, which contributes towards removing our shared lib dependency on `libcuda.so`

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/jakirkham

URL: rapidsai#10995
…i#11252)

When handling list columns in the parquet reader, we have to run a preprocess step that computes several things per-page before we can decode values. If the user has further specified artificial row bounds (`skip_rows`, `num_rows`), we have to do a second pass during the preprocess step.

If the user has _not_ specified row bounds, there is no need for this; however, the code was naively always doing it. This PR simply detects when we're reading all rows (which is 99% of use cases) and skips the second pass.

Also includes some cleanup of redundant stream synchronizations.

Also worth mentioning, this `skip_rows`/`num_rows` feature is going to be deprecated in 22.08 so we will be able to follow up further in 22.10 to rip more of this code out.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Jim Brennan (https://github.com/jbrennan333)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#11252
This PR:

- [x] Deprecates `skiprows` & `num_rows` in the cudf parquet reader (`cudf.read_parquet`), since these parameters add a lot of overhead in the case of nested types and are not supported in `pd.read_parquet`

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#11218
Fixes: rapidsai#11256 
This PR fixes an issue with type casting when non-numpy dtypes are passed into the column constructor.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: rapidsai#11282
Closes rapidsai#10948 

Adds support for dictionary encoding with 24 bit indices.

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: rapidsai#11216
This change should make the test fail reliably, whereas the current approach is flaky and leads to not infrequent test failures.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11326
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::lists::distinct`, which has a simpler implementation and better performance. The replacements for internal cudf usage have all been merged beforehand, so this work has no side effects and does not break any existing APIs.

Closes rapidsai#11114, rapidsai#11093, rapidsai#11053, rapidsai#11034, and closes rapidsai#9257.

Depends on:
 * rapidsai#11228
 * rapidsai#11149
 * rapidsai#11234
 * rapidsai#11233

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11236
This PR adds a parallel _Finite-State Transducer_ (FST) algorithm. The FST is a key component of the nested JSON parser.

# Background


**An example of a Finite-State Transducer (FST), i.e., the algorithm we aim to mimic**:
[Slides from the JSON parser presentation, Slides 11-17](https://docs.google.com/presentation/d/1NTQdUMM44NzzHxLNnvcGLQk6pI-fdoM3cXqNqushMbU/edit?usp=sharing)

## Our GPU-based implementation
**The GPU-based algorithm builds on the following work:**
[ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data](https://arxiv.org/pdf/1905.13415.pdf)

**The following sections are of relevance:**
- Section 3.1
- Section 4.5 (i.e., the Multi-fragment in-register array)

**How the algorithm works is illustrated in the following presentation:**
[ParPaRaw @VlLDB'20](https://eliasstehle.com/media/parparaw_vldb_2020.pdf#page=21)

## Relevant Data Structures
**A word about the motivation and need for the _Multi-fragment in-register array_:**

The composition of two state-transition vectors is a key operation in the prefix scan. For two state-transition vectors `lhs` and `rhs`, both comprising `N` items, it computes:
```
for (int32_t i = 0; i < N; ++i) {
  result[i] = rhs[lhs[i]];
}
return result;
```


The relevant part is the indexing into `rhs`: `rhs[lhs[i]]`, i.e., the index is `lhs[i]`, a runtime value that isn't known at compile time. It's important to understand that in CUB's prefix scan both `rhs` and `lhs` are thread-local variables. As such, they live either in the fast register file or in (slow, off-chip) local memory.
The register file has a shortcoming: it cannot be indexed dynamically. And here, we are dynamically indexing into `rhs`. So `rhs` would need to be spilled to local memory (backed by device memory) to allow for dynamic indexing, which would usually make the algorithm very slow. That's why we have the _Multi-fragment in-register array_. For its implementation details, I suggest reading [Section 4.5](https://arxiv.org/pdf/1905.13415.pdf).
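To make the composition concrete, here is a host-side C++ sketch (plain `std::array`, not the CUB thread-local fragments) of the operator; the state count `N = 4` is just an illustrative value, not the real parser's:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative state count; the real parser's DFA has a different number of states.
constexpr int32_t N = 4;
using StateVector = std::array<int32_t, N>;

// Compose two state-transition vectors: starting in state i, take the
// transition recorded in lhs, then the one recorded in rhs.
StateVector compose(StateVector const& lhs, StateVector const& rhs)
{
  StateVector result{};
  for (int32_t i = 0; i < N; ++i) {
    result[i] = rhs[lhs[i]];
  }
  return result;
}
```

Because `compose` is associative (though not commutative), a scan may combine per-symbol transition vectors in any bracketing, which is exactly the property the parallel prefix scan exploits.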

In contrast, the following example is fine and `foo` will be mapped to registers, because the loop can be fully unrolled when `N` is known at compile time and sufficiently small (at most tens of items):
```
// this is fine, if N is a compile-time constant
for (int32_t i = 1; i < N; ++i) {
  foo[i] = foo[i-1];
}
```

# Style & CUB Integration

The following files may be considered for integration into CUB at a later point, hence their deviation in style from cuDF.

- `in_reg_array.cuh`
- `agent_dfa.cuh`
- `device_dfa.cuh`
- `dispatch_dfa.cuh`

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Tobias Ribizel (https://github.com/upsj)
  - Karthikeyan (https://github.com/karthikeyann)

URL: rapidsai#11242
…imate of sizes (rapidsai#11288)

This is a possible workaround for issue rapidsai#11280.  We have a goal to support NVCOMP ZSTD in 22.08, so a short-term fix is desired.

There is a heuristic in `gpuParseCompressedStripeData` to estimate the size of the decompress buffer for very small compressed blocks.  For ZSTD, it is possible to have a high enough compression ratio that this heuristic underestimates the needed decompress size.

This PR adds a boolean parameter to allow us to disable the block size estimate for ZSTD. When the estimate is disabled, it falls back to the maximum block size, which is guaranteed to be big enough.
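The fallback can be sketched as follows (hypothetical names and an illustrative compression ratio; the real heuristic lives in `gpuParseCompressedStripeData` and differs in detail):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch, not the actual cuDF code: a decompressed ORC block is
// never larger than max_block_size, so that bound is always safe to use,
// just potentially wasteful.
std::size_t decompressed_size_bound(std::size_t compressed_size,
                                    std::size_t max_block_size,
                                    bool use_estimate)
{
  if (!use_estimate) { return max_block_size; }  // safe upper bound (ZSTD path)
  // Heuristic for small blocks: assume a bounded compression ratio
  // (illustrative factor of 4), capped at the maximum block size.
  return std::min(max_block_size, compressed_size * 4);
}
```

With the estimate disabled, a highly compressed ZSTD block can no longer overflow its decompress buffer, at the cost of allocating the maximum size for every block.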

cc: @devavret, @vuule

Authors:
  - Jim Brennan (https://github.com/jbrennan333)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#11288
Resolves rapidsai#3036 by making `make test` or `ninja test` default to showing output when tests fail.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#11321
I recently revamped our cuDF [CONTRIBUTING guide](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). I would like to consider replacing the current PR template (which has a fairly daunting amount of text that is immediately deleted by many contributors) with a short checklist of actionable items and a reference to the CONTRIBUTING guide for longer content.

I kept this draft very minimal. Reviewers can see other examples here for inspiration: https://axolo.co/blog/p/part-3-github-pull-request-template. Happy to crowdsource others' thoughts here.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Karthikeyan (https://github.com/karthikeyann)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: rapidsai#10774
…ins` (rapidsai#11330)

The current implementation of `cudf::detail::contains` can process input with arbitrary nested types. However, it was reported to have a severe performance issue when the input tables have many duplicate rows (rapidsai#11299). In order to fix the issue, rapidsai#11310 and rapidsai#11325 were created.

Unfortunately, rapidsai#11310 separates semi-anti-join from `cudf::detail::contains`, causing duplicated implementation. On the other hand, rapidsai#11325 addresses issue rapidsai#11299, but a semi-anti-join built on it still performs worse than the previous semi-anti-join implementation.

The changes in this PR include the following:
 * Fix the performance issue reported in rapidsai#11299 for the current `cudf::detail::contains` implementation that supports nested types.
 * Add a separate code path into `cudf::detail::contains` such that:
     * Input without a lists column (at any nesting level) is processed by a code path identical to the old semi-anti-join implementation. This makes sure the performance of semi-anti-join remains the same as before.
     * Input with a nested lists column, or with NaNs compared as unequal, is processed by another code path that supports nested types and the different NaN behavior. This makes sure support for nested types is not dropped.
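The path selection described above can be sketched roughly like this (hypothetical names; not the actual libcudf code):

```cpp
// Hypothetical sketch of the two-path dispatch; the real logic lives
// inside cudf::detail::contains.
enum class contains_path { hash_based, nested_aware };

// Inputs with a lists column at any nesting level, or inputs that must
// compare NaNs as unequal, need the slower nested-aware path; everything
// else uses the old semi-anti-join-style hash-based path.
contains_path select_contains_path(bool has_nested_lists, bool nans_unequal)
{
  if (has_nested_lists || nans_unequal) { return contains_path::nested_aware; }
  return contains_path::hash_based;
}
```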

Closes rapidsai#11299.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Alessandro Bellina (https://github.com/abellina)

URL: rapidsai#11330
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jul 26, 2022
Depends on #11242 Feature/finite state transducer 

Benchmark for Finite State Transducer
parse and identify JSON symbols
- [x] FST with output, output index, output str
- [x] FST without output index
- [x] FST without output
- [x] FST without output str

See elstehle#1 for the files modified only in this PR (i.e., excluding the parent PR it depends on)

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Elias Stehle (https://github.com/elstehle)

URL: #11243
@karthikeyann karthikeyann deleted the fst_benchmark branch July 26, 2022 09:50
@karthikeyann karthikeyann restored the fst_benchmark branch August 4, 2022 16:25
elstehle pushed a commit that referenced this pull request Jun 13, 2023
This implements stacktrace support and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing downstream applications to trace back the error with much less effort.

Closes rapidsai#12422.

### Example:
```
#0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446
#1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113
#2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e
#3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88
#4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9
rapidsai#5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027
rapidsai#6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f
rapidsai#7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6
rapidsai#8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195
rapidsai#9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109
rapidsai#10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f
rapidsai#11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87
rapidsai#12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95
rapidsai#13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c
rapidsai#14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
rapidsai#15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
rapidsai#16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5
```

### Usage

In order to retrieve a stacktrace with fully human-readable symbols, some compiling options must be adjusted. To make such adjustment convenient and effortless, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Just set this option to `ON` before building cudf and it will be ready to use.

For downstream applications, whenever a cudf-type exception is thrown, it can retrieve the stored stacktrace and do whatever it wants with it. For example:
```
try {
  // cudf API calls
} catch (cudf::logic_error const& e) {
  std::cout << e.what() << std::endl;
  std::cout << e.stacktrace() << std::endl;
  throw e;
} 
// similar with catching other exception types
```

### Follow-up work

The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace.


### Note:
 * This feature doesn't require libcudf to be built in Debug mode.
 * The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, i.e., when debugging exceptions thrown by cudf.
 * This flag removes the current optimization flags from compilation (such as `-O2` or `-O3`, if in Release mode) and replaces them with `-Og` (optimize for debugging).
 * If this option is not set to `ON`, the stacktrace will not be available. This avoids expensive stacktrace retrieval when thrown exceptions are expected.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#13298
elstehle pushed a commit that referenced this pull request Oct 5, 2023
Pin conda packages to `aws-sdk-cpp<1.11`. The recent upgrade to version `1.11.*` has caused several issues with cleanup (more details on the changes can be read in [this link](https://github.com/aws/aws-sdk-cpp#version-111-is-now-available)), leading to segfaults in Distributed and Dask-CUDA processes. The stack for one of those crashes looks like the following:

```
(gdb) bt
#0  0x00007f5125359a0c in Aws::Utils::Logging::s_aws_logger_redirect_get_log_level(aws_logger*, unsigned int) () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so
#1  0x00007f5124968f83 in aws_event_loop_thread () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-io.so.1.0.0
#2  0x00007f5124ad9359 in thread_fn () from /opt/conda/envs/dask/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1
#3  0x00007f519958f6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f5198b1361f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```

Such segfaults now manifest frequently in CI, and in some cases are reproducible with a hit rate of ~30%. Given the approaching release, it's probably safest to just pin to an older version of the package until we pinpoint the exact cause of the issue and a patched build is released upstream.

The `aws-sdk-cpp` is statically-linked in the `pyarrow` pip package, which prevents us from using the same pinning technique. cuDF is currently pinned to `pyarrow=12.0.1` which seems to be built against `aws-sdk-cpp=1.10.*`, as per [recent build logs](https://github.com/apache/arrow/actions/runs/6276453828/job/17046177335?pr=37792#step:6:1372).

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#14173
elstehle pushed a commit that referenced this pull request Dec 4, 2023