Adds the Finite-State Transducer algorithm #11242

Merged
40 commits
0557d41
squashed with bracket/brace test
elstehle Apr 11, 2022
355d1e4
clean up & addressing review comments
elstehle Apr 20, 2022
39a6b65
refactored lookup tables
elstehle Apr 25, 2022
239f138
put lookup tables into their own cudf file
elstehle Apr 25, 2022
39cff80
Change interface for FST to not need temp storage
elstehle Apr 27, 2022
e24a133
removing unused var post-cleanup
elstehle May 4, 2022
caf6195
unified usage of pragma unrolls
elstehle May 9, 2022
ea79a81
Adding hostdevice macros to in-reg array
elstehle May 9, 2022
17dcbfd
making const vars const
elstehle May 9, 2022
6fdd24a
refactor lut sanity check
elstehle May 9, 2022
eccf970
fixes sg-count & uses rmm stream in fst tests
elstehle Jun 2, 2022
9fe8e4b
minor doxygen fix
elstehle Jun 14, 2022
694a365
adopts suggested fst test changes
elstehle Jun 15, 2022
f656f49
adopts device-side test data gen
elstehle Jul 7, 2022
485a1c6
adopts c++17 namespaces declarations
elstehle Jul 9, 2022
5f1c4b5
removes state vector-wrapper in favor of vanilla array
elstehle Jul 11, 2022
e6f8def
some west-const remainders & unifies StateIndexT
elstehle Jul 11, 2022
a798852
adds check for state transition narrowing conversion
elstehle Jul 11, 2022
eb24962
fixes logical stack test includes
elstehle Jul 12, 2022
f52e614
replaces enum with typed constexpr
elstehle Jul 14, 2022
3038058
adds explicit error checking
elstehle Jul 14, 2022
d351e5c
addresses style review comments & fixes a todo
elstehle Jul 14, 2022
3f47952
replaces gtest asserts with expects
elstehle Jul 14, 2022
cba1619
fixes style in dispatch dfa
elstehle Jul 14, 2022
bea2a02
replaces vanilla loop with iota
elstehle Jul 15, 2022
8a184e9
rephrases documentation on in-reg array
elstehle Jul 16, 2022
78dd893
Merge remote-tracking branch 'upstream/branch-22.08' into feature/fin…
elstehle Jul 16, 2022
7a19f64
Merge remote-tracking branch 'upstream/branch-22.08' into feature/fin…
elstehle Jul 19, 2022
4783aae
improves style in fst test
elstehle Jul 20, 2022
6203709
adds comments in in_reg array
elstehle Jul 20, 2022
ad5817a
adds comments to lookup tables
elstehle Jul 20, 2022
dc55653
fixes formatting
elstehle Jul 20, 2022
378be9f
exchanges loops in favor of copy and fills
elstehle Jul 20, 2022
4ba5472
clarifies documentation in agent dfa
elstehle Jul 20, 2022
7980978
disambiguates transition and translation tables
elstehle Jul 20, 2022
2bce061
minor style fix
elstehle Jul 21, 2022
b37f716
if constexprs and doxy on DFA helper
elstehle Jul 21, 2022
d42869a
minor documentation fix
elstehle Jul 21, 2022
6c889f7
replaces loop for comparing vectors with generic macro
elstehle Jul 21, 2022
8a54c72
uses new vector comparison for logical stack test
elstehle Jul 21, 2022
672 changes: 672 additions & 0 deletions cpp/src/io/fst/agent_dfa.cuh

Large diffs are not rendered by default.

94 changes: 94 additions & 0 deletions cpp/src/io/fst/device_dfa.cuh
@@ -0,0 +1,94 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include "dispatch_dfa.cuh"

#include <io/utilities/hostdevice_vector.hpp>

#include <cstdint>

namespace cudf::io::fst {

/**
* @brief Uses a deterministic finite automaton to transduce a sequence of symbols from an input
* iterator to a sequence of transduced output symbols.
*
* @tparam DfaT The DFA specification
* @tparam SymbolItT Random-access input iterator type to symbols fed into the FST
* @tparam TransducedOutItT Random-access output iterator type to which the transduced output is
* written
* @tparam TransducedIndexOutItT Random-access output iterator type to which the indexes of the
* symbols that caused some output to be written are written
* @tparam TransducedCountOutItT A single-item output iterator type to which the total number of
* output symbols is written
* @tparam OffsetT A type large enough to index into both (a) the input symbols and (b) the
* output symbols
* @param[in] d_temp_storage Device-accessible allocation of temporary storage. When NULL, the
* required allocation size is written to \p temp_storage_bytes and no work is done.
* @param[in,out] temp_storage_bytes Reference to size in bytes of \p d_temp_storage allocation
* @param[in] dfa The DFA specifying the number of distinct symbol groups, transition table, and
* translation table
* @param[in] d_chars_in Random-access input iterator to the beginning of the sequence of input
* symbols
* @param[in] num_chars The total number of input symbols to process
* @param[out] transduced_out_it Random-access output iterator to which the transduced output is
* written
* @param[out] transduced_out_idx_it Random-access output iterator to which the index i is written
* iff the i-th input symbol caused some output to be written
* @param[out] d_num_transduced_out_it A single-item output iterator to which the total number
* of output symbols is written
* @param[in] seed_state The DFA's starting state. For streaming DFAs this corresponds to the
* "end-state" of the previous invocation of the algorithm.
* @param[in] stream CUDA stream to launch kernels within. Default is the null-stream.
*/
template <typename DfaT,
          typename SymbolItT,
          typename TransducedOutItT,
          typename TransducedIndexOutItT,
          typename TransducedCountOutItT,
          typename OffsetT>
cudaError_t DeviceTransduce(void* d_temp_storage,
                            size_t& temp_storage_bytes,
                            DfaT dfa,
                            SymbolItT d_chars_in,
                            OffsetT num_chars,
                            TransducedOutItT transduced_out_it,
                            TransducedIndexOutItT transduced_out_idx_it,
                            TransducedCountOutItT d_num_transduced_out_it,
                            uint32_t seed_state = 0,
                            cudaStream_t stream = 0)
{
  using DispatchDfaT = detail::DispatchFSM<DfaT,
                                           SymbolItT,
                                           TransducedOutItT,
                                           TransducedIndexOutItT,
                                           TransducedCountOutItT,
                                           OffsetT>;

  return DispatchDfaT::Dispatch(d_temp_storage,
                                temp_storage_bytes,
                                dfa,
                                seed_state,
                                d_chars_in,
                                num_chars,
                                transduced_out_it,
                                transduced_out_idx_it,
                                d_num_transduced_out_it,
                                stream);
}

} // namespace cudf::io::fst
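
To make the contract documented for `DeviceTransduce` easier to follow, here is a minimal host-side sketch of the sequential semantics the doc comment describes: a transition table drives the state, a translation table decides what is emitted, and the index of each emitting input symbol is recorded. Everything in it is hypothetical and not part of this PR: the names `transduce_host` and `HostTransduceResult`, the toy three-state DFA, and the simplification that a transition emits either the input symbol itself or nothing. The actual implementation runs on the GPU through `detail::DispatchFSM` and may well differ in structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side reference model. None of these names exist in the PR.
struct HostTransduceResult {
  std::string out;                   // transduced output symbols
  std::vector<std::size_t> out_idx;  // index of the input symbol behind each output symbol
  std::uint32_t end_state;           // state after the last symbol (seed for the next chunk)
};

// Toy DFA (assumption): 0 = outside a string, 1 = inside a string, 2 = after a backslash.
constexpr std::uint32_t kNumStates = 3;
constexpr int kNumGroups           = 5;

// Maps a symbol to its symbol group: open bracket, close bracket, quote, backslash, other.
inline int symbol_group(char c)
{
  switch (c) {
    case '{':
    case '[': return 0;
    case '}':
    case ']': return 1;
    case '"': return 2;
    case '\\': return 3;
    default: return 4;
  }
}

// Transition table: kTransition[state][group] is the next state.
constexpr std::uint32_t kTransition[kNumStates][kNumGroups] = {
  {0, 0, 1, 0, 0},  // outside: a quote opens a string
  {1, 1, 0, 2, 1},  // inside: a quote closes the string, a backslash escapes
  {1, 1, 1, 1, 1},  // escaped: the next symbol is consumed, back inside the string
};

// Translation table, simplified to "copy the input symbol or emit nothing".
constexpr bool kEmit[kNumStates][kNumGroups] = {
  {true, true, false, false, false},  // only brackets outside strings are emitted
  {false, false, false, false, false},
  {false, false, false, false, false},
};

// Processes the symbols one by one, starting from seed_state.
HostTransduceResult transduce_host(std::string const& in, std::uint32_t seed_state = 0)
{
  assert(seed_state < kNumStates);
  HostTransduceResult result{{}, {}, seed_state};
  for (std::size_t i = 0; i < in.size(); ++i) {
    int const group = symbol_group(in[i]);
    if (kEmit[result.end_state][group]) {
      result.out.push_back(in[i]);  // corresponds to transduced_out_it
      result.out_idx.push_back(i);  // corresponds to transduced_out_idx_it
    }
    result.end_state = kTransition[result.end_state][group];
  }
  return result;
}
```

For the input `{"k":"]\""}` the sketch emits `{}` with indexes 0 and 10, and `result.out.size()` plays the role of the value written through `d_num_transduced_out_it`. Splitting that input after index 6 and passing the first chunk's `end_state` as the seed of the second chunk produces the same output symbols, with indexes relative to each chunk, which is the streaming use that the `seed_state` parameter documents. Note that `DeviceTransduce` as shown does not return an end state; the sketch returns one only to illustrate seeding.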