Fixes a symbol group lookup table issue #14561
Merged: rapids-bot merged 2 commits into rapidsai:branch-24.02 from elstehle:fix/fst-symbol-lookup-table on Dec 7, 2023
elstehle requested review from davidwendt, nvdbaranec, karthikeyann and shrshi on December 4, 2023 17:24
elstehle added labels on Dec 4, 2023:
- bug: Something isn't working
- 3 - Ready for Review: Ready for review by team
- libcudf: Affects libcudf (C++/CUDA) code.
- cuIO: cuIO issue
- non-breaking: Non-breaking change
elstehle changed the title from "Fixes a symbol group lookup table corner case" to "Fixes a symbol group lookup table issue" on Dec 4, 2023
elstehle commented on Dec 6, 2023
init_data.num_valid_entries = max_base_match_val + 1;
// The number of valid entries in the table (including the entry for the out-of-bounds symbol
// group id)
init_data.num_valid_entries = oob_match_index + 1;
In essence, this is the change that was made here. Previously, we were setting `init_data.num_valid_entries = oob_match_index;`, which means that lookups larger than the highest valid index were mapped to the symbol group of that highest entry instead of the `OTHER` symbol group, which comes just one entry after it.
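To make the off-by-one concrete, here is a minimal host-side sketch. It is not the libcudf implementation: `clamped_lookup` and `make_bracket_table` are hypothetical helpers, and the table assumes a single explicit symbol group `[]` (id 0) with `OTHER` as id 1. Only the clamping expression follows the one quoted in the PR description.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamped lookup: any symbol at or beyond the last valid entry reads the last valid entry.
inline std::uint32_t clamped_lookup(std::vector<std::uint32_t> const& sym_to_sgid,
                                    std::uint32_t num_valid_entries,
                                    unsigned char symbol)
{
  return sym_to_sgid[std::min<std::uint32_t>(symbol, num_valid_entries - 1U)];
}

// Hypothetical table for one explicit group "[]" (id 0); everything else is OTHER (id 1).
// Indices 0..94 exist: 93 is ']' (highest explicit symbol), 94 is the out-of-bounds slot.
inline std::vector<std::uint32_t> make_bracket_table()
{
  std::uint32_t const other_sgid = 1;
  std::size_t const table_size   = static_cast<std::size_t>(']') + 2;
  std::vector<std::uint32_t> table(table_size, other_sgid);
  table['['] = 0;
  table[']'] = 0;
  return table;
}
```

With `oob_match_index = 94`, passing `num_valid_entries = oob_match_index` clamps `{` (123) to index 93 and returns the bracket group, whereas `oob_match_index + 1` clamps it to index 94 and returns `OTHER`.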
ttnghia approved these changes on Dec 6, 2023
/merge
karthikeyann pushed a commit to karthikeyann/cudf that referenced this pull request on Dec 12, 2023
This PR fixes an issue in the finite-state transducer's (FST) lookup table that is used to map an input character to a symbol group. A symbol group is an integer that's subsequently used to select a row from the transition table.

The FST uses an `OTHER` symbol group, to which all symbols are mapped that are not explicitly mapped to a symbol group. E.g., say we have two symbol groups, one that contains braces (`{`, `}`) and one that contains brackets (`[`, `]`).

```
const std::vector<std::string> symbol_groups = {"{}", "[]"};
// symbol (ASCII value) -> symbol group
// { (123) -> 0
// } (125) -> 0
// [ (91) -> 1
// ] (93) -> 1
// <anything else> -> 2 ('OTHER')

// So the lookup table will look something like this:
// lut[0] -> 2
// lut[1] -> 2
// lut[2] -> 2
// ...
// lut[91] -> 1
// lut[92] -> 2
// lut[93] -> 1
// ...
// lut[123] -> 0
// lut[124] -> 2
// lut[125] -> 0
// lut[126] -> 2
```

Now, when running the FST, we want to limit the range of lookups that we have to perform, so we bound the character to look up to the one-past-the-last index that was explicitly provided, because anything that comes after that index maps to the `OTHER` symbol group anyway. In the above example, the highest provided index is `125` (`}`) and one past it is index `126`. We clamp any character value above `126` to `126`. The _number_ of valid items is `126+1`. So the lookup at runtime becomes:

```
return sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)];
```

Previously, we were computing the number of valid items incorrectly. The issue didn't surface because most of our FST usage included `}`, which is only followed by `~` and `DEL`. Those two are only valid as part of string values anyway, so mapping them to the wrong symbol group did not change semantics there.

Authors:
- Elias Stehle (https://github.com/elstehle)
- Ray Douglass (https://github.com/raydouglass)

Approvers:
- Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#14561
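The table construction described above can be sketched on the host as follows. This is an illustration under stated assumptions, not the libcudf code: `SymbolGroupLut`, `make_lut` and `lookup` are hypothetical names, the table is a plain `std::vector`, and at least one symbol is assumed to be explicitly mapped. Only the clamping expression and the `oob_match_index + 1` entry count come from the PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side stand-in for the symbol-to-symbol-group lookup table.
struct SymbolGroupLut {
  std::vector<std::uint32_t> sym_to_sgid;  // entries 0..oob_match_index
  std::uint32_t num_valid_entries{};       // includes the out-of-bounds (OTHER) entry
};

inline SymbolGroupLut make_lut(std::vector<std::string> const& symbol_groups)
{
  // OTHER gets the id just past the explicitly provided groups.
  auto const other_sgid = static_cast<std::uint32_t>(symbol_groups.size());

  // Highest symbol value that is explicitly assigned to a group.
  std::uint32_t max_symbol = 0;
  for (auto const& group : symbol_groups) {
    for (unsigned char c : group) {
      max_symbol = std::max<std::uint32_t>(max_symbol, c);
    }
  }

  // One past the highest explicit symbol: the slot that all larger symbols clamp to.
  std::uint32_t const oob_match_index = max_symbol + 1;

  SymbolGroupLut lut;
  lut.num_valid_entries = oob_match_index + 1;  // the corrected count
  lut.sym_to_sgid.assign(lut.num_valid_entries, other_sgid);
  for (std::uint32_t sgid = 0; sgid < symbol_groups.size(); ++sgid) {
    for (unsigned char c : symbol_groups[sgid]) {
      lut.sym_to_sgid[c] = sgid;
    }
  }
  return lut;
}

// Runtime lookup with the symbol clamped to the last valid entry.
inline std::uint32_t lookup(SymbolGroupLut const& lut, char symbol)
{
  auto const idx = std::min<std::uint32_t>(static_cast<unsigned char>(symbol),
                                           lut.num_valid_entries - 1U);
  return lut.sym_to_sgid[idx];
}
```

For the groups `{"{}", "[]"}` this yields 127 valid entries, and both `~` (126) and `DEL` (127) resolve to symbol group 2 (`OTHER`) through index 126.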
Description
This PR fixes an issue in the finite-state transducer's (FST) lookup table that is used to map an input character to a symbol group. A symbol group is an integer that's subsequently used to select a row from the transition table.

The FST uses an `OTHER` symbol group, to which all symbols are mapped that are not explicitly mapped to a symbol group. E.g., say we have two symbol groups, one that contains braces (`{`, `}`) and one that contains brackets (`[`, `]`).

Now, when running the FST, we want to limit the range of lookups that we have to perform, so we bound the character to look up to the one-past-the-last index that was explicitly provided, because anything that comes after that index maps to the `OTHER` symbol group anyway. In the above example, the highest provided index is `125` (`}`) and one past it is index `126`. We clamp any character value above `126` to `126`. The number of valid items is `126+1`. So the lookup at runtime becomes `sym_to_sgid[min(static_cast<SymbolGroupIdT>(symbol), num_valid_entries - 1U)]`.

Previously, we were computing the number of valid items incorrectly. The issue didn't surface because most of our FST usage included `}`, which is only followed by `~` and `DEL`. Those two are only valid as part of string values anyway, so mapping them to the wrong symbol group did not change semantics there.