Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex-based parser #166

Closed
4 tasks done
masklinn opened this issue Jul 11, 2023 · 0 comments · Fixed by #217
Closed
4 tasks done

regex-based parser #166

masklinn opened this issue Jul 11, 2023 · 0 comments · Fixed by #217

Comments

@masklinn
Copy link
Contributor

masklinn commented Jul 11, 2023

After testing RegexSet it's the same as re2::Set (if not worse) and definitely unsuitable for the job (builds a giant DFA, which is ungodly slow when using a bunch of actual regexes).

So we'd need to implement something like FilteredRE2, finally took a gander at how that works and turns out it's conceptually pretty simple: internally it's a big Vec<Regex>, an extractor and mapper of the relevant literal atoms (PrefilterTree), and an gatekeeper (re2::Set or an aho-corasick or sumthin, basically something which can efficiently check a large amount of literal strings against a needle and returns those which do match), the gatekeeper is initialised with the atoms, and its job is to return a set of matching atoms for the needle. FilteredRE2 then maps those atoms back to their regexes, and only tests those against the needle.

Per rust-lang/regex#923 the regex-syntax crate can be used to extract the literal nubbins, however selecting the relevant atoms will require studying and translating re2's prefilter (and prefilter tree), as it doesn't look like the asker of the original question went on to do publicly available things which could be reused, and the process seems hairy.

This complexity means we're not getting around to that any time soon, especially as I still haven't seen a way to do pure python fallbacks with maturin (cf PyO3/maturin#563, PyO3/maturin#1081, PyO3/maturin#1839, 35227), so this should probably be an independent package e.g. uap-rs with a published pyo3 package. A general-purpose regex binding (à la google-re2) would not make much sense due to the need of implementing the filtering system unless that can get upstreamed (doesn't seem super likely).

Opportunity

A specialised version might actually do better than FilteredRE2: functionally in the re2 parser we have to apply the gatekeeper, then test all the potential regexes, then get the smallest index, then match the selected regex to extract the data.

But with a specialised implementation, we can match the potential regexes, in-order, and return the first capture, avoiding a second pass. The selection cost might be a bit higher as it requires sorting the potential matches (rather than just taking the lowest actual match), and I assume even preparing for capture is more expensive than preparing for matching, but we'd have to test the difference on match failure (as on success test + match should logically be costlier than direct match).

Also obviously we might be able to use a more efficient gatekeeper, since aho-corasick exists and is already a dependency of regex.

Similar idea to #149

Depends on #116

Unresolved question

pyo3 or cffi?

  • pyo3 is almost certainly faster because it can do less call and the data extraction / conversion is done on the rust side, and much safer as the API is expressed in Rust

    The major drawback is that pyo3 bindings only really work with cpython, pypy has cpyext but edit: probably not an option (or at least a good one) per PyPy Support via cpyext - What's needed? PyO3/pyo3#129

  • cffi has a much lower level API as everything has to transit through C, no rich type on either side, and requires providing python code, no cool conversions... but it works and should be well supported on pypy, otoh the cffi bit is the python side, the C side is... C, so usable from any language which can call C which could be helpful

  • uniffi seems like a new kid on the block, a multi language binding generator, but it's not clear that it supports pypy? no mention anywhere

    edit: per the pyo3 docs: "uniffi wheels are compatible with all python versions including pypy."

  • hpy apparently also a new kid on the block, but seems quite slow going

@masklinn masklinn changed the title Test regex regex-based parsers Oct 23, 2023
@masklinn masklinn changed the title regex-based parsers regex-based parser Oct 23, 2023
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
@masklinn masklinn removed this from the 1.0 milestone Mar 13, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Jul 15, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Oct 8, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Oct 8, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Oct 8, 2024
masklinn added a commit to masklinn/uap-python that referenced this issue Oct 13, 2024
Uses ua-parser/uap-rust#3

- add an optional dependency on `ua-parser-rs` under the `regex` key
- add a regex-based parser

misc:

- update the classifiers
- bump required-python to 3.9 (3.8 is basically EOL)
- update CI to better split up the steps
- fix up the check for binary pyyaml: requirements_dev was removed in
  81da21a, in May 2023, so this
  hasn't been working for 18 months
- fix CLI script to correctly handle optional modules so it can run on
  pypy and graal, add regex, make tracemalloc optional as pypy doesn't
  support it (didn't check graal)
- update tox: remove cpython 3.8, pypy 3.8 and 3.9 from tox (3.8's
  last supporting release was 7.3.11 in May 2023, 3.9's was 7.3.16 in
  April 2024), add graal

Fixes ua-parser#166
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant