Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider switching to google-re2 and Filter #149

Closed
junyer opened this issue Mar 15, 2023 · 7 comments
Closed

consider switching to google-re2 and Filter #149

junyer opened this issue Mar 15, 2023 · 7 comments
Milestone

Comments

@junyer
Copy link
Contributor

junyer commented Mar 15, 2023

google-re2 is a drop-in replacement for the re module that also offers Set and Filter. ua-parser/uap-cpp#38 describes why you might want to consider using the latter.

@mattrobenolt
Copy link
Member

Do we know if all of the patterns in uap-core would be re2 compatible?

I'm just not very aware of what features we need to support, obviously if we need backtracking or something, we can't. I think there are cases where some patterns are syntactically different than PCRE too.

So I guess that'd be first step. If it would be compatible, I think being able to support google-re2 optionally would be good, rather than a hard requirement on it.

@junyer
Copy link
Contributor Author

junyer commented Mar 15, 2023

Do we know if all of the patterns in uap-core would be re2 compatible?

I believe so because ua-parser/uap-cpp#26 describes the due diligence done (back in 2019) before switching to RE2. I make no claim regarding the correctness or otherwise of that switch; it's just a data point.

So I guess that'd be first step. If it would be compatible, I think being able to support google-re2 optionally would be good, rather than a hard requirement on it.

How would you prefer that optionality to work? I'm aware of extras_require, for example, which I think entails running pip install ua-parser[RE2] and thus makes that dependency on RE2 explicit, not opportunistic. However, given that other re2 modules exist, hooking up a maybe-import in user_agent_parser.py probably wouldn't be especially elegant. Thoughts? :)

@mattrobenolt
Copy link
Member

Yeah, thinking it can be something in extras_require and/or just use something simple like:

try:
    import re2 as re
except ImportError:
    import re

or whatever the compatibility shim would need to be, I assume it wouldnt' need to be too complex, but I know nothing about google-re2 explicitly, just RE2 in general. :)

Unless you're suggesting that there are multiple common modules that exist as import re2, in which case, this all seems complicated even if we explicitly declare it as a dependency, since someone may have multiple variants installed. That seems like a bigger issue.

@junyer
Copy link
Contributor Author

junyer commented Mar 15, 2023

Before google-re2, there was pyre2, re2, sw-re2, … and all of them expect to be installed (and imported) as re2. We should really "probe" for Filter as well, I suppose.

@mattrobenolt
Copy link
Member

Given that's the state of the ecosystem, it'd probably be best we don't require one in our installation since it's a higher chance of conflicting with another project maybe explicitly or implicitly depending on one of the competing re2 imports, which can likely break programs for people.

I'd rather, as you suggest, expand the detection and see if we have the google-re2 lib installed explicitly by some deeper feature detection.

Looking briefly at the code, it seems the C module can be imported at _re2 so maybe some detection off of that?

@junyer
Copy link
Contributor Author

junyer commented Mar 16, 2023

The _re2 extension has more or less the same problem with regard to naming collisions, I suspect, so "probing" for Filter within the re2 module seems like about the best that we can do, but that would arguably also be in the spirit of duck typing.

I can hopefully find some time soon to put together a PR for this. The lawyer🐈s have advised me to try to tidy up a couple of things for compliance reasons; I will file a separate issue about that.

@junyer
Copy link
Contributor Author

junyer commented Mar 16, 2023

Okay, I took a look and had a think and... it's complicated. FilteredRE2 keeps an array of the RE2 objects; I want to make a couple of tweaks such that Filter exposes them so that callers don't have to duplicate them. I will need to think a lot more about achieving optionality in user_agent_parser.py without introducing inefficiencies or intrusive, ugly hacks.

This was referenced Apr 30, 2023
@masklinn masklinn mentioned this issue Jul 11, 2023
4 tasks
masklinn added a commit to masklinn/uap-python that referenced this issue Nov 2, 2023
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Only uses re2.Set which turns out to be not great, at least according
to `pytest --durations` on 3.11:

- re2 is sometimes faster for UA tests
  - `pgts_browser_list.yaml` goes from 2.5s to 1.5
  - `firefox_user_agent_strings.yaml` goes from 0.05 to 0.04 (not
    really significant)
  - though `test_ua.yaml` goes from 0.18 to 0.65
- re2 is *way* slower for devices tests
  - `test_device.yaml` goes from 2.5 to 8s

Obviously tests might not be representative at all, implementing a
proper benchmark on a real-life test-set (ua-parser#163) would likely provide
better information.

It's possible that `FilteredRE2` would would offer better
performances, *but* it requires additional memory and more importantly
it requires a fast literal string matcher e.g. a fast implementation
of Aho-Corasick, or possibly Hyperscan's Teddy (via
[python-hyperscan][5]?). [According to burntsushi commentz-walter is
not great in practice][1], at least as you increase the number of
patterns, so that one looks like a dead end.

Either way this would likely be an *additional* dependency to make it
usable, although there seems to be [a well-maintained Python version
with impressive performances (for pure python)][2], [a native
module][3], and [a wrapper for burntsushi's rust implementation][4]
which claims even better performances than the native module.

Linked to (but probably can't be argued to fix) ua-parser#149.

[1]: https://news.ycombinator.com/item?id=26913349
[2]: https://github.com/abusix/ahocorapy
[3]: https://github.com/WojciechMula/pyahocorasick/
[4]: https://github.com/G-Research/ahocorasick_rs/
[5]: https://python-hyperscan.readthedocs.io
masklinn added a commit to masklinn/uap-python that referenced this issue Nov 3, 2023
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter`, which unlike the C++ `FilteredRE2` bundles
prefiltering, using an `re2.Set` so likely less efficient than
providing one's own e.g. aho-corasick, but avoids having to do that.

At first glance according to pytest's `--durations 0` this is quite
successful (unlike using `re2.Set` which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, altough the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
masklinn added a commit to masklinn/uap-python that referenced this issue Jan 14, 2024
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter`, which unlike the C++ `FilteredRE2` bundles
prefiltering, using an `re2.Set` so likely less efficient than
providing one's own e.g. aho-corasick, but avoids having to do that.

At first glance according to pytest's `--durations 0` this is quite
successful (unlike using `re2.Set` which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, altough the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 3, 2024
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter`, which unlike the C++ `FilteredRE2` bundles
prefiltering, using an `re2.Set` so likely less efficient than
providing one's own e.g. aho-corasick, but avoids having to do that.

At first glance according to pytest's `--durations 0` this is quite
successful (unlike using `re2.Set` which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, altough the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 3, 2024
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter`, which unlike the C++ `FilteredRE2` bundles
prefiltering, using an `re2.Set` so likely less efficient than
providing one's own e.g. aho-corasick, but avoids having to do that.

At first glance according to pytest's `--durations 0` this is quite
successful (unlike using `re2.Set` which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, altough the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
masklinn added a commit to masklinn/uap-python that referenced this issue Feb 6, 2024
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter`, which unlike the C++ `FilteredRE2` bundles
prefiltering, using an `re2.Set` so likely less efficient than
providing one's own e.g. aho-corasick, but avoids having to do that.

At first glance according to pytest's `--durations 0` this is quite
successful (unlike using `re2.Set` which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, altough the memory consumption has not been
checked (yet).

Fixes ua-parser#149, kind-of
@masklinn masklinn added this to the 1.0 milestone Feb 6, 2024
masklinn added a commit that referenced this issue Feb 6, 2024
Requires splitting out some of the testenvs, as re2 is not available
for pypy at all, and not yet for 3.12.

Uses `re2.Filter`, which unlike the C++ `FilteredRE2` bundles
prefiltering, using an `re2.Set` so likely less efficient than
providing one's own e.g. aho-corasick, but avoids having to do that.

At first glance according to pytest's `--durations 0` this is quite
successful (unlike using `re2.Set` which was more of a mixed bag):

```
2.54s call     tests/test_core.py::test_devices[test_device.yaml-basic]
2.51s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-basic]
2.48s call     tests/test_legacy.py::TestParse::testPGTSStrings
2.43s call     tests/test_legacy.py::TestParse::testStringsDevice
0.95s call     tests/test_core.py::test_devices[test_device.yaml-re2]
0.55s call     tests/test_core.py::test_ua[pgts_browser_list.yaml-re2]
0.18s call     tests/test_core.py::test_ua[test_ua.yaml-basic]
0.16s call     tests/test_legacy.py::TestParse::testBrowserscopeStrings
0.10s call     tests/test_core.py::test_ua[test_ua.yaml-re2]
```

While the "basic" parser for the new API is slightly slower than the
legacy API (browserscope does use test_ua.yaml so that matches) the
re2 parser is significantly faster than both:

- 60% faster on test_device.yaml (~2.5s -> 1s)
- 80% faster on pgts (2.5s -> 0.5s)
- 40% faster on browserscope (0.16 -> 0.1)

This is very encouraging, altough the memory consumption has not been
checked (yet).

Fixes #149, kind-of
@masklinn masklinn closed this as completed Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants