Move lru cache from inside of _encode_host to outside #1348

Merged
merged 32 commits into from
Oct 21, 2024
Changes from 9 commits
Commits
32 commits
8debaf8
Move lru cache from inside of _encode_host to outside
bdraco Oct 21, 2024
056f2e5
remove unused
bdraco Oct 21, 2024
dea6b8e
docs
bdraco Oct 21, 2024
0a4a731
docs
bdraco Oct 21, 2024
944526c
docs
bdraco Oct 21, 2024
25c1377
docs syntax fixes
bdraco Oct 21, 2024
e9271f4
coverage
bdraco Oct 21, 2024
b0311b6
changelog
bdraco Oct 21, 2024
60ae823
bump ver
bdraco Oct 21, 2024
60c1b74
Update CHANGES/1348.breaking.rst
bdraco Oct 21, 2024
1441def
bump ver
bdraco Oct 21, 2024
9279ad6
Merge remote-tracking branch 'origin/move_cache_encode_host' into mov…
bdraco Oct 21, 2024
fb219d2
Update yarl/_url.py
bdraco Oct 21, 2024
0a7fdb8
keep _idna_encode
bdraco Oct 21, 2024
61ad89a
keep _idna_encode
bdraco Oct 21, 2024
861fed7
revert to 256 to check benchmark
bdraco Oct 21, 2024
40a16dc
split sizes
bdraco Oct 21, 2024
7a61562
Update docs/api.rst
bdraco Oct 21, 2024
e176f59
Apply suggestions from code review
bdraco Oct 21, 2024
e6d8744
Update docs/api.rst
bdraco Oct 21, 2024
fd9432b
split defaults
bdraco Oct 21, 2024
baa6966
Update CHANGES/1348.breaking.rst
bdraco Oct 21, 2024
0f4c481
Update docs/api.rst
bdraco Oct 21, 2024
6bd0f1a
Apply suggestions from code review
bdraco Oct 21, 2024
02318e5
fix
bdraco Oct 21, 2024
099abd9
Update yarl/_url.py
bdraco Oct 21, 2024
54bdd41
Update CHANGES/1348.breaking.rst
bdraco Oct 21, 2024
df3a97f
cleanup changelog message
bdraco Oct 21, 2024
5c891a5
Update CHANGES/1348.breaking.rst
bdraco Oct 21, 2024
04ea620
Merge branch 'master' into move_cache_encode_host
bdraco Oct 21, 2024
111da73
Merge branch 'master' into move_cache_encode_host
bdraco Oct 21, 2024
aa9afb4
Merge branch 'master' into move_cache_encode_host
bdraco Oct 21, 2024
5 changes: 5 additions & 0 deletions CHANGES/1348.breaking.rst
@@ -0,0 +1,5 @@
Migrate to using a single cache for encoding hosts -- by :user:`bdraco`

Passing ``idna_encode_size``, ``ip_address_size``, and ``host_validate_size`` to :py:func:`~yarl.cache_configure` is deprecated in favor of the new ``encode_host_size`` parameter and will be removed in a future release.

For backwards compatibility, the old parameters affect the ``encode_host`` cache size.
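
The backwards-compatibility rule described above can be sketched as follows. This is a minimal illustration, not yarl's code: `configure()`, `_UNSET`, `old_size`, and `new_size` are hypothetical names, and the function only returns the effective size instead of rebuilding a cache.

```python
import warnings

_UNSET = object()  # hypothetical sentinel: tells "not passed" apart from None
_DEFAULT = 512


def configure(*, old_size=_UNSET, new_size=_DEFAULT):
    """Return the effective cache size; ``old_size`` is the deprecated knob."""
    if old_size is not _UNSET:
        warnings.warn(
            "old_size is deprecated, use new_size instead",
            DeprecationWarning,
            stacklevel=2,
        )
        if old_size is None or new_size is None:
            return None  # None means "unbounded" and always wins
        return max(old_size, new_size)  # the larger request wins
    return new_size
```

Using a sentinel rather than a numeric default lets the function warn only when the caller actually passed the deprecated argument, including when they passed `None`.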
29 changes: 19 additions & 10 deletions docs/api.rst
@@ -1039,41 +1039,50 @@ Cache control

IDNA conversion, host validation, and IP Address parsing used for host
encoding are quite expensive operations, that's why the ``yarl``
library caches these calls by storing last ``256`` results in the
library caches these calls by storing the last ``512`` results in the
global LRU cache.

.. function:: cache_clear()

Clear IDNA, host validation, and IP Address caches.
Clear the IDNA and host encoding caches.


.. function:: cache_info()

Return a dictionary with ``"idna_encode"``, ``"idna_decode"``, ``"ip_address"``,
and ``"host_validate"`` keys, each value
``"host_validate"``, and ``"encode_host"`` keys, each value
points to corresponding ``CacheInfo`` structure (see :func:`functools.lru_cache` for
details):

.. doctest::
:options: +SKIP

>>> yarl.cache_info()
{'idna_encode': CacheInfo(hits=5, misses=5, maxsize=256, currsize=5),
'idna_decode': CacheInfo(hits=24, misses=15, maxsize=256, currsize=15),
'ip_address': CacheInfo(hits=46933, misses=84, maxsize=256, currsize=101),
'host_validate': CacheInfo(hits=0, misses=0, maxsize=256, currsize=0)}
{'idna_encode': CacheInfo(hits=5, misses=5, maxsize=512, currsize=5),
'idna_decode': CacheInfo(hits=24, misses=15, maxsize=512, currsize=15),
'ip_address': CacheInfo(hits=46933, misses=84, maxsize=512, currsize=101),
'host_validate': CacheInfo(hits=0, misses=0, maxsize=512, currsize=0),
'encode_host': CacheInfo(hits=0, misses=0, maxsize=512, currsize=0)}

.. versionchanged:: 1.16

``idna_encode``, ``ip_address``, and ``host_validate``
are deprecated in favor of a single ``encode_host`` cache.

.. function:: cache_configure(*, idna_encode_size=256, idna_decode_size=256, ip_address_size=256, host_validate_size=256)
.. function:: cache_configure(*, idna_encode_size=512, idna_decode_size=512, ip_address_size=512, host_validate_size=512, encode_host_size=512)

Set the IP Address, host validation, and IDNA encode and
decode cache sizes (``256`` for each by default).
Set the IP Address, host validation, host encode, and IDNA encode and
decode cache sizes (``512`` for each by default).

Pass ``None`` to make the corresponding cache unbounded (may speed up host encoding
operation a little but the memory footprint can be very high,
please use with caution).

.. versionchanged:: 1.16

``idna_encode_size``, ``ip_address_size``, and ``host_validate_size``
are deprecated in favor of a single ``encode_host`` cache.
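
The difference between a bounded and an unbounded cache, and the fields of the ``CacheInfo`` structure, can be seen with plain :func:`functools.lru_cache`. This is a small sketch with a hypothetical `slow_upper()` stand-in; it does not call yarl.

```python
from functools import lru_cache


def slow_upper(text: str) -> str:
    """Hypothetical stand-in for an expensive, pure function."""
    return text.upper()


bounded = lru_cache(maxsize=2)(slow_upper)       # evicts least recently used
unbounded = lru_cache(maxsize=None)(slow_upper)  # never evicts, grows freely

for host in ("a", "b", "c", "a"):
    bounded(host)
    unbounded(host)

# With only two slots, "a" was evicted before it was requested again.
bounded_info = bounded.cache_info()
# The unbounded cache kept "a", so the last call was a hit.
unbounded_info = unbounded.cache_info()
```

Here the bounded cache records four misses and no hits, while the unbounded one records three misses and one hit at the cost of keeping every entry in memory.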

References
----------

39 changes: 32 additions & 7 deletions tests/test_cache.py
@@ -1,3 +1,5 @@
import pytest

import yarl

# Don't check the actual behavior but make sure that calls are allowed
@@ -13,7 +15,13 @@ def test_cache_clear() -> None:

def test_cache_info() -> None:
info = yarl.cache_info()
assert info.keys() == {"idna_encode", "idna_decode", "ip_address", "host_validate"}
assert info.keys() == {
"idna_encode",
"idna_decode",
"ip_address",
"host_validate",
"encode_host",
}


def test_cache_configure_default() -> None:
@@ -22,17 +30,34 @@ def test_cache_configure_default() -> None:

def test_cache_configure_None() -> None:
yarl.cache_configure(
idna_encode_size=None,
idna_decode_size=None,
ip_address_size=None,
host_validate_size=None,
encode_host_size=None,
)


def test_cache_configure_explicit() -> None:
yarl.cache_configure(
idna_encode_size=128,
idna_decode_size=128,
ip_address_size=128,
host_validate_size=128,
encode_host_size=128,
)


def test_cache_configure_waring() -> None:
    msg = (
        r"cache_configure\(\) no longer accepts idna_encode_size, ip_address_size, "
        r"or host_validate_size arguments, they are used to set the "
        r"encode_host_size instead and will be removed in the future"
    )
    with pytest.warns(DeprecationWarning, match=msg):
        yarl.cache_configure(
            idna_encode_size=1024,
            idna_decode_size=1024,
            ip_address_size=1024,
            host_validate_size=1024,
        )

    assert yarl.cache_info()["encode_host"].maxsize == 1024
    with pytest.warns(DeprecationWarning, match=msg):
        yarl.cache_configure(host_validate_size=None)

    assert yarl.cache_info()["encode_host"].maxsize is None
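
The warning test above escapes the parentheses in its pattern because the message is matched as a regular expression. The same check can be sketched with only the standard library; `legacy_call()` and its message are hypothetical and stand in for a deprecated call.

```python
import re
import warnings


def legacy_call() -> None:
    """Hypothetical function that emits a deprecation warning."""
    warnings.warn(
        "configure() no longer accepts old_size",
        DeprecationWarning,
        stacklevel=2,
    )


# re.escape() turns "()" into literal parentheses instead of an empty group.
pattern = re.escape("configure() no longer accepts")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure nothing is filtered out
    legacy_call()

matched = [
    w
    for w in caught
    if issubclass(w.category, DeprecationWarning)
    and re.search(pattern, str(w.message))
]
```

Without the escaping, `()` would be read as an empty group and the pattern would not match the literal text of the message.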
2 changes: 1 addition & 1 deletion yarl/__init__.py
@@ -8,7 +8,7 @@
cache_info,
)

__version__ = "1.15.6.dev0"
__version__ = "1.16.0.dev0"

__all__ = (
"URL",
207 changes: 92 additions & 115 deletions yarl/_url.py
@@ -88,6 +88,7 @@ class CacheInfo(TypedDict):
idna_decode: _CacheInfo
ip_address: _CacheInfo
host_validate: _CacheInfo
encode_host: _CacheInfo


class _InternalURLCache(TypedDict, total=False):
@@ -240,78 +241,6 @@ def _check_netloc(netloc: str) -> None:
)


@lru_cache # match the same size as urlsplit
def _parse_host(host: str) -> tuple[bool, str, Union[bool, None], str, str, str]:
"""Parse host into parts

Returns a tuple of:
- True if the host looks like an IP address, False otherwise.
- Lowercased host
- True if the host is ASCII-only, False otherwise.
- Raw IP address
- Separator between IP address and zone
- Zone part of the IP address
"""
lower_host = host.lower()
is_ascii = host.isascii()

# If the host ends with a digit or contains a colon, its likely
# an IP address.
if host and (host[-1].isdigit() or ":" in host):
if "%" in host:
return True, lower_host, is_ascii, *host.partition("%")
return True, lower_host, is_ascii, host, "", ""

return False, lower_host, is_ascii, "", "", ""


def _encode_host(host: str, validate_host: bool) -> str:
"""Encode host part of URL."""
looks_like_ip, lower_host, is_ascii, raw_ip, sep, zone = _parse_host(host)
if looks_like_ip:
# If it looks like an IP, we check with _ip_compressed_version
# and fall-through if its not an IP address. This is a performance
# optimization to avoid parsing IP addresses as much as possible
# because it is orders of magnitude slower than almost any other
# operation this library does.
# Might be an IP address, check it
#
# IP Addresses can look like:
# https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2
# - 127.0.0.1 (last character is a digit)
# - 2001:db8::ff00:42:8329 (contains a colon)
# - 2001:db8::ff00:42:8329%eth0 (contains a colon)
# - [2001:db8::ff00:42:8329] (contains a colon -- brackets should
# have been removed before it gets here)
# Rare IP Address formats are not supported per:
# https://datatracker.ietf.org/doc/html/rfc3986#section-7.4
#
# IP parsing is slow, so its wrapped in an LRU
try:
host, version = _ip_compressed_version(raw_ip)
except ValueError:
pass
else:
# These checks should not happen in the
# LRU to keep the cache size small
if version == 6:
return f"[{host}%{zone}]" if sep else f"[{host}]"
return f"{host}%{zone}" if sep else host

# IDNA encoding is slow,
# skip it for ASCII-only strings
# Don't move the check into _idna_encode() helper
# to reduce the cache size
if is_ascii:
# Check for invalid characters explicitly; _idna_encode() does this
# for non-ascii host names.
if validate_host:
_host_validate(lower_host)
return lower_host

return _idna_encode(lower_host)


@lru_cache # match the same size as urlsplit
def _split_netloc(
netloc: str,
@@ -1737,7 +1666,7 @@ def _human_quote(s: Union[str, None], unsafe: str) -> Union[str, None]:
return "".join(c if c.isprintable() else quote(c) for c in s)


_MAXCACHE = 256
_MAXCACHE = 512


@lru_cache(_MAXCACHE)
@@ -1749,72 +1678,120 @@ def _idna_decode(raw: str) -> str:


@lru_cache(_MAXCACHE)
def _idna_encode(host: str) -> str:
def _encode_host(host: str, validate_host: bool) -> str:
"""Encode host part of URL."""
# If the host ends with a digit or contains a colon, it's likely
# an IP address.
if host and (host[-1].isdigit() or ":" in host):
raw_ip, sep, zone = host.partition("%")
# If it looks like an IP, we check with _ip_compressed_version
# and fall through if it's not an IP address. This is a performance
# optimization to avoid parsing IP addresses as much as possible
# because it is orders of magnitude slower than almost any other
# operation this library does.
# Might be an IP address, check it
#
# IP Addresses can look like:
# https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2
# - 127.0.0.1 (last character is a digit)
# - 2001:db8::ff00:42:8329 (contains a colon)
# - 2001:db8::ff00:42:8329%eth0 (contains a colon)
# - [2001:db8::ff00:42:8329] (contains a colon -- brackets should
# have been removed before it gets here)
# Rare IP Address formats are not supported per:
# https://datatracker.ietf.org/doc/html/rfc3986#section-7.4
#
# IP parsing is slow, so it's wrapped in an LRU
try:
ip = ip_address(raw_ip)
except ValueError:
pass
else:
# These checks should not happen in the
# LRU to keep the cache size small
host = ip.compressed
if ip.version == 6:
return f"[{host}%{zone}]" if sep else f"[{host}]"
return f"{host}%{zone}" if sep else host

# IDNA encoding is slow, skip it for ASCII-only strings
if host.isascii():
# Check for invalid characters explicitly; _idna_encode() does this
# for non-ascii host names.
if validate_host and (invalid := _not_reg_name.search(host)):
value, pos, extra = invalid.group(), invalid.start(), ""
if value == "@" or (value == ":" and "@" in host[pos:]):
# this looks like an authority string
extra = (
", if the value includes a username or password, "
"use 'authority' instead of 'host'"
)
raise ValueError(
f"Host {host!r} cannot contain {value!r} (at position {pos}){extra}"
) from None
return host.lower()

try:
return idna.encode(host, uts46=True).decode("ascii")
return idna.encode(host.lower(), uts46=True).decode("ascii")
except UnicodeError:
return host.encode("idna").decode("ascii")
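
The IP heuristic and the single outer cache used by the new `_encode_host` can be sketched in isolation. This is a simplified illustration under stated assumptions: `encode_ip_host()` is a hypothetical name, and host validation and IDNA encoding are left out on purpose.

```python
from functools import lru_cache
from ipaddress import ip_address


@lru_cache(maxsize=512)  # one cache around the whole function
def encode_ip_host(host: str) -> str:
    """Normalise IP literals; lowercase everything else (simplified)."""
    # Cheap pre-check: only hosts ending in a digit or containing a
    # colon are handed to the comparatively slow IP parser.
    if host and (host[-1].isdigit() or ":" in host):
        raw_ip, sep, zone = host.partition("%")
        try:
            ip = ip_address(raw_ip)
        except ValueError:
            pass  # looked like an IP but is not one, fall through
        else:
            text = ip.compressed
            if sep:
                text = f"{text}%{zone}"
            return f"[{text}]" if ip.version == 6 else text
    return host.lower()
```

Because the cache wraps the whole function, a repeated host is answered from a single lookup and never reaches the IP parser again.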


@lru_cache(_MAXCACHE)
def _ip_compressed_version(raw_ip: str) -> tuple[str, int]:
"""Return compressed version of IP address and its version."""
ip = ip_address(raw_ip)
return ip.compressed, ip.version


@lru_cache(_MAXCACHE)
def _host_validate(host: str) -> None:
"""Validate an ascii host name."""
invalid = _not_reg_name.search(host)
if invalid is None:
return
value, pos, extra = invalid.group(), invalid.start(), ""
if value == "@" or (value == ":" and "@" in host[pos:]):
# this looks like an authority string
extra = (
", if the value includes a username or password, "
"use 'authority' instead of 'host'"
)
raise ValueError(
f"Host {host!r} cannot contain {value!r} (at position " f"{pos}){extra}"
) from None


@rewrite_module
def cache_clear() -> None:
"""Clear all LRU caches."""
_idna_decode.cache_clear()
_idna_encode.cache_clear()
_ip_compressed_version.cache_clear()
_host_validate.cache_clear()
_encode_host.cache_clear()


@rewrite_module
def cache_info() -> CacheInfo:
"""Report cache statistics."""
return {
"idna_encode": _idna_encode.cache_info(),
"idna_encode": _encode_host.cache_info(),
"idna_decode": _idna_decode.cache_info(),
"ip_address": _ip_compressed_version.cache_info(),
"host_validate": _host_validate.cache_info(),
"ip_address": _encode_host.cache_info(),
"host_validate": _encode_host.cache_info(),
"encode_host": _encode_host.cache_info(),
}


_SENTINEL = object()


@rewrite_module
def cache_configure(
*,
idna_encode_size: Union[int, None] = _MAXCACHE,
idna_encode_size: Union[int, None, object] = _SENTINEL,
idna_decode_size: Union[int, None] = _MAXCACHE,
ip_address_size: Union[int, None] = _MAXCACHE,
host_validate_size: Union[int, None] = _MAXCACHE,
ip_address_size: Union[int, None, object] = _SENTINEL,
host_validate_size: Union[int, None, object] = _SENTINEL,
encode_host_size: Union[int, None] = _MAXCACHE,
) -> None:
"""Configure LRU cache sizes."""
global _idna_decode, _idna_encode, _ip_compressed_version, _host_validate
global _idna_decode, _encode_host
    # idna_encode_size, ip_address_size, host_validate_size are no longer
    # used, but are kept for backwards compatibility.
    if encode_host_size is not None:
        for size in (idna_encode_size, ip_address_size, host_validate_size):
            if size is not _SENTINEL:
                warnings.warn(
                    "cache_configure() no longer accepts idna_encode_size, "
                    "ip_address_size, or host_validate_size arguments, "
                    "they are used to set the encode_host_size instead "
                    "and will be removed in the future",
                    DeprecationWarning,
                    stacklevel=2,
                )
            if size is None:
                encode_host_size = None
                break
            elif size is _SENTINEL:
                size = _MAXCACHE
            if TYPE_CHECKING:
                assert isinstance(size, int)
            if size > encode_host_size:
                encode_host_size = size

_idna_encode = lru_cache(idna_encode_size)(_idna_encode.__wrapped__)
_encode_host = lru_cache(encode_host_size)(_encode_host.__wrapped__)
_idna_decode = lru_cache(idna_decode_size)(_idna_decode.__wrapped__)
_ip_compressed_version = lru_cache(ip_address_size)(
_ip_compressed_version.__wrapped__
)
_host_validate = lru_cache(host_validate_size)(_host_validate.__wrapped__)