-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(router) support for regex URIs #2681
Conversation
52e11ad
to
f9b4dbd
Compare
kong/core/router.lua
Outdated
-- regex URI | ||
local strip_regex = uri .. [[/?(?P<stripped_uri>.*)]] | ||
|
||
local from, _, err = re_find(uri, [[\(.*\)]], "jo") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this expression needs some more fleshing out. for example, this matches the literal string
\(foo\)
which is not a capture group. additionally, the *
quantifier results in greedy + backtracking behavior; we can use a more performant negative char class ([^)]+
).
as an alternative, consider the following expression (untested, but passing for this scenario):
(?<!\\)\([^)]+(?<!\\)\)
for readability, we may want to consider using char groups instead of escaping \
, as well as the x
opt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I implemented this towards the end and totally did not give it proper attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should even consider using Lua patterns for this instead, as the PCRE is overkill for such a use-cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a solution with patterns that I think covers those scenarios:
local uris = {
["/users/\\(foo\\)"] = false,
["/users/\\(foo)"] = false,
["/users/(foo\\)"] = false,
["/users/(foo"] = false,
["/users/(foo)"] = true,
["/users/(hello(foo)world)"] = true,
["/users/(hello(foo)world"] = true,
}
for uri, expected_to_match in pairs(uris) do
local s = string.match(uri, "[^\\]%(.-[^\\]%)")
print(uri, ": ", s)
if expected_to_match then
assert(s ~= nil, uri .. " was supposed to match")
else
assert(s == nil, uri .. " was not supposed to match")
end
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my only comment here is that we dont have any test uris in this harness that look like:
/users/(foo)/thing/(bar)
where we have more than one capture group. i dont think there is any different here, but if only for completeness it's worth noting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it and other test cases for completeness. It gets tricky for capture groups at the start of the evaluated string, and unbalanced captured groups as well:
local uris = {
["/users/\\(foo\\)"] = false,
["/users/\\(\\)"] = false,
["/users/(foo)"] = true,
["/users/()"] = true,
["/users/()/foo"] = true,
["/users/(hello(foo)world)"] = true,
["/users/(hello(foo)world"] = true,
["/users/(foo)/thing/(bar)"] = true,
["/users/\\(foo\\)/thing/(bar)"] = true,
-- 0-indexed capture groups
["()/world"] = true,
["(/hello)/world"] = true,
-- unbalanced capture groups
["(/hello\\)/world"] = false,
["/users/(foo"] = false,
["/users/\\(foo)"] = false,
["/users/(foo\\)"] = false,
}
for uri, expected_to_match in pairs(uris) do
local s = string.match(uri, "[^\\]%(.-[^\\]%)")
s = s or string.match(uri, "^%(.-[^\\]%)")
s = s or string.match(uri, "%(%)")
print(uri, ": ", s)
if expected_to_match then
assert(s ~= nil, uri .. " was supposed to match")
else
assert(s == nil, uri .. " was not supposed to match")
end
end
This looks fairly complete to me now.
kong/core/router.lua
Outdated
end | ||
if regex_t.has_captures then | ||
m[0] = nil | ||
ctx.captures = m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would like to see a more meaningful name for this array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to do so but wanted to rename it uri_captures
before submission, as per the return value from the router.
kong/core/router.lua
Outdated
ok, clear_tab = pcall(require, "table.clear") | ||
if not ok then | ||
clear_tab = function(tab) | ||
for k, _ in pairs(tab) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could remove this useless _
variable here.
f9b4dbd
to
c9e3aad
Compare
kong/dao/schemas/apis.lua
Outdated
-- URI contains characters outside of the reserved list of | ||
-- RFC 3986: the value will be interpreted as a regex; | ||
-- but is it a valid one? | ||
local _, _, err = ngx.re.find("", uri) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use the same flags as in router here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "aj"
? Although to the best of my knowledge, I don't see why they would result in an error and not the default flags. I'm not feeling strongly about it, but if we decide to, then the cors plugin's schema, which is using the same trick, should maybe do so as well. I don't think we should use o
though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now it is probably not going to matter, but there are flags, that I think may change how regexes are interpreted (e.g. x
). I was just thinking it would be good if we follow the same arguments here. Mostly because there is no harm adding them.
Do we need to consider this: |
Looks good and going to test this next week. It seems to be most useful with strip_uri = false, as in most cases the upstream most probably would need to have the matching regex parts as well. A nice feature would to be able to specify which part of the uri should be stripped; complete uri, or up to the first match. In that case it would be possible to strip only a part of the uri and have the matching parts forwarded to your upstream. |
So far it seems to be working as expected. The output seems to be somewhat strange when creating an API (and getting the API). See the
This was just created with a regex like Secondly, it seems not possible to start your uri with a Same goes for ending with $ (for example |
Found another issue: if the URI is not ending with a regex (for example \d+) the API is not found:
|
@RolphH, Using |
spec/01-unit/11-router_spec.lua
Outdated
@@ -480,6 +544,25 @@ describe("Router", function() | |||
assert.same(use_case[2], api_t.api) | |||
end) | |||
|
|||
it("does not supersedes another API with a longer [uri prefix]", function() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"does not supersede"
@@ -77,9 +77,15 @@ describe("Router", function() | |||
{ | |||
name = "api-4", | |||
upstream_url = "http://httpbin.org/basic-auth", | |||
uris = { "/user" }, | |||
uris = { "/private" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/user
conflicts with the new "[uri] with a regex"
test below which requests /users/foo/profile
. And since /private
is a better alias for the upstream_uri associated with this test (/basic-auth
), story checks out.
spec/01-unit/11-router_spec.lua
Outdated
assert.same(use_case[1], api_t.api) | ||
end) | ||
|
||
it("matches the right API when several ones have a [uri regex]", function() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious, what happens when use_case
looks like:
local use_case = {
{
name = "api-1",
uris = { [[/api/persons/\d{2}]] },
},
{
name = "api-2",
uris = { [[/api/persons/\d{3}]] },
},
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the use case internally debated where the length of the regex URI does not have any meaning in terms of matching prioritization. In this case, because /api/persons/\d{2}
was defined first, it will be evaluated first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is the proper evaluation of this, just wanted to clarify. tyty
spec/01-unit/11-router_spec.lua
Outdated
it("matches when given [uri] is in request URI prefix", function() | ||
-- uri prefix | ||
local api_t = router.select("GET", "/my-api/some/path") | ||
assert.truthy(api_t) | ||
assert.same(use_case[3], api_t.api) | ||
end) | ||
|
||
it("does not superseds another API with a longer URI prefix", function() | ||
it("does not superseds another API with a longer [uri]", function() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated to change, but should be "supersedes"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it "does not supersede" btw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
@RolphH can you check your regex? you defined |
This is just a result of the data being JSON encoded (backslashes are escaped). This is expected behavior :) update I was under the impression you were referring to the double backslash, not the missing |
@p0pr0ck5 cheked, and I added with \d+. curl -i -X POST --url http://localhost:8001/apis/ --data 'name=testl-api' --data 'uris=/test/\d+/resource' --data 'upstream_url=http://httpbin.org' {"http_if_terminated":false,"id":"aeb5670c-2c7b-473b-b878-0a601c1df86a","retries":5,"preserve_host":false,"created_at":1499715928000,"upstream_connect_timeout":60000,"upstream_url":"http://httpbin.org","upstream_read_timeout":60000,"https_only":false,"upstream_send_timeout":60000,"strip_uri":true,"name":"testl-api","uris":["/test/\d /resource"]} |
Hi @RolphH, You will need to url encode your
Either that, or send your body with |
69c7ce0
to
ba52268
Compare
Updated with the provided feedback and a couple of other improvements. @RolphH, it would be of great help if you could try your use-case again (with the proper encoding or MIME type), and let us know your thoughts and whether your use-case is covered. Thanks! |
@thibaultcha a quick check shows that it is working as expected. Going to do more tests tomorrow. |
@@ -62,18 +72,25 @@ local match_api | |||
local reduce | |||
|
|||
|
|||
local function has_capturing_groups(subj) | |||
local s = find(subj, "[^\\]%(.-[^\\]%)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weird whitespacing alignment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intentional
spec/01-unit/11-router_spec.lua
Outdated
@@ -434,7 +474,31 @@ describe("Router", function() | |||
assert.same(use_case[2], api_t.api) | |||
end) | |||
|
|||
it("[method] does not supersede non-plain [uri]", function() | |||
it("half [uri regex] amd [method] match does not supersede another API", function() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amd -> and?
spec/01-unit/11-router_spec.lua
Outdated
["()/world"] = true, | ||
["(/hello)/world"] = true, | ||
|
||
["/users/\\(foo\\)"] = false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
odd whitespacing alignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intentional
ba52268
to
fd0b11d
Compare
@thibaultcha I have tested the feature and with the improvements it works like a charm for our use case. With this, we are able to start using Kong in our environment once 0.11 is released (we will start with this change in our dev environment). |
kong/core/router.lua
Outdated
insert(api_t.hosts, { | ||
wildcard = true, | ||
value = host_value, | ||
regex = wildcard_host_regex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pedantic, missing trailing comma
kong/core/router.lua
Outdated
if re_match(uri, [[^[a-zA-Z0-9\.\-_~/%]*$]]) then | ||
-- plain URI or URI prefix | ||
local escaped_uri = [[\Q]] .. uri .. [[\E]] | ||
local strip_regex = escaped_uri .. [[/?(?P<stripped_uri>.*)]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the P
isnt necessary; i wonder if removing it will help keep things cleaner:
$ cat t.lua
local m, err = ngx.re.match("hello, world", "[a-z](?<name>[a-z]+)", "ajo")
print(m[0])
print(m[1])
print(m.name)
$ resty ./t.lua
hello
ello
ello
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it isn't, and have no strong feelings about either syntaxes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would say lets get rid of it, in favor of being concise, and we're not python/php ;)
kong/core/router.lua
Outdated
} | ||
else | ||
-- regex URI | ||
local strip_regex = uri .. [[/?(?P<stripped_uri>.*)]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem re: P
kong/core/router.lua
Outdated
return | ||
end | ||
if regex_t.has_captures then | ||
m[0] = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering a bit about this. is there logic around removing this group? are there no cases where we would not want to expose the whole match further on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would be, but only if we also expose a match from a plain/prefix URI as well, otherwise the behavior is inconsistent between the two possibilities. Since we do not (for now) expose it on a plain match, we do not expose it on a regexp match either.
LGTM besides 2 pending comments |
b966ab2
to
678c84a
Compare
When matching URIs, we iterate over a list of APIs, and not a list of URIs (for various, performance reasons, although a refactor of the router is scheduled with some performance improvements later on). This list of APIs is sorted per URI length. But an API can have more than one URI. As such, the router can potentially evaluate an API with a very long URI, and a much shorter one that matches the request URI, instead of a following API which would be an exact match for that API. To fix this, we sort our deserialized list of URIs to iterate on in the select method, and if we do have a match, we conserve the URI that was matched, to check if it belongs to the right API at API match time. Fix #2662
When a value from an API's `uris` property does not respect the reserved character set from RFC 3986, the router now assumes this value is a user-specified regex. User regexes are stored in another index than what we consider "URI prefixes" (non-regex `uris` values), because while the later needs to be evaluated based on their length, the same does not apply for the former. The order in which regexes are specified (and ultimately, APIs are registered) will be the order in which they will be evaluated. Because users might specify their own capturing groups once we support parameters extraction, the URI stripping now uses named capturing groups. This requires PCRE 7.2+. The original implementation used numbered capturing groups, but this is not aligned with our long-term goal of allowing dynamic URI rewriting as part of our Plugins API or request-transformer plugin. Breaking changes: requires PCRE 7.2+ Implements: #677 (partially)
APIs can now receive user-specified regexes in their `uris` property. Those regexes will be treated as such by the router. A URI is considered a regex when its character set does not respect the reserved list from RFC 3986. Implements: #677 (partially)
* remove some duplicated tests * clarify the titles of a few tests to better understand their scope at first sight
We now perform efficient matching for URIs based on its property and the regex being evaluated. We favorise running `ngx.re.find` when the more expensive `ngx.re.match` is not necessary. When a regex URI has capture groups, we return them from the router `exec()` for further processing inside the core, which will be responsible for passing them along to plugins. This also has the advantage of being more performant on router cache hits, since URI stripping is done only once, at matching type, and cached in the LRU structure. The same way, groups matching is cached as well. Implements: #677 (partially)
We now return any value that matched back to the core so that plugins can make use of it. This also improves the router performance by moving the URI stripping/trailing slash/preserve host header to be properly cached by the router. We also simplify the return values of the router by providing a table instead of yet one more return value.
* use string.find instead of re.find whenever possible when matching URI prefixes * use string.sub for prefix URI stripping instead of re.match * use string.find/string.sub for Host port stripping
678c84a
to
1dec37d
Compare
0.11.0 feature introduced in Kong/kong#2681
0.11.0 feature introduced in Kong/kong#2681
Is there a way to use the captured values in the upsteam url? I tried with |
@wandenberg The Mashape Enterprise subscription has an updated version of the request-transformer plugin that allows for substituting such captured groups directly in the Otherwise, those values are exposed in the Lua land via the |
Thanks for the quick response @thibaultcha I will take a look on it. |
Summary
APIs now accept regex values in their
uris
property. Any such value that does not respect the reserved character set of RFC 3986 is currently considered a regex.URI regexes can be useful for use-cases like:
/users/\d+/profile
/version/\d+/users/\S+
APIs with
strip_uri = true
will correctly strip off the entire matching regex from the request URI.We also support named (or numbered) groups captures:
/users/(?P<user_id>\d+)/profile
/version/(?P<version>\d+)/users/(?P<user>\S+)
Those URIs, when matched, will make the router return a new table with both a hash and an array part, to be used at the user's convenience:
This value is not used by the core as of yet, and will be in a later contribution, and will get attached to the request's
ngx.ctx
structure for later usage in the plugins, at the user's convenience.Full changelog
uris
values with characters outside of the reserved set (only if they are considered valid regexes as per the ngx_lua embedded PCRE interface)Issues resolved
Implement #677