Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(router) support for regex URIs #2681

Merged
merged 7 commits into from
Jul 20, 2017
Merged

Conversation

thibaultcha
Copy link
Member

@thibaultcha thibaultcha commented Jul 7, 2017

Summary

APIs now accept regex values in their uris property. Any such value that does not respect the reserved character set of RFC 3986 is currently considered a regex.

URI regexes can be useful for use-cases like:

  • /users/\d+/profile
  • /version/\d+/users/\S+

APIs with strip_uri = true will correctly strip off the entire matching regex from the request URI.

We also support named (or numbered) groups captures:

  • /users/(?P<user_id>\d+)/profile
  • /version/(?P<version>\d+)/users/(?P<user>\S+)

Those URIs, when matched, will make the router return a new table with both a hash and an array part, to be used at the user's convenience:

{ "1", "john", version = "1", user = "john" }

This value is not used by the core as of yet, and will be in a later contribution, and will get attached to the request's ngx.ctx structure for later usage in the plugins, at the user's convenience.

Full changelog

  • ⚠️ Breaking change: Kong now requires OpenResty to be compiled with PCRE 7.2+
  • 🎆 APIs now accept uris values with characters outside of the reserved set (only if they are considered valid regexes as per the ngx_lua embedded PCRE interface)
  • 🎆 The router now evaluates regex URIs
  • 🎆 Improve the performance and maintainability of the router
  • 🎆 The router captures matching groups from a regex URI if any
  • Provide related tests for API schema, router and proxy integration
  • Slight cleanup of the test suite

Issues resolved

Implement #677

-- regex URI
local strip_regex = uri .. [[/?(?P<stripped_uri>.*)]]

local from, _, err = re_find(uri, [[\(.*\)]], "jo")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this expression needs some more fleshing out. for example, this matches the literal string

\(foo\)

which is not a capture group. additionally, the * quantifier results in greedy + backtracking behavior; we can use a more performant negative char class ([^)]+).

as an alternative, consider the following expression (untested, but passing for this scenario):

(?<!\\)\([^)]+(?<!\\)\)

for readability, we may want to consider using char groups instead of escaping \, as well as the x opt.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I implemented this towards the end and totally did not give it proper attention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should even consider using Lua patterns for this instead, as the PCRE is overkill for such a use-cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a solution with patterns that I think covers those scenarios:

local uris             = {
  ["/users/\\(foo\\)"] = false,
  ["/users/\\(foo)"]   = false,
  ["/users/(foo\\)"]   = false,
  ["/users/(foo"]      = false,

  ["/users/(foo)"]             = true,
  ["/users/(hello(foo)world)"] = true,
  ["/users/(hello(foo)world"]  = true,
}

for uri, expected_to_match in pairs(uris) do
  local s = string.match(uri, "[^\\]%(.-[^\\]%)")
  print(uri, ":    ", s)

  if expected_to_match then
    assert(s ~= nil, uri .. " was supposed to match")

  else
    assert(s == nil, uri .. " was not supposed to match")
  end
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my only comment here is that we dont have any test uris in this harness that look like:

/users/(foo)/thing/(bar)

where we have more than one capture group. i dont think there is any different here, but if only for completeness it's worth noting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it and other test cases for completeness. It gets tricky for capture groups at the start of the evaluated string, and unbalanced captured groups as well:

local uris             = {
  ["/users/\\(foo\\)"] = false,
  ["/users/\\(\\)"]    = false,

  ["/users/(foo)"]                 = true,
  ["/users/()"]                    = true,
  ["/users/()/foo"]                = true,
  ["/users/(hello(foo)world)"]     = true,
  ["/users/(hello(foo)world"]      = true,
  ["/users/(foo)/thing/(bar)"]     = true,
  ["/users/\\(foo\\)/thing/(bar)"] = true,

  -- 0-indexed capture groups
  ["()/world"]       = true,
  ["(/hello)/world"] = true,

  -- unbalanced capture groups
  ["(/hello\\)/world"] = false,
  ["/users/(foo"]      = false,
  ["/users/\\(foo)"]   = false,
  ["/users/(foo\\)"]   = false,
}

for uri, expected_to_match in pairs(uris) do
  local s = string.match(uri, "[^\\]%(.-[^\\]%)")
  s = s or string.match(uri, "^%(.-[^\\]%)")
  s = s or string.match(uri, "%(%)")
  print(uri, ":    ", s)

  if expected_to_match then
    assert(s ~= nil, uri .. " was supposed to match")

  else
    assert(s == nil, uri .. " was not supposed to match")
  end
end

This looks fairly complete to me now.

end
if regex_t.has_captures then
m[0] = nil
ctx.captures = m
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would like to see a more meaningful name for this array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to do so but wanted to rename it uri_captures before submission, as per the return value from the router.

ok, clear_tab = pcall(require, "table.clear")
if not ok then
clear_tab = function(tab)
for k, _ in pairs(tab) do
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove this useless _ variable here.

@thibaultcha thibaultcha force-pushed the feat/uris-regex-support branch from f9b4dbd to c9e3aad Compare July 8, 2017 02:22
@thibaultcha thibaultcha added this to the 0.11.0rc2 milestone Jul 8, 2017
@thibaultcha thibaultcha self-assigned this Jul 8, 2017
-- URI contains characters outside of the reserved list of
-- RFC 3986: the value will be interpreted as a regex;
-- but is it a valid one?
local _, _, err = ngx.re.find("", uri)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the same flags as in router here?

Copy link
Member Author

@thibaultcha thibaultcha Jul 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "aj"? Although to the best of my knowledge, I don't see why they would result in an error and not the default flags. I'm not feeling strongly about it, but if we decide to, then the cors plugin's schema, which is using the same trick, should maybe do so as well. I don't think we should use o though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it is probably not going to matter, but there are flags, that I think may change how regexes are interpreted (e.g. x). I was just thinking it would be good if we follow the same arguments here. Mostly because there is no harm adding them.

@bungle
Copy link
Member

bungle commented Jul 8, 2017

Do we need to consider this:
https://tools.ietf.org/html/rfc3987

@RolphH
Copy link

RolphH commented Jul 8, 2017

Looks good and going to test this next week.

It seems to be most useful with strip_uri = false, as in most cases the upstream most probably would need to have the matching regex parts as well.

A nice feature would to be able to specify which part of the uri should be stripped; complete uri, or up to the first match. In that case it would be possible to strip only a part of the uri and have the matching parts forwarded to your upstream.

@RolphH
Copy link

RolphH commented Jul 10, 2017

So far it seems to be working as expected. The output seems to be somewhat strange when creating an API (and getting the API). See the \\d / in response below:

{ ..., "uris": [ "/v1/test/\\d /testresource", "/v1/test/\\d /testresource2", "/v1/test2/\\d /" ] }

This was just created with a regex like /v1/test/\d+/testresource.

Secondly, it seems not possible to start your uri with a ^. Is this used by default or not needed? Currently most of our locations start with a ^.

Same goes for ending with $ (for example /v1/test/\d+$ or (bit more complex) /v1/test\d+(/|$))

@RolphH
Copy link

RolphH commented Jul 10, 2017

Found another issue: if the URI is not ending with a regex (for example \d+) the API is not found:

"uris": [ "/test/\\d", "/test2/\\d /resource" ]

curl http://localhost:8000/test2/123/resource
{"message":"no API found with those values"}

@bungle
Copy link
Member

bungle commented Jul 10, 2017

@RolphH, Using ^ is possible but not needed as the regexes are compiled in anchored mode (so they are always anchored). I'm not sure if we want to have flexibility in that, and does that flexibility cause problems with url stripping.

@@ -480,6 +544,25 @@ describe("Router", function()
assert.same(use_case[2], api_t.api)
end)

it("does not supersedes another API with a longer [uri prefix]", function()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"does not supersede"

@@ -77,9 +77,15 @@ describe("Router", function()
{
name = "api-4",
upstream_url = "http://httpbin.org/basic-auth",
uris = { "/user" },
uris = { "/private" },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the reason for this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/user conflicts with the new "[uri] with a regex" test below which requests /users/foo/profile. And since /private is a better alias for the upstream_uri associated with this test (/basic-auth), story checks out.

assert.same(use_case[1], api_t.api)
end)

it("matches the right API when several ones have a [uri regex]", function()
Copy link
Contributor

@p0pr0ck5 p0pr0ck5 Jul 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, what happens when use_case looks like:

local use_case = {
{
name = "api-1",
uris = { [[/api/persons/\d{2}]] },
},
{
name = "api-2",
uris = { [[/api/persons/\d{3}]] },
},
}

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the use case internally debated where the length of the regex URI does not have any meaning in terms of matching prioritization. In this case, because /api/persons/\d{2} was defined first, it will be evaluated first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is the proper evaluation of this, just wanted to clarify. tyty

it("matches when given [uri] is in request URI prefix", function()
-- uri prefix
local api_t = router.select("GET", "/my-api/some/path")
assert.truthy(api_t)
assert.same(use_case[3], api_t.api)
end)

it("does not superseds another API with a longer URI prefix", function()
it("does not superseds another API with a longer [uri]", function()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to change, but should be "supersedes"

Copy link
Member Author

@thibaultcha thibaultcha Jul 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it "does not supersede" btw?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@p0pr0ck5
Copy link
Contributor

p0pr0ck5 commented Jul 10, 2017

Found another issue: if the URI is not ending with a regex (for example \d+) the API is not found:

"uris": [ "/test/\d", "/test2/\d /resource" ]

curl http://localhost:8000/test2/123/resource
{"message":"no API found with those values"}

@RolphH can you check your regex? you defined /test2/\d /resource, perhaps you mean /test2/\d+/resource?

@p0pr0ck5
Copy link
Contributor

p0pr0ck5 commented Jul 10, 2017

So far it seems to be working as expected. The output seems to be somewhat strange when creating an API (and getting the API). See the \d / in response below:

{ ..., "uris": [ "/v1/test/\d /testresource", "/v1/test/\d /testresource2", "/v1/test2/\d /" ] }

This is just a result of the data being JSON encoded (backslashes are escaped). This is expected behavior :)

update I was under the impression you were referring to the double backslash, not the missing +. see below.

@RolphH
Copy link

RolphH commented Jul 10, 2017

@p0pr0ck5 cheked, and I added with \d+.

curl -i -X POST --url http://localhost:8001/apis/ --data 'name=testl-api' --data 'uris=/test/\d+/resource' --data 'upstream_url=http://httpbin.org'
HTTP/1.1 201 Created
Date: Mon, 10 Jul 2017 19:45:28 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
Server: kong/0.10.3

{"http_if_terminated":false,"id":"aeb5670c-2c7b-473b-b878-0a601c1df86a","retries":5,"preserve_host":false,"created_at":1499715928000,"upstream_connect_timeout":60000,"upstream_url":"http://httpbin.org","upstream_read_timeout":60000,"https_only":false,"upstream_send_timeout":60000,"strip_uri":true,"name":"testl-api","uris":["/test/\d /resource"]}

@thibaultcha
Copy link
Member Author

Hi @RolphH,

You will need to url encode your uris parameter like so if you are sending a body with application/x-www-form-urlencoded, as is the default with curl:

curl -i -X POST --url http://localhost:8001/apis/ --data 'name=testl-api' --data-urlencode 'uris=/test/\d+/resource' --data 'upstream_url=http://httpbin.org'

Either that, or send your body with application/json instead.

@thibaultcha thibaultcha force-pushed the feat/uris-regex-support branch 2 times, most recently from 69c7ce0 to ba52268 Compare July 11, 2017 01:20
@thibaultcha
Copy link
Member Author

Updated with the provided feedback and a couple of other improvements.

@RolphH, it would be of great help if you could try your use-case again (with the proper encoding or MIME type), and let us know your thoughts and whether your use-case is covered. Thanks!

@RolphH
Copy link

RolphH commented Jul 11, 2017

@thibaultcha a quick check shows that it is working as expected. Going to do more tests tomorrow.

@p0pr0ck5 p0pr0ck5 dismissed their stale review July 11, 2017 15:52

updated

@@ -62,18 +72,25 @@ local match_api
local reduce


local function has_capturing_groups(subj)
local s = find(subj, "[^\\]%(.-[^\\]%)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird whitespacing alignment here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intentional

@@ -434,7 +474,31 @@ describe("Router", function()
assert.same(use_case[2], api_t.api)
end)

it("[method] does not supersede non-plain [uri]", function()
it("half [uri regex] amd [method] match does not supersede another API", function()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amd -> and?

["()/world"] = true,
["(/hello)/world"] = true,

["/users/\\(foo\\)"] = false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd whitespacing alignment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intentional

@thibaultcha thibaultcha force-pushed the feat/uris-regex-support branch from ba52268 to fd0b11d Compare July 11, 2017 21:40
@RolphH
Copy link

RolphH commented Jul 14, 2017

@thibaultcha I have tested the feature and with the improvements it works like a charm for our use case. With this, we are able to start using Kong in our environment once 0.11 is released (we will start with this change in our dev environment).
Thank you for this feature.

insert(api_t.hosts, {
wildcard = true,
value = host_value,
regex = wildcard_host_regex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pedantic, missing trailing comma

if re_match(uri, [[^[a-zA-Z0-9\.\-_~/%]*$]]) then
-- plain URI or URI prefix
local escaped_uri = [[\Q]] .. uri .. [[\E]]
local strip_regex = escaped_uri .. [[/?(?P<stripped_uri>.*)]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the P isnt necessary; i wonder if removing it will help keep things cleaner:

$ cat t.lua 
local m, err = ngx.re.match("hello, world", "[a-z](?<name>[a-z]+)", "ajo")
print(m[0])
print(m[1])
print(m.name)
$ resty ./t.lua 
hello
ello
ello

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it isn't, and have no strong feelings about either syntaxes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would say lets get rid of it, in favor of being concise, and we're not python/php ;)

}
else
-- regex URI
local strip_regex = uri .. [[/?(?P<stripped_uri>.*)]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem re: P

return
end
if regex_t.has_captures then
m[0] = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering a bit about this. is there logic around removing this group? are there no cases where we would not want to expose the whole match further on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would be, but only if we also expose a match from a plain/prefix URI as well, otherwise the behavior is inconsistent between the two possibilities. Since we do not (for now) expose it on a plain match, we do not expose it on a regexp match either.

@p0pr0ck5
Copy link
Contributor

LGTM besides 2 pending comments

@thibaultcha thibaultcha force-pushed the feat/uris-regex-support branch 7 times, most recently from b966ab2 to 678c84a Compare July 19, 2017 21:54
@thibaultcha thibaultcha changed the base branch from master to next July 19, 2017 21:54
When matching URIs, we iterate over a list of APIs, and not a list of
URIs (for various, performance reasons, although a refactor of the
router is scheduled with some performance improvements later on).
This list of APIs is sorted per URI length. But an API can have more
than one URI. As such, the router can potentially evaluate an API with a
very long URI, and a much shorter one that matches the request URI,
instead of a following API which would be an exact match for that API.

To fix this, we sort our deserialized list of URIs to iterate on in the
select method, and if we do have a match, we conserve the URI that was
matched, to check if it belongs to the right API at API match time.

Fix #2662
When a value from an API's `uris` property does not respect the reserved
character set from RFC 3986, the router now assumes this value is a
user-specified regex.

User regexes are stored in another index than what we consider "URI
prefixes" (non-regex `uris` values), because while the later needs to be
evaluated based on their length, the same does not apply for the former.
The order in which regexes are specified (and ultimately, APIs are
registered) will be the order in which they will be evaluated.

Because users might specify their own capturing groups once we support
parameters extraction, the URI stripping now uses named capturing
groups. This requires PCRE 7.2+. The original implementation used
numbered capturing groups, but this is not aligned with our long-term
goal of allowing dynamic URI rewriting as part of our Plugins API or
request-transformer plugin.

Breaking changes: requires PCRE 7.2+

Implements: #677 (partially)
APIs can now receive user-specified regexes in their `uris` property.
Those regexes will be treated as such by the router.

A URI is considered a regex when its character set does not respect the
reserved list from RFC 3986.

Implements: #677 (partially)
* remove some duplicated tests
* clarify the titles of a few tests to better understand their scope at
first sight
We now perform efficient matching for URIs based on its property and the
regex being evaluated. We favorise running `ngx.re.find` when the more
expensive `ngx.re.match` is not necessary.

When a regex URI has capture groups, we return them from the router
`exec()` for further processing inside the core, which will be
responsible for passing them along to plugins.

This also has the advantage of being more performant on router cache
hits, since URI stripping is done only once, at matching type, and
cached in the LRU structure. The same way, groups matching is cached as
well.

Implements: #677 (partially)
We now return any value that matched back to the core so that plugins
can make use of it.

This also improves the router performance by moving the URI
stripping/trailing slash/preserve host header to be properly cached by
the router.

We also simplify the return values of the router by providing a table
instead of yet one more return value.
* use string.find instead of re.find whenever possible when matching URI
prefixes
* use string.sub for prefix URI stripping instead of re.match
* use string.find/string.sub for Host port stripping
@thibaultcha thibaultcha force-pushed the feat/uris-regex-support branch from 678c84a to 1dec37d Compare July 20, 2017 00:58
@thibaultcha thibaultcha merged commit c2abc93 into next Jul 20, 2017
@thibaultcha thibaultcha deleted the feat/uris-regex-support branch July 20, 2017 01:51
thibaultcha added a commit to Kong/docs.konghq.com that referenced this pull request Aug 11, 2017
thibaultcha added a commit to Kong/docs.konghq.com that referenced this pull request Aug 16, 2017
@wandenberg
Copy link

Is there a way to use the captured values in the upsteam url? I tried with $1 like suggested in #677 but didn't work.
Also took a look at the code but couldn't identify if it is possible or not.
If not possible to use directly in the upstream url like a rewrite, can this be done using the rewrite phase of a plugin?
I really need a way to do a rewrite based on regexp. I also tried to set a rewrite directly in the Nginx configuration, but Kong always uses the original url, no matter if I use last or break in the rewrite. (The idea was to rewrite the url and configure only the new one in Kong. Any help will be appreciated.

@thibaultcha
Copy link
Member Author

@wandenberg The Mashape Enterprise subscription has an updated version of the request-transformer plugin that allows for substituting such captured groups directly in the upstream_url value, like so: config.replace.uri=/users/$(uri_captures.user2)/foo. See our Enterprise Subscription page for more information.

Otherwise, those values are exposed in the Lua land via the ngx.ctx.router_matches value for plugins consumption.

@wandenberg
Copy link

Thanks for the quick response @thibaultcha I will take a look on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants