feat(router) support for regex URIs #2681

thibaultcha · 2017-07-07T21:30:27Z

Summary

APIs now accept regex values in their uris property. Any such value that does not respect the reserved character set of RFC 3986 is currently considered a regex.

URI regexes can be useful for use-cases like:

/users/\d+/profile
/version/\d+/users/\S+

APIs with strip_uri = true will correctly strip off the entire matching regex from the request URI.

We also support named (or numbered) groups captures:

/users/(?P<user_id>\d+)/profile
/version/(?P<version>\d+)/users/(?P<user>\S+)

Those URIs, when matched, will make the router return a new table with both a hash and an array part, to be used at the user's convenience:

{ "1", "john", version = "1", user = "john" }

This value is not used by the core as of yet, and will be in a later contribution, and will get attached to the request's ngx.ctx structure for later usage in the plugins, at the user's convenience.

Full changelog

⚠️ Breaking change: Kong now requires OpenResty to be compiled with PCRE 7.2+
🎆 APIs now accept uris values with characters outside of the reserved set (only if they are considered valid regexes as per the ngx_lua embedded PCRE interface)
🎆 The router now evaluates regex URIs
🎆 Improve the performance and maintainability of the router
🎆 The router captures matching groups from a regex URI if any
Provide related tests for API schema, router and proxy integration
Slight cleanup of the test suite

Issues resolved

Implement #677

p0pr0ck5 · 2017-07-07T21:36:51Z

kong/core/router.lua

+          -- regex URI
+          local strip_regex = uri .. [[/?(?P<stripped_uri>.*)]]
+
+          local from, _, err = re_find(uri, [[\(.*\)]], "jo")


this expression needs some more fleshing out. for example, this matches the literal string

$foo$

which is not a capture group. additionally, the * quantifier results in greedy + backtracking behavior; we can use a more performant negative char class ([^)]+).

as an alternative, consider the following expression (untested, but passing for this scenario):

(?<!\\)$[^)]+(?<!\$\)

for readability, we may want to consider using char groups instead of escaping \, as well as the x opt.

Good point, I implemented this towards the end and totally did not give it proper attention.

We should even consider using Lua patterns for this instead, as the PCRE is overkill for such a use-cases.

Here is a solution with patterns that I think covers those scenarios:

local uris = { ["/users/\$foo\$"] = false, ["/users/\$foo)"] = false, ["/users/(foo\$"] = false, ["/users/(foo"] = false, ["/users/(foo)"] = true, ["/users/(hello(foo)world)"] = true, ["/users/(hello(foo)world"] = true, } for uri, expected_to_match in pairs(uris) do local s = string.match(uri, "[^\\]%(.-[^\\]%)") print(uri, ": ", s) if expected_to_match then assert(s ~= nil, uri .. " was supposed to match") else assert(s == nil, uri .. " was not supposed to match") end end

my only comment here is that we dont have any test uris in this harness that look like:

/users/(foo)/thing/(bar)

where we have more than one capture group. i dont think there is any different here, but if only for completeness it's worth noting

Added it and other test cases for completeness. It gets tricky for capture groups at the start of the evaluated string, and unbalanced captured groups as well:

local uris = { ["/users/\$foo\$"] = false, ["/users/\$\$"] = false, ["/users/(foo)"] = true, ["/users/()"] = true, ["/users/()/foo"] = true, ["/users/(hello(foo)world)"] = true, ["/users/(hello(foo)world"] = true, ["/users/(foo)/thing/(bar)"] = true, ["/users/\$foo\$/thing/(bar)"] = true, -- 0-indexed capture groups ["()/world"] = true, ["(/hello)/world"] = true, -- unbalanced capture groups ["(/hello\\)/world"] = false, ["/users/(foo"] = false, ["/users/\$foo)"] = false, ["/users/(foo\$"] = false, } for uri, expected_to_match in pairs(uris) do local s = string.match(uri, "[^\\]%(.-[^\\]%)") s = s or string.match(uri, "^%(.-[^\\]%)") s = s or string.match(uri, "%(%)") print(uri, ": ", s) if expected_to_match then assert(s ~= nil, uri .. " was supposed to match") else assert(s == nil, uri .. " was not supposed to match") end end

This looks fairly complete to me now.

p0pr0ck5 · 2017-07-07T21:37:24Z

kong/core/router.lua

-        end
+              if regex_t.has_captures then
+                m[0] = nil
+                ctx.captures = m


would like to see a more meaningful name for this array

I forgot to do so but wanted to rename it uri_captures before submission, as per the return value from the router.

thibaultcha · 2017-07-07T21:56:18Z

kong/core/router.lua

+  ok, clear_tab = pcall(require, "table.clear")
+  if not ok then
+    clear_tab = function(tab)
+      for k, _ in pairs(tab) do


We could remove this useless _ variable here.

bungle · 2017-07-08T04:57:27Z

kong/dao/schemas/apis.lua

+    -- URI contains characters outside of the reserved list of
+    -- RFC 3986: the value will be interpreted as a regex;
+    -- but is it a valid one?
+    local _, _, err = ngx.re.find("", uri)


Should we use the same flags as in router here?

Maybe "aj"? Although to the best of my knowledge, I don't see why they would result in an error and not the default flags. I'm not feeling strongly about it, but if we decide to, then the cors plugin's schema, which is using the same trick, should maybe do so as well. I don't think we should use o though.

Right now it is probably not going to matter, but there are flags, that I think may change how regexes are interpreted (e.g. x). I was just thinking it would be good if we follow the same arguments here. Mostly because there is no harm adding them.

bungle · 2017-07-08T05:09:47Z

Do we need to consider this:
https://tools.ietf.org/html/rfc3987

RolphH · 2017-07-08T15:42:14Z

Looks good and going to test this next week.

It seems to be most useful with strip_uri = false, as in most cases the upstream most probably would need to have the matching regex parts as well.

A nice feature would to be able to specify which part of the uri should be stripped; complete uri, or up to the first match. In that case it would be possible to strip only a part of the uri and have the matching parts forwarded to your upstream.

RolphH · 2017-07-10T13:56:39Z

So far it seems to be working as expected. The output seems to be somewhat strange when creating an API (and getting the API). See the \\d / in response below:

{ ..., "uris": [ "/v1/test/\\d /testresource", "/v1/test/\\d /testresource2", "/v1/test2/\\d /" ] }

This was just created with a regex like /v1/test/\d+/testresource.

Secondly, it seems not possible to start your uri with a ^. Is this used by default or not needed? Currently most of our locations start with a ^.

Same goes for ending with $ (for example /v1/test/\d+$ or (bit more complex) /v1/test\d+(/|$))

RolphH · 2017-07-10T14:18:29Z

Found another issue: if the URI is not ending with a regex (for example \d+) the API is not found:

"uris": [ "/test/\\d", "/test2/\\d /resource" ]

curl http://localhost:8000/test2/123/resource
{"message":"no API found with those values"}

bungle · 2017-07-10T14:41:10Z

@RolphH, Using ^ is possible but not needed as the regexes are compiled in anchored mode (so they are always anchored). I'm not sure if we want to have flexibility in that, and does that flexibility cause problems with url stripping.

p0pr0ck5 · 2017-07-09T02:44:43Z

spec/01-unit/11-router_spec.lua

@@ -480,6 +544,25 @@ describe("Router", function()
        assert.same(use_case[2], api_t.api)
      end)

+      it("does not supersedes another API with a longer [uri prefix]", function()


"does not supersede"

p0pr0ck5 · 2017-07-09T02:51:30Z

spec/02-integration/05-proxy/01-router_spec.lua

@@ -77,9 +77,15 @@ describe("Router", function()
        {
          name = "api-4",
          upstream_url = "http://httpbin.org/basic-auth",
-          uris = { "/user" },
+          uris = { "/private" },


why the reason for this change?

/user conflicts with the new "[uri] with a regex" test below which requests /users/foo/profile. And since /private is a better alias for the upstream_uri associated with this test (/basic-auth), story checks out.

p0pr0ck5 · 2017-07-09T02:54:21Z

spec/01-unit/11-router_spec.lua

+        assert.same(use_case[1], api_t.api)
+      end)
+
+      it("matches the right API when several ones have a [uri regex]", function()


just curious, what happens when use_case looks like:

local use_case = { { name = "api-1", uris = { [[/api/persons/\d{2}]] }, }, { name = "api-2", uris = { [[/api/persons/\d{3}]] }, }, }

?

This is the use case internally debated where the length of the regex URI does not have any meaning in terms of matching prioritization. In this case, because /api/persons/\d{2} was defined first, it will be evaluated first.

I agree this is the proper evaluation of this, just wanted to clarify. tyty

p0pr0ck5 · 2017-07-09T02:55:29Z

spec/01-unit/11-router_spec.lua

      it("matches when given [uri] is in request URI prefix", function()
        -- uri prefix
        local api_t = router.select("GET", "/my-api/some/path")
        assert.truthy(api_t)
        assert.same(use_case[3], api_t.api)
      end)

-      it("does not superseds another API with a longer URI prefix", function()
+      it("does not superseds another API with a longer [uri]", function()


unrelated to change, but should be "supersedes"

Isn't it "does not supersede" btw?

p0pr0ck5 · 2017-07-10T19:25:26Z

Found another issue: if the URI is not ending with a regex (for example \d+) the API is not found:

"uris": [ "/test/\d", "/test2/\d /resource" ]

curl http://localhost:8000/test2/123/resource
{"message":"no API found with those values"}

@RolphH can you check your regex? you defined /test2/\d /resource, perhaps you mean /test2/\d+/resource?

p0pr0ck5 · 2017-07-10T19:26:29Z

So far it seems to be working as expected. The output seems to be somewhat strange when creating an API (and getting the API). See the \d / in response below:

{ ..., "uris": [ "/v1/test/\d /testresource", "/v1/test/\d /testresource2", "/v1/test2/\d /" ] }

This is just a result of the data being JSON encoded (backslashes are escaped). This is expected behavior :)

update I was under the impression you were referring to the double backslash, not the missing +. see below.

RolphH · 2017-07-10T19:48:23Z

@p0pr0ck5 cheked, and I added with \d+.

curl -i -X POST --url http://localhost:8001/apis/ --data 'name=testl-api' --data 'uris=/test/\d+/resource' --data 'upstream_url=http://httpbin.org'
HTTP/1.1 201 Created
Date: Mon, 10 Jul 2017 19:45:28 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
Server: kong/0.10.3

{"http_if_terminated":false,"id":"aeb5670c-2c7b-473b-b878-0a601c1df86a","retries":5,"preserve_host":false,"created_at":1499715928000,"upstream_connect_timeout":60000,"upstream_url":"http://httpbin.org","upstream_read_timeout":60000,"https_only":false,"upstream_send_timeout":60000,"strip_uri":true,"name":"testl-api","uris":["/test/\d /resource"]}

thibaultcha · 2017-07-10T20:35:10Z

Hi @RolphH,

You will need to url encode your uris parameter like so if you are sending a body with application/x-www-form-urlencoded, as is the default with curl:

curl -i -X POST --url http://localhost:8001/apis/ --data 'name=testl-api' --data-urlencode 'uris=/test/\d+/resource' --data 'upstream_url=http://httpbin.org'

Either that, or send your body with application/json instead.

thibaultcha · 2017-07-11T01:21:41Z

Updated with the provided feedback and a couple of other improvements.

@RolphH, it would be of great help if you could try your use-case again (with the proper encoding or MIME type), and let us know your thoughts and whether your use-case is covered. Thanks!

RolphH · 2017-07-11T06:42:32Z

@thibaultcha a quick check shows that it is working as expected. Going to do more tests tomorrow.

updated

p0pr0ck5 · 2017-07-11T15:52:48Z

kong/core/router.lua

@@ -62,18 +72,25 @@ local match_api
 local reduce


+local function has_capturing_groups(subj)
+  local s =      find(subj, "[^\\]%(.-[^\\]%)")


weird whitespacing alignment here

intentional

p0pr0ck5 · 2017-07-11T16:01:30Z

spec/01-unit/11-router_spec.lua

@@ -434,7 +474,31 @@ describe("Router", function()
        assert.same(use_case[2], api_t.api)
      end)

-      it("[method] does not supersede non-plain [uri]", function()
+      it("half [uri regex] amd [method] match does not supersede another API", function()


amd -> and?

p0pr0ck5 · 2017-07-11T16:04:44Z

spec/01-unit/11-router_spec.lua

+        ["()/world"]                     = true,
+        ["(/hello)/world"]               = true,
+
+        ["/users/\\(foo\\)"] = false,


odd whitespacing alignment

intentional

RolphH · 2017-07-14T06:48:27Z

@thibaultcha I have tested the feature and with the improvements it works like a charm for our use case. With this, we are able to start using Kong in our environment once 0.11 is released (we will start with this change in our dev environment).
Thank you for this feature.

p0pr0ck5 · 2017-07-14T17:31:20Z

kong/core/router.lua

+          insert(api_t.hosts, {
+            wildcard = true,
+            value    = host_value,
+            regex    = wildcard_host_regex


pedantic, missing trailing comma

p0pr0ck5 · 2017-07-14T17:35:01Z

kong/core/router.lua

+        if re_match(uri, [[^[a-zA-Z0-9\.\-_~/%]*$]]) then
+          -- plain URI or URI prefix
+          local escaped_uri = [[\Q]] .. uri .. [[\E]]
+          local strip_regex = escaped_uri .. [[/?(?P<stripped_uri>.*)]]


the P isnt necessary; i wonder if removing it will help keep things cleaner:

$ cat t.lua local m, err = ngx.re.match("hello, world", "[a-z](?<name>[a-z]+)", "ajo") print(m[0]) print(m[1]) print(m.name) $ resty ./t.lua hello ello ello

I know it isn't, and have no strong feelings about either syntaxes.

i would say lets get rid of it, in favor of being concise, and we're not python/php ;)

p0pr0ck5 · 2017-07-14T17:35:11Z

kong/core/router.lua

-        }
+        else
+          -- regex URI
+          local strip_regex  = uri .. [[/?(?P<stripped_uri>.*)]]


idem re: P

p0pr0ck5 · 2017-07-14T17:41:09Z

kong/core/router.lua

-          return
-        end
+              if regex_t.has_captures then
+                m[0] = nil


wondering a bit about this. is there logic around removing this group? are there no cases where we would not want to expose the whole match further on?

There would be, but only if we also expose a match from a plain/prefix URI as well, otherwise the behavior is inconsistent between the two possibilities. Since we do not (for now) expose it on a plain match, we do not expose it on a regexp match either.

p0pr0ck5 · 2017-07-14T18:27:40Z

LGTM besides 2 pending comments

When matching URIs, we iterate over a list of APIs, and not a list of URIs (for various, performance reasons, although a refactor of the router is scheduled with some performance improvements later on). This list of APIs is sorted per URI length. But an API can have more than one URI. As such, the router can potentially evaluate an API with a very long URI, and a much shorter one that matches the request URI, instead of a following API which would be an exact match for that API. To fix this, we sort our deserialized list of URIs to iterate on in the select method, and if we do have a match, we conserve the URI that was matched, to check if it belongs to the right API at API match time. Fix #2662

When a value from an API's `uris` property does not respect the reserved character set from RFC 3986, the router now assumes this value is a user-specified regex. User regexes are stored in another index than what we consider "URI prefixes" (non-regex `uris` values), because while the later needs to be evaluated based on their length, the same does not apply for the former. The order in which regexes are specified (and ultimately, APIs are registered) will be the order in which they will be evaluated. Because users might specify their own capturing groups once we support parameters extraction, the URI stripping now uses named capturing groups. This requires PCRE 7.2+. The original implementation used numbered capturing groups, but this is not aligned with our long-term goal of allowing dynamic URI rewriting as part of our Plugins API or request-transformer plugin. Breaking changes: requires PCRE 7.2+ Implements: #677 (partially)

APIs can now receive user-specified regexes in their `uris` property. Those regexes will be treated as such by the router. A URI is considered a regex when its character set does not respect the reserved list from RFC 3986. Implements: #677 (partially)

* remove some duplicated tests * clarify the titles of a few tests to better understand their scope at first sight

We now perform efficient matching for URIs based on its property and the regex being evaluated. We favorise running `ngx.re.find` when the more expensive `ngx.re.match` is not necessary. When a regex URI has capture groups, we return them from the router `exec()` for further processing inside the core, which will be responsible for passing them along to plugins. This also has the advantage of being more performant on router cache hits, since URI stripping is done only once, at matching type, and cached in the LRU structure. The same way, groups matching is cached as well. Implements: #677 (partially)

We now return any value that matched back to the core so that plugins can make use of it. This also improves the router performance by moving the URI stripping/trailing slash/preserve host header to be properly cached by the router. We also simplify the return values of the router by providing a table instead of yet one more return value.

* use string.find instead of re.find whenever possible when matching URI prefixes * use string.sub for prefix URI stripping instead of re.match * use string.find/string.sub for Host port stripping

0.11.0 feature introduced in Kong/kong#2681

wandenberg · 2017-08-21T22:02:40Z

Is there a way to use the captured values in the upsteam url? I tried with $1 like suggested in #677 but didn't work.
Also took a look at the code but couldn't identify if it is possible or not.
If not possible to use directly in the upstream url like a rewrite, can this be done using the rewrite phase of a plugin?
I really need a way to do a rewrite based on regexp. I also tried to set a rewrite directly in the Nginx configuration, but Kong always uses the original url, no matter if I use last or break in the rewrite. (The idea was to rewrite the url and configure only the new one in Kong. Any help will be appreciated.

thibaultcha · 2017-08-21T22:17:01Z

@wandenberg The Mashape Enterprise subscription has an updated version of the request-transformer plugin that allows for substituting such captured groups directly in the upstream_url value, like so: config.replace.uri=/users/$(uri_captures.user2)/foo. See our Enterprise Subscription page for more information.

Otherwise, those values are exposed in the Lua land via the ngx.ctx.router_matches value for plugins consumption.

wandenberg · 2017-08-21T22:24:24Z

Thanks for the quick response @thibaultcha I will take a look on it.

thibaultcha mentioned this pull request Jul 7, 2017

allow for parameter mapping in path based routing #677

Closed

thibaultcha added the pr/status/please review label Jul 7, 2017

thibaultcha force-pushed the feat/uris-regex-support branch 2 times, most recently from 52e11ad to f9b4dbd Compare July 7, 2017 21:44

p0pr0ck5 previously requested changes Jul 7, 2017

View reviewed changes

thibaultcha commented Jul 7, 2017

View reviewed changes

thibaultcha force-pushed the feat/uris-regex-support branch from f9b4dbd to c9e3aad Compare July 8, 2017 02:22

thibaultcha added this to the 0.11.0rc2 milestone Jul 8, 2017

thibaultcha self-assigned this Jul 8, 2017

bungle reviewed Jul 8, 2017

View reviewed changes

p0pr0ck5 reviewed Jul 10, 2017

View reviewed changes

thibaultcha force-pushed the feat/uris-regex-support branch 2 times, most recently from 69c7ce0 to ba52268 Compare July 11, 2017 01:20

p0pr0ck5 reviewed Jul 11, 2017

View reviewed changes

thibaultcha force-pushed the feat/uris-regex-support branch from ba52268 to fd0b11d Compare July 11, 2017 21:40

p0pr0ck5 reviewed Jul 14, 2017

View reviewed changes

thibaultcha force-pushed the feat/uris-regex-support branch 7 times, most recently from b966ab2 to 678c84a Compare July 19, 2017 21:54

thibaultcha changed the base branch from master to next July 19, 2017 21:54

thibaultcha added 7 commits July 19, 2017 17:58

tests(router) cleanup/improve test suite

5597b7f

* remove some duplicated tests * clarify the titles of a few tests to better understand their scope at first sight

perf(router) various improvements

1dec37d

* use string.find instead of re.find whenever possible when matching URI prefixes * use string.sub for prefix URI stripping instead of re.match * use string.find/string.sub for Host port stripping

thibaultcha force-pushed the feat/uris-regex-support branch from 678c84a to 1dec37d Compare July 20, 2017 00:58

thibaultcha merged commit c2abc93 into next Jul 20, 2017

thibaultcha deleted the feat/uris-regex-support branch July 20, 2017 01:51

p0pr0ck5 mentioned this pull request Jul 21, 2017

/consumers/:username_or_id endpoints incorrectly uri decodes usernames #2721

Closed

thibaultcha added a commit to Kong/docs.konghq.com that referenced this pull request Aug 11, 2017

docs(proxy) add reference usage for regex URIs

37eabca

0.11.0 feature introduced in Kong/kong#2681

thibaultcha mentioned this pull request Aug 11, 2017

docs(proxy) add reference usage for regex URIs Kong/docs.konghq.com#483

Merged

thibaultcha added a commit to Kong/docs.konghq.com that referenced this pull request Aug 16, 2017

docs(proxy) add reference usage for regex URIs

a30aa4a

0.11.0 feature introduced in Kong/kong#2681

feat(router) support for regex URIs #2681

feat(router) support for regex URIs #2681

Conversation

thibaultcha commented Jul 7, 2017 • edited Loading

Summary

Full changelog

Issues resolved

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thibaultcha Jul 8, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bungle commented Jul 8, 2017

RolphH commented Jul 8, 2017

RolphH commented Jul 10, 2017 • edited Loading

RolphH commented Jul 10, 2017 • edited Loading

bungle commented Jul 10, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

p0pr0ck5 Jul 9, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thibaultcha Jul 10, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

p0pr0ck5 commented Jul 10, 2017 • edited Loading

p0pr0ck5 commented Jul 10, 2017 • edited Loading

RolphH commented Jul 10, 2017

thibaultcha commented Jul 10, 2017

thibaultcha commented Jul 11, 2017

RolphH commented Jul 11, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RolphH commented Jul 14, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

p0pr0ck5 commented Jul 14, 2017

wandenberg commented Aug 21, 2017

thibaultcha commented Aug 21, 2017

wandenberg commented Aug 21, 2017

thibaultcha commented Jul 7, 2017 •

edited

Loading

thibaultcha Jul 8, 2017 •

edited

Loading

RolphH commented Jul 10, 2017 •

edited

Loading

RolphH commented Jul 10, 2017 •

edited

Loading

p0pr0ck5 Jul 9, 2017 •

edited

Loading

thibaultcha Jul 10, 2017 •

edited

Loading

p0pr0ck5 commented Jul 10, 2017 •

edited

Loading

p0pr0ck5 commented Jul 10, 2017 •

edited

Loading