Kong 1.0rc2 migrations from 0.14.1 to 1.0rc2 issues #3924
Update: a follow-up migration run yields this:
I am starting to get a clearer picture as to why there may be the 3 "pending" executions, which get wrapped up after the `kong migrations finish` step.
Update: I cannot go any further with 1.0rc2 testing and finalizing my migrations; it seems custom plugins cannot run on 1.0rc2 with 0.14.1-compatible schemas:
https://github.com/Optum/kong-splunk-log/blob/master/src/schema.lua#L8 @thibaultcha hope this gives some insight. It seems a bit of the legacy schema is still not supported, like fields declared as "text" vs. the new format where y'all changed it to "string". Let me know if you get a patch in and can drop it into the docker-kong repo, then I can continue testing this all out 👍.
I think I realized another portion of the process that was wrong in my mind for this upgrade (vs. how we have always done it in the past). Right now an env consists of 2 running Kong pods. Migrations for us have consisted of a temporary OpenShift "job" that just spins up a temp new version of Kong, runs `kong migrations up` to upgrade the db, and then kills off. After running the migrations up job we bring up the existing 2 Kong nodes on the newer version of Kong. I think what deviates here for this upgrade is that after running `kong migrations up`, it's necessary to run `kong migrations finish` from that same 1.0rc2 Kong node (and ONLY then can we launch a 1.0rc2 node). So for now I will modify that job to specifically just run `kong migrations finish` to get around this in my dev environment. For future Kong migrations as we iterate versions, would it make sense to just chain the upgrades to do both commands (as sketched below) so I can keep the whole migrations process as a separate "job"? Would this be safe from iteration to iteration of Kong 1.0+ and beyond?
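For context, a minimal sketch of what the chained job could look like (the two commands are real; the surrounding OpenShift job scaffolding is omitted, and the ordering reflects my understanding of the new process):

```sh
# one-off migration job, run from a container with the NEW Kong version

# step 1: apply the non-breaking migrations;
# existing 0.14.1 nodes keep serving traffic
kong migrations up

# ...replace the old nodes with new-version nodes here...

# step 2: finalize the breaking migrations once no old nodes remain
kong migrations finish
```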
Okay, after modifying my migrations job just to execute the `kong migrations finish` command:
Seems the ACL UUIDs are not playing nice with the migrations code.
@jeremyjpj0916 Hi, thanks for the feedback. The migrations are intentionally split in two steps (pending/executed). Not all migrations have two steps, only the ones that would be breaking for the underlying schema when 2 versions of Kong depend on it. You should not run `kong migrations finish` until all of your nodes are running the new version. Pinging @hishamhm for the schema conversion issue (we are aware that the legacy schema support is incomplete, as pointed out on Kong Nation a few days ago), and I will submit a patch fixing two things:
Thanks again! Do you have any further questions? Are you considering running 1.0rc2 without the custom plugins that cannot run on it, maybe? Or we could also give you pointers to update your custom plugins' schemas for the short term as well.
(The point of the 2-step migrations process really is that in between the `up` and `finish` steps, nodes running the old and new versions of Kong can both operate against the same database.)
Wanted to drop in what some of the data in the acls table looks like per record in our dev db, if that helps (our ACL groups are route IDs, for programmatically resolving which Routes -> Services consumers have access to, since we apply auth plugins on the route resources):
Okay, so the fact my custom plugin caused errors against a 1.0rc2 node prior to the migrations finish was not intended by Kong; glad to get that clarification, as I was thinking maybe the custom plugin incompatibility was due to my lack of running the second migrations command. But at this stage I did run the migrations finish as a job, which did primarily push us to 1.0rc2 compatibility minus the ACL bug (with 0.14 nodes still up and running, so they are probably wacky, but hey, it's dev so not a big deal). So maybe what I do now is just roll back the db to a day ago and give this a go at a later date, when custom plugins with legacy schemas can be supported on 1.0rc2 and the ACLs Cassandra migration command is fixed up. Ideally I would like the 0.14.1 to 1.0 GA upgrade to go smoothly with the minimal steps possible. I am hoping it could go something like:
Sounds like this may not be set in stone, since it's not clear yet whether 1.0 will still allow both older plugin schemas and the new format, or force a breaking change on all custom plugins by only allowing the new format. I would prefer 1.0 supporting both, so we can separate the Kong upgrade from the plugin refactors and not do everything all at once. Another option would be to run all environments and prod on 1.0rc2, if it will have support for old and new schemas, and then work on changing up the plugins before doing another upgrade from 1.0rc2 to 1.0 (which at that point probably would not even require further db changes, just pulling down the latest Kong 1.0 from Bintray). Would feel weird running an rc in prod though, so I prefer not to go that route.
Not really looking to run Kong without the plugins, as our plugins are what enable us to validate Kong is behaving as expected and provide all our important logging; even in dev we run all our plugins so it stays in line with other environments. I think I know what needs to change in our custom plugins when the time comes, but separation of tasks was the ideal (upgrade Kong | change 8 or so plugins). If that's not possible, I may refactor how we pull from LuaRocks to actually use pinned versions in our deploys, because right now I just pull the latest, which you can see in situations like this would not work. Say I update all our plugins now; well, prod runs on that LuaRocks latest, so the next build would break prod 😄. Really my fault more than anything, but it's kind of nice to minimize the edits on our build scripts and not worry about versioning with plugins when pulling from LuaRocks (for plugins we manage; obviously an outsider's LuaRocks plugin is subject to change at any time).
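For the record, the pinning change would be tiny; something like this in the build script (the version number is made up for illustration):

```sh
# today: always pulls the newest rock, so publishing a 1.0-only
# refactor would break the next 0.14.1 prod build
luarocks install kong-splunk-log

# pinned: each environment builds against a known plugin version
luarocks install kong-splunk-log 0.1.0-1
```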
@jeremyjpj0916 Regarding the plugin error at https://github.com/Optum/kong-splunk-log/blob/master/src/schema.lua#L8 the correct name for that type is "string" (even in 0.x schemas), but if that is working for you then it seems that "text" is being silently ignored by the 0.x schema validator? (It shouldn't be the case from my reading of kong/dao/schemas_validation.lua, though, so something weird is at play.) Changing that type to "string" in your plugin should make it work on both 0.x and 1.x. If 0.x is indeed treating unknown types as "string", I guess we could replicate that behavior in our auto-converter to improve backwards compatibility.
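For illustration, the change is a one-word edit in the 0.x-style schema (the field name below is a stand-in, not the actual line from the plugin):

```lua
-- 0.x-style plugin schema: "text" is not a valid type name and only
-- worked because it was silently ignored; "string" is valid on both
-- 0.x and 1.x
return {
  fields = {
    splunk_endpoint = { type = "string", required = true }, -- was: type = "text"
  },
}
```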
@hishamhm that's funny to learn it's wrong for 0.x too 😆, but yeah, it does run and work fine on 0.14.x. Let me see if changing to "string" is okay on the existing 0.14.1 version myself; if so, then I have no problem making that change now (which might fix the schema error I mentioned earlier). I also don't think you necessarily need to replicate that behavior, but other dummies like myself may end up posting similar errors if they leverage a plugin with a wrong schema type 😺. @thibaultcha if I were so daring as to continue on 1.0rc2 with hishamhm's suggestion, is there a cqlsh query I could just run to fix the acl table as needed (if there even are further changes) while you work on the migration query fix? Will come back with the results of fixing my plugin's schema, though (and after reviewing any other plugins).
UPDATE: Well, fixing from "text" to "string" did get us launched for that custom plugin. New error though, around an instance of the existing Kong correlation-id plugin enabled during runtime:
Where optum-cid-ext is the specific header we set the correlation id to. DB data:
Decided to glance at your source too:
Somehow that header value is ending up nil.

Update with further thoughts I have had on possible culprits: the specific route throwing the errors happens to be our healthcheck endpoint (yay, our dev cluster is officially dead to our load balancers 🤣). It also has the request-termination plugin enabled, set to return HTTP status 200 with a Kong message body of "Success". On 0.14.x, correlation-id (which I have as a global plugin, btw) does not seem to be executing on this route with request-termination; on 1.0 it certainly is, hence the nil pointer error logs and the failing proxy call. I have also tested other proxy URLs through the Kong node terminal using localhost and they work, same for the admin API, so it seems it's JUST the request-termination plugin sharing a route with the correlation-id plugin that causes this break (correlation-id, global or local to the route, with request-termination would probably reproduce this in y'all's test env).

Edit again - YES, it was the fact I had request-termination enabled on this route. Disabling that plugin made the route work again (and subsequently I can see the correlation id in the headers on the response). So, Kong squad gurus, something has changed in behavior for when correlation-id and request-termination share a route, and it produces a nil pointer on an expected header value for the correlation-id plugin during the response header_filter phase. I will leave y'all to it; it probably has to do with clashing priority, or one plugin overwriting the local ctx of the other plugin somehow?? I am not sure exactly.

Edit - Seems the request-termination plugin's config changed during migration or something... I honestly dunno what happened. I may try from scratch and see what happens again:
Vs. a 0.14.x Stage:
How dev got to code 503 and all those nulls, I am not sure... I also tried to patch the plugin back to a working state, and that didn't seem to work either:
Edit (I am dumb, fixed my curl and the patch worked):
For now I will write off the 503 and nulls as a failed patch attempt earlier at one point that messed it up and caused some 503 default; probably not the migrations itself that did that.

Last edit - Okay, even with the termination plugin set to 200 "Success" again in dev, the correlation-id plugin is still throwing those nil pointer errors and the healthcheck route is failing... So even if migrations did do something weird to that request-termination plugin, in a proper config state it and correlation-id are definitely still clashing.

Last last edit - Also, disabling the global correlation-id plugin while leaving the request-termination plugin got my healthcheck endpoint working again as well. So yeah, these plugins are definitely not playing nice with each other in 1.0.

TLDR - Wow, I realize I wrote a book: the correlation-id plugin (enabled globally too, not that it probably matters) and the request-termination plugin sharing a route breaks the correlation-id plugin, and subsequently the proxy route itself fails to respond successfully.
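For anyone trying to reproduce, the setup boils down to something like this against the Admin API (the route id and path are placeholders for our config):

```sh
# correlation-id enabled globally
curl -s http://localhost:8001/plugins \
  -d name=correlation-id \
  -d config.header_name=optum-cid-ext \
  -d config.echo_downstream=true

# request-termination on the healthcheck route
curl -s http://localhost:8001/routes/<route-id>/plugins \
  -d name=request-termination \
  -d config.status_code=200 \
  -d config.message=Success

# on 1.0rc2 this then fails with the nil header error instead of
# returning the terminated 200 response
curl -i http://localhost:8000/<healthcheck-path>
```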
I thought about why they clash in 1.0 while looking at the code (I have not compared 0.14.1's logic to see what changed and why this works there): we see correlation-id executes with priority 1, with its access phase setting the header value. Then somehow we still end up executing correlation-id's header_filter phase, which we all know happens after the proxy call to the backend (the backend in our case being the localhost:8001/status admin endpoint; by now request-termination has left the correlation-id plugin's ctx header value nil for this phase). And the client receives an error, not what was set with request-termination in the first place. Strange stuff. A dirty fix for this in correlation-id for v1.0 would be something like the sketch below.
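A minimal sketch of that guard, assuming the 1.0 plugin stashes the generated ID for header_filter via `kong.ctx.plugin` (the exact field name in the real plugin may differ):

```lua
-- correlation-id header_filter: only echo the header downstream if
-- the access phase actually ran and produced an ID
function CorrelationIdHandler:header_filter(conf)
  local correlation_id = kong.ctx.plugin.correlation_id
  if conf.echo_downstream and correlation_id ~= nil then
    kong.response.set_header(conf.header_name, correlation_id)
  end
end
```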
But that does not solve the root cause of why the header_filter phase of correlation-id would execute at all, since request-termination's access phase should have been the end of Kong's plugin execution. Other fixes (may be dirty as well) would be swapping the plugins' priorities of 1 and 2. Do proxies that terminate with static content REALLY need a correlation id (which is generally used for tx-level debugging of calls that actually route to backends)? I personally was fine with it not returning the correlation id in the response header for this specific ping endpoint with request-termination, which is the behavior of 0.14.1.
The `id` column is a UUID, but given as a string, producing the error:

[Cassandra error] failed to run migration '001_14_to_15' teardown: ...are/lua/5.1/kong/plugins/acl/migrations/001_14_to_15.lua:81: [Invalid] Invalid STRING constant (c8b871b7-7a18-45a9-9cd3-aa0a78c6073d) for "id" of type uuid

See #3924. From #3953
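The quoting is the whole problem: in CQL a `uuid` column takes an unquoted literal. Roughly (the actual statement built by the teardown may differ in its other columns):

```sql
-- fails: the uuid is rendered as a quoted string constant
UPDATE acls SET cache_key = '...' WHERE id = 'c8b871b7-7a18-45a9-9cd3-aa0a78c6073d';

-- works: uuid literals in CQL are unquoted
UPDATE acls SET cache_key = '...' WHERE id = c8b871b7-7a18-45a9-9cd3-aa0a78c6073d;
```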
Hi @jeremyjpj0916, Some news:
It is actually intended behavior that the header_filter phase of plugins still runs when another plugin (like request-termination) short-circuits the request from its access phase.
Since `kong.response.set_header()` rejects a `nil` value, conditionally clearing a header takes the first form below rather than the second:

```lua
-- cumbersome
if value ~= nil then
  kong.response.set_header("foo", value)
else
  kong.response.clear_header("foo")
end

-- leaner
kong.response.set_header("foo", value)
```

That said, the strictness also surfaces cases where a value is unexpectedly `nil`, instead of silently clearing the header. In short, we need a few action items:
As always, thank you for all the details and collaboration!
@thibaultcha and thank you for digging deep and taking the time to investigate and address accordingly; the analysis on the correlation-id and request-termination plugin clash seems spot on to me now. All this effort from yourself and team + community testing will yield a great 1.0 release. Hopefully most of the plugin conversions to the PDK will not yield clashing behaviors, and what I stumbled into was more of a strange situational case. Exhaustive combinations of Kong plugins on top of services and routes would be pretty tiring to make, but maybe someone internal to Kong could make it its own big test suite 😄 (at the end of the day it would probably catch a lot of future PDK changes that could break Kong core plugins in ways like I saw). My takeaways for next steps will be this:
@jeremyjpj0916 Oh, an item I left out from my previous comment on updates to this thread:
Maybe this logic for the PDK's checks.lua, to allow/account for a nil value in a header set?
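Something in this direction, purely as a sketch (the function and error wording below are illustrative, not the actual contents of checks.lua):

```lua
-- sketch: let a nil value through header validation so that
-- set_header(name, nil) can behave as clear_header(name)
local function validate_header(name, value)
  if type(name) ~= "string" then
    error("invalid header name: must be a string", 3)
  end

  if value == nil then
    return -- caller treats nil as "clear this header"
  end

  local tvalue = type(value)
  if tvalue ~= "string" and tvalue ~= "number" and tvalue ~= "boolean" then
    error("invalid header value: must be a string, number or boolean", 3)
  end
end
```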
I agree that the current behavior can be cumbersome.
When a plugin author forgets to test the `nil` case, the strict behavior at least fails loudly instead of silently clearing a header they meant to set. As with everything, it's a balance between safety and convenience. I'd prefer if we had an explicit way to state that the desired behavior is "set-or-clear". So, we'd have either a very explicit helper for that, or we keep the strict default.
(I considered suggesting a few possible spellings for such a helper; a rough sketch of the idea follows.)
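A sketch of what that explicit variant could look like (name and placement hypothetical):

```lua
-- hypothetical helper: makes "nil means clear" an explicit,
-- opt-in choice instead of overloading set_header()
local function set_or_clear_header(name, value)
  if value == nil then
    return kong.response.clear_header(name)
  end
  return kong.response.set_header(name, value)
end
```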
Sounds like there may be time for internal discussion around how to best lay it out in the PDK for Kong. I prefer simplicity for an end user, so if I want to set a header to nil I should be allowed to do so by calling something that is as straightforward and easy + human-readable as possible. For now I don't see too much harm in having the correlation-id plugin carry the nil-guard logic sketched earlier:
So if there are no objections, I will PR this later so the 2 plugins don't break me on 1.0rc3 while testing around. But that PR will come after the http-log PR gets wrapped up, because I am a scrub and have not figured out how to break out commits on the same branch into separate PRs (if that's even possible 😆). Gave that a go earlier today at one point with my PDK change idea and realized it leaked right into my other http-log PR, lol...
The fix to the correlation-id plugin is required regardless of the direction followed by the PDK anyway (unless passing `nil` to `set_header()` becomes acceptable).
@thibaultcha's comment shows the problem with the strict `nil` handling:
Used my colleague's account to PR the fix for correlation-id when request-termination is present on a service/route.
Fix clashing plugins issue reported in Kong#3924
Btw, not allowing `nil` on set header has introduced a new bug:
Hmm, guess we will find out just how many plugins were impacted by that decision as we progress here and get into each one's nitty-gritty config options.
In f84af08, we refactored the correlation-id plugin to use the PDK (as with all plugins before our 1.0 release). This introduced the use of the `kong.response.set_header()` API, which produces an error when the header value is `nil`. However, the header value may be `nil` if the access phase did not execute (e.g. upon an Nginx or Kong-produced error). This fix ensures that the header value is not `nil` before invoking the PDK, and provides a regression test. Until the PDK API is (potentially) revisited, this is an acceptable fix. See #3924

Signed-off-by: Thibault Charbonnier <[email protected]>
@jeremyjpj0916 Last Wednesday, we planned some work to audit the plugins and ensure none is vulnerable to this bug. @bungle is on it but hasn't been through all of them yet :)
@thibaultcha, I did test your latest 1.0rc3 version candidate; the initial migrations command seemed to work, but I can't bring up a node currently. Result of initial migrations:
Trying to start up a 1.0rc3 version:
I think @james-callahan did do a PR for this fix, but for now I can't progress with testing on 1.0rc3; I did confirm the ACL and migration consensus worked, though. Looks like I will roll back again to 0.14.1 until rc4 is here with the fix. cc @p0pr0ck5: added you here since I know I mentioned I would have the results tonight from the other 404 issue; looks like it won't be so until later.
@jeremyjpj0916 Yes, several reports of this issue came up. We are introducing a way to upgrade between release candidates in rc4 (see #4071 and the flag it introduces). As of today, resetting the database and doing a fresh bootstrap is required for upgrading between release candidates, or using a fresh keyspace; see the sketch below.
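In practice the reset looks something like this (assuming the default `kong` keyspace; destructive, so dev environments only):

```sh
# drop the rc keyspace entirely
cqlsh -e "DROP KEYSPACE kong;"

# create a fresh 1.0 schema instead of migrating from the old one
kong migrations bootstrap
```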
Okay, makes sense. Closing this, as I think we got the things this issue brings up fixed. Will reopen or make new issues if needed around testing 1.0rc4. I would prefer testing 1.0rc4 from a 0.14.1 node, so I will be rolling back my dev environment completely rather than doing smaller rc3 to rc4 upgrades.
Figured Kong may want some external feedback on what migrations look like from an outside user, so here are my initial thoughts based on what I see:
Summary
Generally I expect conclusive results at the end of a migration run informing me whether everything completed successfully or not. My first run of the migrations, from a Cassandra 3.11.x database backing Kong 0.14.1, seems to have yielded a mixed bag of results. Log detailed below:
Prior migrations have usually finalized themselves with a comment of
With no further details. This migration ran through until OpenShift reported "completed", so the `kong migrations up` command returned success on execution, but the logs leave me scratching my head, as they say 3 things are pending as the last known info. I am about to attempt a rerun of the migrations one more time to see if it reveals any extra detail or says everything looks good (and will return to post back the 2nd migrations execution results). Regardless, this may be a point in the migration process Kong could clean up, so the result is clear to the end user.