Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better signal handling for proxy, stern #2715

Merged
merged 15 commits into from
Sep 23, 2022

Conversation

fisx
Copy link
Contributor

@fisx fisx commented Sep 20, 2022

https://wearezeta.atlassian.net/browse/SQPIT-1431

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@fisx fisx temporarily deployed to cachix September 20, 2022 11:55 Inactive
@fisx fisx temporarily deployed to cachix September 20, 2022 11:55 Inactive
@zebot zebot added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Sep 20, 2022
@fisx
Copy link
Contributor Author

fisx commented Sep 20, 2022

Can't reproduce, what am I doing wrong?

When I kill -SIGINT <cannonpid>, I get both files /tmp/x[12], so the handler is called. If I kill -SIGINT <brigpid>, I only get /tmp/x1, but not /tmp/x2. When I don't install the signal handlers, I get the same, so the finally doesn't seem to be skipped if we don't handle signals here.

@akshaymankar
Copy link
Member

akshaymankar commented Sep 20, 2022

Can you please try SIGTERM? SIGNIT is actually handled correctly by default, as the issue says.

@fisx
Copy link
Contributor Author

fisx commented Sep 20, 2022

Can you please try SIGTERM? SIGNIT is actually handled correctly by default, as the issue says.

Doesn't change anything; the handler still doesn't get caught.

Anyway that's besides the point, no? I'm installing a SIGINT handler that doesn't get called if I send a SIGINT. I must be holding it wrong, somehow.

@fisx
Copy link
Contributor Author

fisx commented Sep 20, 2022

ah, i see: brig is calling runSettingsWithShutdown, and cannon isn't. that function installs its own signal handlers and thus removes the ones i've just installed. so it looks like brig is fine? which services throw the 502 / which services are they talking to at the time. I'm going to check the others.

This reverts commit a175577.
@akshaymankar
Copy link
Member

ah, i see: brig is calling runSettingsWithShutdown, and cannon isn't. that function installs its own signal handlers and thus removes the ones i've just installed. so it looks like brig is fine? which services throw the 502 / which services are they talking to at the time. I'm going to check the others.

Nice catch! We don't actually have logs from last deployment to prod because they're gone. So, I guess we can either spam staging with a lot of requests and do a deployment and find out or, we could just go through all the services and see which ones don't use runSettingsWithShutdown.

@fisx
Copy link
Contributor Author

fisx commented Sep 20, 2022

or, we could just go through all the services and see which ones don't use runSettingsWithShutdown.

that was easy! there is only proxy who calls runSettings directly. patch coming up!

@fisx fisx temporarily deployed to cachix September 20, 2022 13:30 Inactive
@fisx fisx temporarily deployed to cachix September 20, 2022 13:30 Inactive
@fisx fisx temporarily deployed to cachix September 20, 2022 13:35 Inactive
@fisx fisx temporarily deployed to cachix September 20, 2022 13:35 Inactive
@fisx fisx marked this pull request as ready for review September 20, 2022 13:35
@fisx fisx changed the title debug some other thign Better signal handling for proxy, stern. Sep 20, 2022
@fisx fisx changed the title Better signal handling for proxy, stern. Better signal handling for proxy, stern Sep 20, 2022
@fisx fisx requested a review from akshaymankar September 20, 2022 13:36
@fisx
Copy link
Contributor Author

fisx commented Sep 20, 2022

Not at all confident, but this might resolve it, no? Thanks for finding the issue, @akshaymankar, this was fun!

@fisx
Copy link
Contributor Author

fisx commented Sep 20, 2022

we should also add an hlint rule that forbids us to use runSettings anywhere... :-)

void $ installHandler sigTERM (signalHandler (env e) tid) Nothing
void $ installHandler sigINT (signalHandler (env e) tid) Nothing
runSettings s app `finally` do
runSettingsWithShutdown s app 5 `finally` do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the impact of this 5 second wait, so I will run a couple of experiments to see that the draining happens correctly. Earlier we were also not closing the listen socket, so impact of that is another mystery to me.
I am being super careful because this functionality is important and we don't have nice ways of automatically testing it.

Copy link
Member

@akshaymankar akshaymankar Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is closing all the websockets at the same time and then calling the drain script. I suspect this might be due to some cleanup in wai. Given this place is for sure not our problem, I suggest that we revert the change in cannon and rest of the things can be applied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

67946bd (also 0f264d2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

67946bd (also 0f264d2)

no, that's different commits. but the github UI is messing with me, so i'll just leave you to the actually commit history and the "files changed" tab. :)

elland added a commit that referenced this pull request Sep 21, 2022
Also applies hlint again to the whole codebase (excluding tests), as we
had some drift between finalising hlint and new PRs being merged without
being linted / having CI catch those cases.

I also disalbed the pipefail from the script, as that would
short-circuit the linter on first issue found. Hopefully that doesn't
mess with CI.

PS: This will fail CI linters phase until #2715 has been merged.
@elland
Copy link
Contributor

elland commented Sep 21, 2022

After running hlint with the changes from #2718 it found a possible spot where this is also not used, in Federator.Response.

@akshaymankar
Copy link
Member

I also suspect that the reason for 502s is that we have 5 second grace period for requests which are already running, maybe we can bump it to 30s. Specially in cargohold where the uploads could take some time to finish.

@fisx
Copy link
Contributor Author

fisx commented Sep 21, 2022

After running hlint with the changes from #2718 it found a possible spot where this is also not used, in Federator.Response.

unrelated.

@fisx
Copy link
Contributor Author

fisx commented Sep 21, 2022

I also suspect that the reason for 502s is that we have 5 second grace period for requests which are already running, maybe we can bump it to 30s. Specially in cargohold where the uploads could take some time to finish.

I'll do that 👍

@fisx fisx temporarily deployed to cachix September 21, 2022 10:31 Inactive
@fisx fisx temporarily deployed to cachix September 21, 2022 10:31 Inactive
@fisx fisx requested a review from akshaymankar September 21, 2022 10:33
Copy link
Member

@akshaymankar akshaymankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please update the changelog though!

tools/stern/src/Stern/API.hs Outdated Show resolved Hide resolved
changelog.d/3-bug-fixes/sqpit-1431 Outdated Show resolved Hide resolved
@fisx fisx temporarily deployed to cachix September 22, 2022 08:58 Inactive
@fisx fisx temporarily deployed to cachix September 22, 2022 08:58 Inactive
@fisx fisx temporarily deployed to cachix September 22, 2022 08:59 Inactive
@fisx fisx temporarily deployed to cachix September 22, 2022 08:59 Inactive
elland added a commit that referenced this pull request Sep 22, 2022
* Add new custom hlint rule for runSetting.

Also applies hlint again to the whole codebase (excluding tests), as we
had some drift between finalising hlint and new PRs being merged without
being linted / having CI catch those cases.

I also disalbed the pipefail from the script, as that would
short-circuit the linter on first issue found. Hopefully that doesn't
mess with CI.

PS: This will fail CI linters phase until #2715 has been merged.

* Removed Federator.Response from runSettings rule.
@fisx fisx temporarily deployed to cachix September 22, 2022 19:25 Inactive
@fisx fisx temporarily deployed to cachix September 22, 2022 19:25 Inactive
@fisx fisx merged commit b83b671 into develop Sep 23, 2022
@fisx fisx deleted the SQPIT-1431-explicit-sigint-handler branch September 23, 2022 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants