This repository has been archived by the owner on May 21, 2024. It is now read-only.

Signal handling #1384

Merged 5 commits on Sep 30, 2019

Contributor

@lbonn lbonn commented Sep 23, 2019

@mike-sul Here is a try for handling signals more cleanly. I resurrected a very old branch of mine. Is it close to what you had in mind?

Right now, it's only enabled in aktualizr's RunForever loop. Adding it to other parts of the code might be possible but would need more thinking.

codecov-io commented Sep 23, 2019

Codecov Report

Merging #1384 into master will increase coverage by 0.15%.
The diff coverage is 97.72%.

@@            Coverage Diff             @@
##           master    #1384      +/-   ##
==========================================
+ Coverage   80.19%   80.35%   +0.15%     
==========================================
  Files         178      180       +2     
  Lines       10585    10624      +39     
==========================================
+ Hits         8489     8537      +48     
+ Misses       2096     2087       -9

| Impacted Files | Coverage Δ |
|---|---|
| src/libaktualizr/utilities/sig_handler.h | 100% <100%> (ø) |
| src/libaktualizr/primary/aktualizr.cc | 97.61% <100%> (+0.98%) ⬆️ |
| src/libaktualizr/primary/aktualizr.h | 100% <100%> (ø) ⬆️ |
| src/aktualizr_primary/main.cc | 86.66% <100%> (+0.8%) ⬆️ |
| src/libaktualizr/utilities/sig_handler.cc | 96.29% <96.29%> (ø) |
| src/libaktualizr/storage/sqlstorage.cc | 77.12% <0%> (+0.57%) ⬆️ |
| src/libaktualizr/package_manager/ostreemanager.cc | 77.95% <0%> (+0.9%) ⬆️ |
| ... and 1 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 42f2cc7...48a6dc9.

@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch 2 times, most recently from 766c0dc to c1c1c68 Compare September 23, 2019 15:53
bool exiting = false;
std::mutex exit_m;
std::condition_variable exit_cv;
SigHandler::get().start([&exit_m, &exit_cv, &exiting]() {

Collaborator

IMHO, a single handler should 'live' at the application level, not the library level, since an application might incorporate not just libaktualizr but also other libraries and functionality that need specific teardown actions on a system signal. Specifically, I would register and instantiate the signal handler at the aktualizr app (aktualizr-primary) level and trigger teardown via the Aktualizr API. That would allow an application developer to extend or override the signal handling.

return;
}

boost::this_thread::sleep_for(boost::chrono::seconds(1));

Collaborator

@mike-sul mike-sul Sep 24, 2019

Why not just sleep for masked_secs_ and then run on_signal()? Or is this for the case when another signal arrives during the masking delay? In that case it can simply be ignored. I'm not sure I fully understand the value of delayed signal handling.

Contributor Author

The initial idea was that some users could mask signals to protect some critical operation, but only for a limited time so that there could never be any infinite blocking. But the aktualizr code base has changed quite a lot since then.

I'll probably remove this functionality entirely.

bool signal = signal_marker_.exchange(false);

if (signal) {
  LOG_INFO << "received KILL request";

Collaborator

It's not just a KILL signal handler.

}
});

signal(SIGHUP, signal_handler);

Collaborator

I would make this configurable at the application level, i.e. whoever defines a signal handler should specify which signal(s) it is intended for. Ideally, as an application/executable developer, I should be able to define different handlers for different signals.

KostiantynBushko commented Sep 24, 2019

I agree with @mike-sul: the signal handler should not be part of libaktualizr.
If, for example, we set a handler for SIGINT and do not exit, we will never be able to kill the app!
I tested it with my client: when using RunForever(), I can't stop the app by pressing Ctrl-C or using kill, which is expected since we handle SIGINT and never call exit.

Contributor Author

lbonn commented Sep 24, 2019

Yes, you're both right, this PR is not ready yet.

@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch 3 times, most recently from 7a0ecc6 to 470cc63 Compare September 24, 2019 13:09

Contributor Author

lbonn commented Sep 24, 2019

It should be simpler and more inter-operable now.

@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch 2 times, most recently from 36a4d85 to f1231c6 Compare September 24, 2019 15:32
    }
    uptane_client_->completeInstall();
  });
  return future;
}

void Aktualizr::Shutdown() {
  Abort();

Collaborator

Abort() stops the command queue threads, which are also stopped in the CommandQueue dtor, itself triggered when the Aktualizr instance is destroyed. So calling Abort() during signal handling looks redundant when the app/exe destroys the Aktualizr instance anyway. Having said that, it might be useful if RunForever() and Shutdown() need to be called multiple times within a single Aktualizr instance lifetime. Perhaps break it into two methods?

Contributor Author

Abort() docstring says that it aborts the currently running command if it can be aborted, that's why it sounded appropriate to call it from the signal handling.

So you're suggesting moving Abort() out of this method and into the signal handler, which would call Abort(); Shutdown();?

Contributor Author

Or given the docstring I should maybe merge Shutdown() into Abort(): Abort() would also apply to RunForever()?

Collaborator

I think it depends on what Shutdown() is intended for. If we want to stop Aktualizr during an app exit, then this method is not needed at all, because teardown/shutdown happens in the Aktualizr dtor.

If Shutdown() is intended for the case where a user/app wants to stop aktualizr's UptaneCycle and then run it again within a single Aktualizr instance lifetime, then the user will need to call Initialize() again if abort is part of Shutdown(). BTW, abort doesn't stop the report queue thread.

Whether abort() should be part of the signal handler routine depends on whether we need to abort an ongoing operation during restart. I don't have a strong opinion here. Given that the most time-consuming operation is the download and we support download resuming, I am fine with abort, but it might have a negative impact if a DB write or communication with secondaries is interrupted. I would go with abort() for now.

Contributor Author

I'm not sure exactly what you mean in your first point. Do you mean that Shutdown() wouldn't be useful as a handler for a GUI event, for example? And are you saying that calling Abort() there is not useful because it's also called later? The point is that calling Abort() early can speed up the teardown and trigger the ~Aktualizr() call sooner. Or am I missing something?

But your second point sounds right, calling Initialize() again after Shutdown() sounds quite wrong. I'll move Abort() to the signal handler.
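
To make the outcome of this thread concrete, here is a minimal sketch of the split being agreed on: Shutdown() only asks the loop to stop, and the application's signal callback calls Abort() before it. `FakeAktualizr` and `make_signal_callback` are hypothetical stand-ins invented for illustration; they are not the real Aktualizr API, and the bodies only record the call order.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the library object (assumption, not the real class).
class FakeAktualizr {
 public:
  // Stand-in for cancelling the command that is currently running.
  void Abort() { calls.push_back("Abort"); }
  // Stand-in for asking RunForever() to return; deliberately does NOT call Abort().
  void Shutdown() { calls.push_back("Shutdown"); }

  std::vector<std::string> calls;  // records the order of calls for inspection
};

// Builds the callback that the application (not the library) would hand to
// its signal-handling code: abort the in-flight work first, then stop the loop.
std::function<void()> make_signal_callback(FakeAktualizr& aktualizr) {
  return [&aktualizr] {
    aktualizr.Abort();
    aktualizr.Shutdown();
  };
}
```

Keeping Abort() out of Shutdown() means a caller that only wants to pause the loop and run it again later is not forced to abort whatever is in flight.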

    if (exit_cond_.cv.wait_for(l, std::chrono::seconds(config_.uptane.polling_sec),
                               [this] { return exit_cond_.flag; })) {
      break;
    }
  }
  uptane_client_->completeInstall();

Collaborator

If completeInstall() is not called, will that be handled properly during the subsequent Aktualizr/UptaneCycle()?

Contributor Author

@lbonn lbonn Sep 25, 2019

Right now this method is used for rebooting. So:

  • if the system reboots and aktualizr is killed before this method, nothing is lost
  • if aktualizr receives SIGKILL or crashes before this method (rare), isInstallCompletionRequired() should still return true when systemd would restart aktualizr and it should reboot then. But I don't think we cover that with any unit test.

SigHandler::get().start([&aktualizr]() { aktualizr.Shutdown(); });
SigHandler::signal(SIGHUP);
SigHandler::signal(SIGINT);
SigHandler::signal(SIGTERM);

Collaborator

I hope UptaneCycle() can finish its work within systemd's DefaultTimeoutStopSec=, which defaults to 90s; otherwise systemd will send SIGKILL and cause a non-graceful shutdown of Aktualizr. Perhaps we might consider setting a higher value of TimeoutStopSec= in https://github.com/advancedtelematic/meta-updater/blob/master/recipes-sota/aktualizr/files/aktualizr-secondary.service

Contributor Author

I figured that 90s is actually reasonable. How long would you suggest it should wait?

In any case, we cannot avoid all cases of abrupt shutdown, and if aktualizr doesn't recover after one, we consider that a bug.

Collaborator

@mike-sul mike-sul Sep 25, 2019

> How long would you suggest it should wait?

I suggest we consider changing the default value (90s) if we hit any issue during the CI/test phase. No need to do anything here at the moment, just keep it in mind.

@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch from f1231c6 to 98ba6e4 Compare September 25, 2019 09:29

Collaborator

Perhaps it makes sense to add a test that exercises this from the aktualizr client's point of view, e.g. spawn an aktualizr process, send SIGTERM to it, and wait for/join the process?
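
The test shape suggested here (spawn, send SIGTERM, wait) can be sketched with plain POSIX calls. This is only an illustration under stated assumptions: the child is a tiny stand-in loop rather than a real aktualizr process, `spawn_and_terminate` and `on_term` are hypothetical names, and the test that was actually added to the PR uses the project's Python fixtures instead.

```cpp
#include <cassert>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Set by the handler in the child; sig_atomic_t is safe to write from a handler.
volatile sig_atomic_t term_requested = 0;
extern "C" void on_term(int) { term_requested = 1; }

// Forks a child that idles until it receives SIGTERM, sends SIGTERM to it,
// and returns the raw wait status (or -1 if any step fails).
int spawn_and_terminate() {
  int ready[2];  // pipe used by the child to say "handler installed"
  if (pipe(ready) != 0) return -1;

  pid_t pid = fork();
  if (pid < 0) return -1;

  if (pid == 0) {  // child: stand-in for the daemon under test
    close(ready[0]);
    struct sigaction sa {};
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, nullptr);
    char byte = 1;
    if (write(ready[1], &byte, 1) != 1) _exit(2);
    close(ready[1]);
    while (!term_requested) usleep(10 * 1000);  // stand-in for the main loop
    _exit(0);                                   // graceful exit path
  }

  // parent: wait until the child is ready, then request termination
  close(ready[1]);
  char byte = 0;
  ssize_t n = read(ready[0], &byte, 1);
  close(ready[0]);
  if (n != 1) {
    kill(pid, SIGKILL);
    waitpid(pid, nullptr, 0);
    return -1;
  }
  kill(pid, SIGTERM);
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return status;
}
```

A graceful shutdown shows up as `WIFEXITED(status)` with exit code 0, whereas a process that ignored the handler path and was killed would report `WIFSIGNALED(status)`.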

@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch from 98ba6e4 to 352c714 Compare September 30, 2019 13:18

Contributor Author

lbonn commented Sep 30, 2019

@mike-sul ok, I've added a test using the python fixtures

@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch from 352c714 to 143e63a Compare September 30, 2019 13:46
Catch the signal and transmit a shutdown message

Signals are difficult to handle safely: use an atomic boolean and poll
for changes

Signed-off-by: Laurent Bonnans <[email protected]>
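
The approach named in this commit message (an atomic boolean set by the handler, polled from a normal thread) can be sketched roughly as follows. This is not the PR's sig_handler.cc: `SignalPoller`, `signal_marker`, and `on_signal_raw` are hypothetical names, and the polling period is an arbitrary parameter. The handler itself only does a lock-free atomic store, and the user callback runs on the polling thread where ordinary code is safe.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <csignal>
#include <functional>
#include <thread>

// The handler below relies on the atomic being lock-free to stay async-signal-safe.
static_assert(ATOMIC_BOOL_LOCK_FREE == 2, "atomic<bool> must be lock-free");

// Written by the signal handler, consumed by the polling thread.
std::atomic<bool> signal_marker{false};

// The only work done in signal context: record that a signal arrived.
extern "C" void on_signal_raw(int) { signal_marker.store(true); }

// Hypothetical helper: polls the marker and runs `on_signal` once, on a
// normal thread, after a signal has been recorded.
class SignalPoller {
 public:
  SignalPoller(std::function<void()> on_signal, std::chrono::milliseconds period)
      : thread_([this, on_signal, period] {
          while (!stop_.load()) {
            if (signal_marker.exchange(false)) {
              on_signal();  // safe here: we are not inside the signal handler
              return;
            }
            std::this_thread::sleep_for(period);
          }
        }) {}

  // Stops polling and joins the thread.
  ~SignalPoller() {
    stop_.store(true);
    if (thread_.joinable()) thread_.join();
  }

 private:
  std::atomic<bool> stop_{false};  // declared before thread_ so it is initialized first
  std::thread thread_;
};
```

In line with the review, the call that installs the handler, for example `std::signal(SIGTERM, on_signal_raw)`, would live in the application rather than in the library.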
Interrupts the main loop cleanly when a signal is caught

Signed-off-by: Laurent Bonnans <[email protected]>
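
The clean interruption of the main loop relies on a timed condition-variable wait with a predicate, as in the `exit_cond_.cv.wait_for(...)` snippet quoted in the review. Below is a self-contained sketch of that pattern. `Poller` is a hypothetical stand-in for the real class, and the cycle counter replaces the actual update cycle.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an object that runs a periodic cycle until told to stop.
class Poller {
 public:
  explicit Poller(std::chrono::milliseconds interval) : interval_(interval) {}

  // Runs one cycle, then waits up to `interval_` for a shutdown request.
  // Returns the number of cycles that were run.
  int RunForever() {
    int cycles = 0;
    while (true) {
      ++cycles;  // stand-in for one update cycle, done without holding the lock
      std::unique_lock<std::mutex> lock(m_);
      // wait_for returns the predicate's value: true means shutdown was
      // requested, false means the interval elapsed and we should loop again.
      if (cv_.wait_for(lock, interval_, [this] { return flag_; })) break;
    }
    return cycles;
  }

  // Requests that RunForever() return; wakes the waiter immediately.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> guard(m_);
      flag_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::chrono::milliseconds interval_;
  std::mutex m_;
  std::condition_variable cv_;
  bool flag_ = false;  // guarded by m_
};
```

Because the flag is checked through the predicate, a Shutdown() that arrives before the wait starts is not lost, and one that arrives during the wait ends it without waiting for the full polling interval.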
- calls to signal(2) should be out of the library
- remove the temporary mask feature

Signed-off-by: Laurent Bonnans <[email protected]>
@lbonn lbonn force-pushed the fix/OTA-3744/sig-handling branch from 143e63a to f88722a Compare September 30, 2019 13:56

Contributor Author

lbonn commented Sep 30, 2019

Rebased after the merge of #1395

@lbonn lbonn merged commit ccf304a into master Sep 30, 2019
@lbonn lbonn deleted the fix/OTA-3744/sig-handling branch September 30, 2019 15:06