Signal handling #1384

lbonn · 2019-09-23T13:55:05Z

@mike-sul Here is an attempt at handling signals more cleanly. I resurrected a very old branch of mine. Is it close to what you had in mind?

Right now, it's only enabled in aktualizr's RunForever loop. Adding it to other parts of the code might be possible but would need more thought.

codecov-io · 2019-09-23T15:19:51Z

Codecov Report

Merging #1384 into master will increase coverage by 0.15%.
The diff coverage is 97.72%.

@@            Coverage Diff             @@
##           master    #1384      +/-   ##
==========================================
+ Coverage   80.19%   80.35%   +0.15%     
==========================================
  Files         178      180       +2     
  Lines       10585    10624      +39     
==========================================
+ Hits         8489     8537      +48     
+ Misses       2096     2087       -9

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/libaktualizr/utilities/sig_handler.h | `100% <100%> (ø)` | |
| src/libaktualizr/primary/aktualizr.cc | `97.61% <100%> (+0.98%)` | ⬆️ |
| src/libaktualizr/primary/aktualizr.h | `100% <100%> (ø)` | ⬆️ |
| src/aktualizr_primary/main.cc | `86.66% <100%> (+0.8%)` | ⬆️ |
| src/libaktualizr/utilities/sig_handler.cc | `96.29% <96.29%> (ø)` | |
| src/libaktualizr/storage/sqlstorage.cc | `77.12% <0%> (+0.57%)` | ⬆️ |
| src/libaktualizr/package_manager/ostreemanager.cc | `77.95% <0%> (+0.9%)` | ⬆️ |
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42f2cc7...48a6dc9. Read the comment docs.

mike-sul · 2019-09-24T07:07:05Z

src/libaktualizr/primary/aktualizr.cc

+    bool exiting = false;
+    std::mutex exit_m;
+    std::condition_variable exit_cv;
+    SigHandler::get().start([&exit_m, &exit_cv, &exiting]() {


IMHO, a single handler should 'live' at the application level, not the library level, since an application might incorporate not just libaktualizr but also other libraries and functionality that need to trigger specific teardown actions on a system signal. Specifically, I would put the signal handler registration and instantiation at the aktualizr app (aktualizr-primary) level and have it trigger teardown via the Aktualizr API. That would allow an application developer to extend or override the signal handling.
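To illustrate the split suggested here, the following is a minimal sketch, not the PR's code. `lib_sketch::Updater`, `app_sketch::install_handlers` and `app_sketch::forward_pending_signal` are hypothetical names invented for this example. The library side exposes only a `Shutdown()` method and never calls `signal(2)`; the application decides which signals to catch and forwards them from normal (non-handler) context. It assumes `std::atomic<int>` is lock-free on the target platform, which is what makes the store inside the handler safe.

```cpp
#include <atomic>
#include <csignal>

namespace lib_sketch {
// Library side (hypothetical): knows nothing about signals.
class Updater {
 public:
  void Shutdown() { stop_requested_.store(true); }
  bool StopRequested() const { return stop_requested_.load(); }

 private:
  std::atomic<bool> stop_requested_{false};
};
}  // namespace lib_sketch

namespace app_sketch {
// 0 means "no signal pending".
std::atomic<int> last_signal{0};

// The raw handler only records the signal number.
// Assumption: std::atomic<int> is lock-free here.
void record_signal(int signum) { last_signal.store(signum); }

// Application side: the executable chooses which signals it handles.
void install_handlers() {
  std::signal(SIGINT, record_signal);
  std::signal(SIGTERM, record_signal);
}

// Called from the application's own loop, outside of handler context.
// Returns true if a pending signal was forwarded to the library.
bool forward_pending_signal(lib_sketch::Updater& updater) {
  const int signum = last_signal.exchange(0);
  if (signum == 0) {
    return false;
  }
  updater.Shutdown();
  return true;
}
}  // namespace app_sketch
```

With this layout, another application embedding the same library could install different handlers, or none at all, without touching library code.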

mike-sul · 2019-09-24T07:14:46Z

src/libaktualizr/utilities/sig_handler.cc

+        return;
+      }
+
+      boost::this_thread::sleep_for(boost::chrono::seconds(1));


Why not just sleep for masked_secs_ and then run on_signal()? Or is this for the case when another signal arrives during the masking delay? In that case it could simply be ignored. I'm not sure I fully understand the value of delayed signal handling.

The initial idea was that some users could mask signals to protect some critical operation, but only for a limited time so that there could never be any infinite blocking. But the aktualizr code base has changed quite a lot since then.

I'll probably remove this functionality entirely.

mike-sul · 2019-09-24T07:15:09Z

src/libaktualizr/utilities/sig_handler.cc

+      bool signal = signal_marker_.exchange(false);
+
+      if (signal) {
+        LOG_INFO << "received KILL request";


It's not just a KILL signal handler.

mike-sul · 2019-09-24T07:18:04Z

src/libaktualizr/utilities/sig_handler.cc

+    }
+  });
+
+  signal(SIGHUP, signal_handler);


I would make it configurable at the application level, i.e. whoever defines a signal handler should specify which signal(s) the handler is intended for. Ideally, as an application/executable developer, I should be able to define different signal handlers for different signals.
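One way to picture "different handlers for different signals" is a small dispatcher owned by the application. This is only a sketch under stated assumptions: `dispatch_sketch::SignalDispatcher`, `mark_pending` and `kMaxSignal` are hypothetical and are not the `SigHandler` class from this PR. The raw handler just sets a per-signal atomic flag; the callbacks themselves run later, from normal context, when the application calls `dispatch()`.

```cpp
#include <atomic>
#include <csignal>
#include <functional>
#include <map>
#include <utility>

namespace dispatch_sketch {

// Assumption: signal numbers of interest are below this bound.
constexpr int kMaxSignal = 64;

// Static storage, so every flag starts out false.
std::atomic<bool> pending[kMaxSignal];

// Raw handler: only flips a flag (assumes std::atomic<bool> is lock-free).
void mark_pending(int signum) {
  if (signum > 0 && signum < kMaxSignal) {
    pending[signum].store(true);
  }
}

class SignalDispatcher {
 public:
  // Registers a callback for one signal. Returns false if signum is out of range.
  bool on(int signum, std::function<void()> callback) {
    if (signum <= 0 || signum >= kMaxSignal) {
      return false;
    }
    callbacks_[signum] = std::move(callback);
    std::signal(signum, mark_pending);
    return true;
  }

  // Runs the callbacks of all signals received since the last call.
  // Must be called from normal context, never from a signal handler.
  int dispatch() {
    int ran = 0;
    for (auto& entry : callbacks_) {
      if (pending[entry.first].exchange(false)) {
        entry.second();
        ++ran;
      }
    }
    return ran;
  }

 private:
  std::map<int, std::function<void()>> callbacks_;
};

}  // namespace dispatch_sketch
```

An application could, for example, register one callback for SIGTERM that requests shutdown and a different one for SIGINT, and call `dispatch()` once per iteration of its main loop.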

KostiantynBushko · 2019-09-24T07:45:37Z

I agree with @mike-sul: the signal handler should not be part of libaktualizr...
If, for example, we set a handler for SIGINT and do not exit, we will never be able to kill the app!
I tested it with my client: when using RunForever(), I can't stop the app by pressing Ctrl-C or by using kill, which is expected since we handle SIGINT and never call exit...

lbonn · 2019-09-24T09:28:53Z

Yes, you're both right, this PR is not ready yet.

lbonn · 2019-09-24T13:09:55Z

It should be simpler and more interoperable now.

mike-sul · 2019-09-25T06:44:12Z

src/libaktualizr/primary/aktualizr.cc

    }
    uptane_client_->completeInstall();
  });
  return future;
 }

+void Aktualizr::Shutdown() {
+  Abort();


Abort() stops the command queue thread, which is also stopped in the CommandQueue dtor, itself triggered when the Aktualizr instance is destroyed. So calling Abort() during signal handling, from an app/exe that destroys the Aktualizr instance anyway, looks redundant. Having said that, I think it might be useful if RunForever() and Shutdown() need to be called multiple times within a single Aktualizr instance lifetime. Perhaps break it into two methods?

Abort() docstring says that it aborts the currently running command if it can be aborted, that's why it sounded appropriate to call it from the signal handling.

So you're suggesting moving Abort() out of this method and into the signal handler, which would then call Abort(); Shutdown();?

Or, given the docstring, maybe I should merge Shutdown() into Abort(), so that Abort() would also apply to RunForever()?

I think it depends on what Shutdown() is intended for. If we want to stop Aktualizr during an app exit, then this method is not needed at all, because teardown/shutdown happens in the Aktualizr dtor.

If Shutdown() is intended for the case where a user/app wants to stop aktualizr's UptaneCycle and then run it again within a single Aktualizr instance lifetime, then the user will need to call Initialize() again if Abort() is part of Shutdown(). BTW, Abort() doesn't stop the report queue thread.

Whether Abort() should be part of the signal handler routine depends on whether ongoing operations need to be aborted during a restart. I don't have a strong opinion here. Given that the most time-consuming operation is the download, and we support download resuming, I am fine with aborting, although it might have a negative impact when a DB write or communication with secondaries is interrupted. I would go with Abort() for now.

I'm not sure exactly what you mean by your first point. Do you mean that Shutdown() wouldn't be useful as a handler for a GUI event, for example? And are you saying that calling Abort() there is not useful because it's also called later? The point is that calling Abort() early can speed up the teardown and trigger the ~Aktualizr() call sooner. Or am I missing something?

But your second point sounds right, calling Initialize() again after Shutdown() sounds quite wrong. I'll move Abort() to the signal handler.
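As an illustration of the Shutdown()/RunForever() interaction discussed in this thread, here is a minimal sketch of an interruptible polling loop. `loop_sketch::PollingLoop` is a hypothetical class, not the real `Aktualizr` one: the update cycle is replaced by a counter, and the flag is deliberately never reset, so whether a second `RunForever()` should be possible after `Shutdown()` is left open, as in the discussion. Abort-like behaviour is intentionally not part of `Shutdown()` here.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

namespace loop_sketch {

class PollingLoop {
 public:
  explicit PollingLoop(std::chrono::milliseconds interval) : interval_(interval) {}

  // Runs one cycle, then sleeps for the polling interval, repeating until
  // Shutdown() has been called. Returns the number of cycles executed.
  int RunForever() {
    int cycles = 0;
    while (true) {
      ++cycles;  // stand-in for one update cycle
      std::unique_lock<std::mutex> lock(m_);
      // Wakes up early if Shutdown() sets the flag during the sleep.
      if (cv_.wait_for(lock, interval_, [this] { return exiting_; })) {
        break;
      }
    }
    return cycles;
  }

  // Requests the loop to stop. Takes a mutex, so it is not async-signal-safe:
  // call it from a normal thread, not directly from a signal handler.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> guard(m_);
      exiting_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::chrono::milliseconds interval_;
  std::mutex m_;
  std::condition_variable cv_;
  bool exiting_{false};
};

}  // namespace loop_sketch
```

Because the wait uses a predicate, a `Shutdown()` that arrives while a cycle is still running is not lost: the next `wait_for` returns immediately instead of sleeping for the full interval.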

mike-sul · 2019-09-25T06:46:44Z

src/libaktualizr/primary/aktualizr.cc

+      if (exit_cond_.cv.wait_for(l, std::chrono::seconds(config_.uptane.polling_sec),
+                                 [this] { return exit_cond_.flag; })) {
+        break;
+      }
    }
    uptane_client_->completeInstall();


If completeInstall() is not called, will that be handled properly during the subsequent Aktualizr/UptaneCycle()?

Right now this method is used for rebooting. So:

- if the system reboots and aktualizr is killed before this method is called, nothing is lost
- if aktualizr receives SIGKILL or crashes before this method is called (rare), isInstallCompletionRequired() should still return true when systemd restarts aktualizr, and it should reboot then. But I don't think we cover that with any unit test.

mike-sul · 2019-09-25T06:49:42Z

src/aktualizr_primary/main.cc

+    SigHandler::get().start([&aktualizr]() { aktualizr.Shutdown(); });
+    SigHandler::signal(SIGHUP);
+    SigHandler::signal(SIGINT);
+    SigHandler::signal(SIGTERM);


I hope UptaneCycle() can finish its work within systemd's DefaultTimeoutStopSec=, which defaults to 90s; otherwise systemd will send SIGKILL and cause a non-graceful shutdown of Aktualizr. Perhaps we might consider setting a higher value of TimeoutStopSec= in https://github.com/advancedtelematic/meta-updater/blob/master/recipes-sota/aktualizr/files/aktualizr-secondary.service

I figured that 90s is actually reasonable. How long would you suggest it should wait?

In any case, we cannot avoid all cases of abrupt shutdown, and if aktualizr doesn't recover after one, we consider that a bug.

> How long would you suggest it should wait?

I suggest we consider changing the default value (90s) only if we run into an issue during the CI/test phase. No need to do anything here at the moment, just keep it in mind.

mike-sul · 2019-09-25T10:43:56Z

Perhaps it makes sense to add a test that checks this from the aktualizr client's point of view, e.g. spawn an aktualizr process, send SIGTERM to it, and wait for/join the process?
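The spawn, signal and join idea can be sketched with plain POSIX calls. This is only an illustration and not the test from this PR: a forked child stands in for the aktualizr process, and `kill_sketch::spawn_and_terminate`, `child_main` and `on_term` are hypothetical names. The pipe is there so the parent sends SIGTERM only after the child has installed its handler.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace kill_sketch {

volatile std::sig_atomic_t term_requested = 0;

void on_term(int) { term_requested = 1; }

// Child: installs the handler, reports readiness, then polls the flag and
// exits with status 0 once SIGTERM has been seen.
[[noreturn]] void child_main(int ready_fd) {
  std::signal(SIGTERM, on_term);
  const char byte = 'r';
  if (write(ready_fd, &byte, 1) != 1) {
    _exit(2);
  }
  while (term_requested == 0) {
    usleep(10 * 1000);  // returns early if interrupted by a signal
  }
  _exit(0);
}

// Parent: returns the child's exit code, or -1 if it did not exit normally
// (for example because it was killed by the signal instead of handling it).
int spawn_and_terminate() {
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }
  const pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }
  if (pid == 0) {
    close(fds[0]);
    child_main(fds[1]);
  }
  close(fds[1]);
  char byte = 0;
  const ssize_t n = read(fds[0], &byte, 1);  // wait for the child to be ready
  close(fds[0]);
  if (n != 1) {
    kill(pid, SIGKILL);
    waitpid(pid, nullptr, 0);
    return -1;
  }
  kill(pid, SIGTERM);
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

}  // namespace kill_sketch
```

A graceful exit shows up as a normal exit status of 0. Without the handler, the default action of SIGTERM would terminate the child and `WIFEXITED` would be false.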

lbonn · 2019-09-30T13:19:16Z

@mike-sul ok, I've added a test using the python fixtures

tests/test_aktualizr_kill.py

lbonn · 2019-09-30T13:57:30Z

Rebased after the merge of #1395

lbonn requested review from pattivacek, eu-siemann, mike-sul, Zee314159, xcheng-here and kbushgit September 23, 2019 13:55

lbonn force-pushed the fix/OTA-3744/sig-handling branch from 91c7bea to 70e2d67 Compare September 23, 2019 15:18

lbonn force-pushed the fix/OTA-3744/sig-handling branch 2 times, most recently from 766c0dc to c1c1c68 Compare September 23, 2019 15:53

mike-sul reviewed Sep 24, 2019

lbonn force-pushed the fix/OTA-3744/sig-handling branch 3 times, most recently from 7a0ecc6 to 470cc63 Compare September 24, 2019 13:09

lbonn force-pushed the fix/OTA-3744/sig-handling branch 2 times, most recently from 36a4d85 to f1231c6 Compare September 24, 2019 15:32

mike-sul reviewed Sep 25, 2019

lbonn force-pushed the fix/OTA-3744/sig-handling branch from f1231c6 to 98ba6e4 Compare September 25, 2019 09:29

lbonn force-pushed the fix/OTA-3744/sig-handling branch from 98ba6e4 to 352c714 Compare September 30, 2019 13:18

mike-sul reviewed Sep 30, 2019

mike-sul reviewed Sep 30, 2019

lbonn force-pushed the fix/OTA-3744/sig-handling branch from 352c714 to 143e63a Compare September 30, 2019 13:46

lbonn added 4 commits September 30, 2019 15:50

Try to handle SIGINT more cleanly

abf9b0c

Catch the signal and transmit a shutdown message

Signal are difficult to handle safely: use an atomic boolean and poll for changes

Signed-off-by: Laurent Bonnans <[email protected]>

Add signal handler to aktualizr daemon

35bce90

Interrupts the main loop cleanly when a signal is caught

Signed-off-by: Laurent Bonnans <[email protected]>

Clean and simplify unix signal handling

d339254

- calls to signal(2) should be out of the library
- remove the temporary mask feature

Signed-off-by: Laurent Bonnans <[email protected]>

Add test for aktualizr graceful exit

f88722a

Signed-off-by: Laurent Bonnans <[email protected]>

lbonn force-pushed the fix/OTA-3744/sig-handling branch from 143e63a to f88722a Compare September 30, 2019 13:56

Add noptest property to test_install_aktualizr_and_update

48a6dc9

Signed-off-by: Laurent Bonnans <[email protected]>

mike-sul approved these changes Sep 30, 2019

lbonn merged commit ccf304a into master Sep 30, 2019

lbonn deleted the fix/OTA-3744/sig-handling branch September 30, 2019 15:06
