ceph: stop osd process more quickly during pod shutdown
The OSD needs to shut down quickly during upgrade, or other scenarios
where the OSD is being restarted. To facilitate this fast shutdown,
rook will run kill -9 on the osd process. The Ceph OSD is designed
to be safe even when killed like this. Killing the process immediately
allows ECONNREFUSED to be returned sooner, which redirects the OSD
traffic to other OSDs and reduces downtime.

Signed-off-by: Travis Nielsen <[email protected]>
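
The reasoning in the message rests on a general socket behaviour: once nothing is listening on a port, new connection attempts fail right away with ECONNREFUSED instead of waiting on a timeout. The following is a minimal sketch of that behaviour only, assuming a Unix-like host with a working loopback interface. It does not talk to Ceph, and `dialAfterClose` is a hypothetical helper name; closing the listener here merely stands in for the kernel tearing down the sockets of a killed process.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// dialAfterClose opens a loopback TCP listener on an ephemeral port, closes it
// (a stand-in for a daemon that has just been killed), then dials the same
// address. It reports whether the dial failed with ECONNREFUSED.
// Hypothetical helper for illustration; not part of Rook or Ceph.
func dialAfterClose() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	addr := ln.Addr().String()

	// After Close, nothing is listening on addr any more.
	if err := ln.Close(); err != nil {
		return false, err
	}

	conn, err := net.Dial("tcp", addr)
	if err == nil {
		// Unexpected: something else grabbed the port in the meantime.
		conn.Close()
		return false, nil
	}
	// net.Dial wraps the underlying errno, so errors.Is can match it.
	return errors.Is(err, syscall.ECONNREFUSED), nil
}

func main() {
	refused, err := dialAfterClose()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("connection refused:", refused)
}
```

A peer that sees this error can fail over immediately, which is the effect the commit is after.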
travisn authored and binoue committed Apr 10, 2020
1 parent ea29e34 commit 81346b8
Showing 1 changed file with 8 additions and 2 deletions.
10 changes: 8 additions & 2 deletions pkg/daemon/ceph/osd/daemon.go
@@ -120,8 +120,14 @@ func killCephOSDProcess(context *clusterd.Context, lvPath string) error {

 	// shut down the osd-ceph process so that lvm release does not show device in use error.
 	if pid != "" {
-		if err := context.Executor.ExecuteCommand(false, "", "kill", pid); err != nil {
-			return fmt.Errorf("failed to delete ceph-osd process. %+v", err)
+		// The OSD needs to exit as quickly as possible in order for the IO requests
+		// to be redirected to other OSDs in the cluster. The OSD is designed to tolerate failures
+		// of any kind, including power loss or kill -9. The upstream Ceph tests have for many years
+		// been testing with kill -9 so this is expected to be safe. There is a fix upstream Ceph that will
+		// improve the shutdown time of the OSD. For cleanliness we should consider removing the -9
+		// once it is backported to Nautilus: https://github.com/ceph/ceph/pull/31677.
+		if err := context.Executor.ExecuteCommand(false, "", "kill", "-9", pid); err != nil {
+			return fmt.Errorf("failed to kill ceph-osd process. %+v", err)
 		}
 	}
