Merge #1564
1564: fix(pstor): increase persistence timeouts r=tiagolobocastro a=tiagolobocastro

The pstor might be unavailable for some time, for example during an upgrade, so we should have timeouts large enough to cope with this. The pstor can also get slow at times as it's writing to disk, so we should also have a larger timeout on store.

todo: should we still carry on trying to persist on a separate task after we fail the nexus?
todo: what happens when we try to shut down the nexus first and the pstor fails?

We'll revisit this post-release.

Co-authored-by: Tiago Castro <[email protected]>
mayastor-bors and tiagolobocastro committed Dec 15, 2023
2 parents 51b47df + cfb9e62 commit e342d74
Showing 2 changed files with 14 additions and 7 deletions.
17 changes: 12 additions & 5 deletions io-engine/src/bdev/nexus/nexus_persistence.rs
@@ -215,6 +215,7 @@ impl<'n> Nexus<'n> {
 };

 let mut retry = PersistentStore::retries();
+let mut log = true;
 loop {
 let Err(err) = PersistentStore::put(&key, &info.inner).await else {
 trace!(?key, "{self:?}: the state was saved successfully");
@@ -223,20 +224,26 @@ impl<'n> Nexus<'n> {

 retry -= 1;
 if retry == 0 {
+error!(
+"{self:?}: failed to persist nexus information: {err}, giving up..."
+);
 return Err(Error::SaveStateFailed {
 source: err,
 name: self.name.clone(),
 });
 }

-error!(
-"{self:?}: failed to persist nexus information, \
-will retry ({retry} left): {err}"
-);
+if log {
+error!(
+"{self:?}: failed to persist nexus information, \
+will retry silently ({retry} left): {err}..."
+);
+log = false;
+}

 // Allow some time for the connection to the persistent
 // store to be re-established before retrying the operation.
-if mayastor_sleep(Duration::from_secs(1)).await.is_err() {
+if mayastor_sleep(Duration::from_secs(2)).await.is_err() {
 error!("{self:?}: failed to wait for sleep");
 }
 }
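The change above logs only the first failed attempt, retries quietly after that, and logs once more when it gives up. A minimal standalone sketch of that pattern follows. It is synchronous and uses only the standard library, so it is not the io-engine code: `retry_log_once`, its parameters, and the `log` callback are hypothetical names chosen for illustration. It also uses `saturating_sub` where the diff uses `retry -= 1`, so that a retry count of zero cannot underflow.

```rust
use std::fmt::Display;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: runs `op` until it succeeds or `retries` attempts
/// have failed, sleeping `backoff` between attempts. Only the first failure
/// and the final "giving up" failure are passed to `log`.
fn retry_log_once<T, E: Display>(
    retries: u8,
    backoff: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut log: impl FnMut(String),
) -> Result<T, E> {
    let mut left = retries;
    let mut first_failure = true;
    loop {
        let err = match op() {
            Ok(value) => return Ok(value),
            Err(err) => err,
        };

        // Assumption: saturate rather than underflow when `retries` is 0.
        left = left.saturating_sub(1);
        if left == 0 {
            log(format!("giving up: {err}"));
            return Err(err);
        }

        // Report the first failure only; later retries stay silent.
        if first_failure {
            log(format!("will retry silently ({left} left): {err}"));
            first_failure = false;
        }

        // Give the store some time to come back before the next attempt.
        sleep(backoff);
    }
}
```

With this shape, an operation that fails twice and then succeeds produces a single log line, while one that never succeeds produces two: the first failure and the final one.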
4 changes: 2 additions & 2 deletions io-engine/src/core/env.rs
@@ -172,12 +172,12 @@ pub struct MayastorCliArgs {
 pub ps_endpoint: Option<String>,
 #[clap(
 long = "ps-timeout",
-default_value = "10s",
+default_value = "15s",
 value_parser = parse_ps_timeout,
 )]
 /// Persistent store timeout.
 pub ps_timeout: Duration,
-#[clap(long = "ps-retries", default_value = "30")]
+#[clap(long = "ps-retries", default_value = "100")]
 /// Persistent store operation retries.
 pub ps_retries: u8,
 #[clap(long = "bdev-pool-size", default_value = "65535")]
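To get a feel for what the new defaults mean together, here is a rough upper bound on how long the retry loop could block before failing the nexus. It assumes that each `PersistentStore::put` attempt takes at most `ps_timeout`, which this diff does not confirm, and that a sleep follows every failed attempt except the last, as in the loop in `nexus_persistence.rs`. The function name `worst_case` is hypothetical.

```rust
use std::time::Duration;

/// Hypothetical estimate: upper bound on the time spent in the retry loop,
/// assuming every attempt consumes the full timeout and then fails.
fn worst_case(timeout: Duration, retries: u8, backoff: Duration) -> Duration {
    let attempts = u32::from(retries);
    // One timeout per attempt, plus one sleep after each attempt but the last.
    timeout * attempts + backoff * attempts.saturating_sub(1)
}
```

Under these assumptions the old defaults (10s timeout, 30 retries, 1s sleep) bound the loop at 329 seconds, and the new defaults (15s, 100 retries, 2s) at 1698 seconds, a little over 28 minutes.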
