Skip to content

Commit

Permalink
pageserver: generation number support in keys and indices (#5140)
Browse files Browse the repository at this point in the history
## Problem

To implement split brain protection, we need tenants and timelines to be
aware of their current generation, and use it when composing S3 keys.


## Summary of changes

- A `Generation` type is introduced in the `utils` crate -- it is in
this broadly-visible location because it will later be used from
`control_plane/` as well as `pageserver/`. Generations can be a number,
None, or Broken, to support legacy content (None), and Tenants in the
broken state (Broken).
- Tenant, Timeline, and RemoteTimelineClient all get a generation
attribute
- IndexPart's IndexLayerMetadata has a new `generation` attribute.
Legacy layers' metadata will deserialize to Generation::none().
- Remote paths are composed with a trailing generation suffix. If a
generation is equal to Generation::none() (as it currently always is),
then this suffix is an empty string.
- Functions for composing remote storage paths added in
remote_timeline_client: these avoid the way that we currently always
compose a local path and then strip the prefix, and avoid requiring a
PageserverConf reference on functions that want to create remote paths
(the conf is only needed for local paths). These are less DRY than the
old functions, but remote storage paths are a very rarely changing
thing, so it's better to write out our paths clearly in the functions
than to compose timeline paths from tenant paths, etc.
- Code paths that construct a Tenant take a `generation` argument in
anticipation that we will soon load generations on startup before
constructing Tenant.

Until the whole feature is done, we don't want any generation-ful keys
though: so initially we will carry this everywhere with the special
Generation::none() value.

Closes: #5135

Co-authored-by: Christian Schwarz <[email protected]>
  • Loading branch information
jcsp and problame authored Aug 31, 2023
1 parent f2c2144 commit 83ae2bd
Show file tree
Hide file tree
Showing 13 changed files with 433 additions and 159 deletions.
113 changes: 113 additions & 0 deletions libs/utils/src/generation.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
use std::fmt::Debug;

use serde::{Deserialize, Serialize};

/// Tenant generations are used to provide split-brain safety and allow
/// multiple pageservers to attach the same tenant concurrently.
///
/// See docs/rfcs/025-generation-numbers.md for detail on how generation
/// numbers are used.
#[derive(Copy, Clone, Eq, PartialEq, PartialOrd, Ord)]
pub enum Generation {
// Generations with this magic value will not add a suffix to S3 keys, and will not
// be included in persisted index_part.json. This value is only to be used
// during migration from pre-generation metadata to generation-aware metadata,
// and should eventually go away.
//
// A special Generation is used rather than always wrapping Generation in an Option,
// so that code handling generations doesn't have to be aware of the legacy
// case everywhere it touches a generation.
None,
// Generations with this magic value may never be used to construct S3 keys:
// we will panic if someone tries to. This is for Tenants in the "Broken" state,
// so that we can satisfy their constructor with a Generation without risking
// a code bug using it in an S3 write (broken tenants should never write)
Broken,
Valid(u32),
}

/// The Generation type represents a number associated with a Tenant, which
/// increments every time the tenant is attached to a new pageserver, or
/// an attached pageserver restarts.
///
/// It is included as a suffix in S3 keys, as a protection against split-brain
/// scenarios where pageservers might otherwise issue conflicting writes to
/// remote storage
impl Generation {
/// Create a new Generation that represents a legacy key format with
/// no generation suffix
pub fn none() -> Self {
Self::None
}

// Create a new generation that will panic if you try to use get_suffix
pub fn broken() -> Self {
Self::Broken
}

pub fn new(v: u32) -> Self {
Self::Valid(v)
}

pub fn is_none(&self) -> bool {
matches!(self, Self::None)
}

pub fn get_suffix(&self) -> String {
match self {
Self::Valid(v) => {
format!("-{:08x}", v)
}
Self::None => "".into(),
Self::Broken => {
panic!("Tried to use a broken generation");
}
}
}
}

impl Serialize for Generation {
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where
S: serde::Serializer,
{
if let Self::Valid(v) = self {
v.serialize(serializer)
} else {
// We should never be asked to serialize a None or Broken. Structures
// that include an optional generation should convert None to an
// Option<Generation>::None
Err(serde::ser::Error::custom(
"Tried to serialize invalid generation ({self})",
))
}
}
}

impl<'de> Deserialize<'de> for Generation {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
Ok(Self::Valid(u32::deserialize(deserializer)?))
}
}

// We intentionally do not implement Display for Generation, to reduce the
// risk of a bug where the generation is used in a format!() string directly
// instead of using get_suffix().
impl Debug for Generation {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::Valid(v) => {
write!(f, "{:08x}", v)
}
Self::None => {
write!(f, "<none>")
}
Self::Broken => {
write!(f, "<broken>")
}
}
}
}
3 changes: 3 additions & 0 deletions libs/utils/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,9 @@ pub mod id;
// http endpoint utils
pub mod http;

// definition of the Generation type for pageserver attachment APIs
pub mod generation;

// common log initialisation routine
pub mod logging;

Expand Down
17 changes: 0 additions & 17 deletions pageserver/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -643,23 +643,6 @@ impl PageServerConf {
.join(METADATA_FILE_NAME)
}

/// Files on the remote storage are stored with paths, relative to the workdir.
/// That path includes in itself both tenant and timeline ids, allowing to have a unique remote storage path.
///
/// Errors if the path provided does not start from pageserver's workdir.
pub fn remote_path(&self, local_path: &Path) -> anyhow::Result<RemotePath> {
local_path
.strip_prefix(&self.workdir)
.context("Failed to strip workdir prefix")
.and_then(RemotePath::new)
.with_context(|| {
format!(
"Failed to resolve remote part of path {:?} for base {:?}",
local_path, self.workdir
)
})
}

/// Turns storage remote path of a file into its local path.
pub fn local_path(&self, remote_path: &RemotePath) -> PathBuf {
remote_path.with_base(&self.workdir)
Expand Down
27 changes: 21 additions & 6 deletions pageserver/src/tenant.rs
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,7 @@ pub use pageserver_api::models::TenantState;
use toml_edit;
use utils::{
crashsafe,
generation::Generation,
id::{TenantId, TimelineId},
lsn::{Lsn, RecordLsn},
};
Expand Down Expand Up @@ -178,6 +179,11 @@ pub struct Tenant {
tenant_conf: Arc<RwLock<TenantConfOpt>>,

tenant_id: TenantId,

/// The remote storage generation, used to protect S3 objects from split-brain.
/// Does not change over the lifetime of the [`Tenant`] object.
generation: Generation,

timelines: Mutex<HashMap<TimelineId, Arc<Timeline>>>,
// This mutex prevents creation of new timelines during GC.
// Adding yet another mutex (in addition to `timelines`) is needed because holding
Expand Down Expand Up @@ -522,6 +528,7 @@ impl Tenant {
pub(crate) fn spawn_attach(
conf: &'static PageServerConf,
tenant_id: TenantId,
generation: Generation,
broker_client: storage_broker::BrokerClientChannel,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
remote_storage: GenericRemoteStorage,
Expand All @@ -538,6 +545,7 @@ impl Tenant {
tenant_conf,
wal_redo_manager,
tenant_id,
generation,
Some(remote_storage.clone()),
));

Expand Down Expand Up @@ -648,12 +656,8 @@ impl Tenant {
.as_ref()
.ok_or_else(|| anyhow::anyhow!("cannot attach without remote storage"))?;

let remote_timeline_ids = remote_timeline_client::list_remote_timelines(
remote_storage,
self.conf,
self.tenant_id,
)
.await?;
let remote_timeline_ids =
remote_timeline_client::list_remote_timelines(remote_storage, self.tenant_id).await?;

info!("found {} timelines", remote_timeline_ids.len());

Expand All @@ -665,6 +669,7 @@ impl Tenant {
self.conf,
self.tenant_id,
timeline_id,
self.generation,
);
part_downloads.spawn(
async move {
Expand Down Expand Up @@ -851,6 +856,7 @@ impl Tenant {
TenantConfOpt::default(),
wal_redo_manager,
tenant_id,
Generation::broken(),
None,
))
}
Expand All @@ -868,6 +874,7 @@ impl Tenant {
pub(crate) fn spawn_load(
conf: &'static PageServerConf,
tenant_id: TenantId,
generation: Generation,
resources: TenantSharedResources,
init_order: Option<InitializationOrder>,
tenants: &'static tokio::sync::RwLock<TenantsMap>,
Expand All @@ -893,6 +900,7 @@ impl Tenant {
tenant_conf,
wal_redo_manager,
tenant_id,
generation,
remote_storage.clone(),
);
let tenant = Arc::new(tenant);
Expand Down Expand Up @@ -2274,6 +2282,7 @@ impl Tenant {
ancestor,
new_timeline_id,
self.tenant_id,
self.generation,
Arc::clone(&self.walredo_mgr),
resources,
pg_version,
Expand All @@ -2291,6 +2300,7 @@ impl Tenant {
tenant_conf: TenantConfOpt,
walredo_mgr: Arc<dyn WalRedoManager + Send + Sync>,
tenant_id: TenantId,
generation: Generation,
remote_storage: Option<GenericRemoteStorage>,
) -> Tenant {
let (state, mut rx) = watch::channel(state);
Expand Down Expand Up @@ -2349,6 +2359,7 @@ impl Tenant {

Tenant {
tenant_id,
generation,
conf,
// using now here is good enough approximation to catch tenants with really long
// activation times.
Expand Down Expand Up @@ -2931,6 +2942,7 @@ impl Tenant {
self.conf,
self.tenant_id,
timeline_id,
self.generation,
);
Some(remote_client)
} else {
Expand Down Expand Up @@ -3454,6 +3466,7 @@ pub mod harness {
pub conf: &'static PageServerConf,
pub tenant_conf: TenantConf,
pub tenant_id: TenantId,
pub generation: Generation,
}

static LOG_HANDLE: OnceCell<()> = OnceCell::new();
Expand Down Expand Up @@ -3495,6 +3508,7 @@ pub mod harness {
conf,
tenant_conf,
tenant_id,
generation: Generation::new(0xdeadbeef),
})
}

Expand All @@ -3521,6 +3535,7 @@ pub mod harness {
TenantConfOpt::from(self.tenant_conf),
walredo_mgr,
self.tenant_id,
self.generation,
remote_storage,
));
tenant
Expand Down
12 changes: 11 additions & 1 deletion pageserver/src/tenant/mgr.rs
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ use crate::tenant::{create_tenant_files, CreateTenantFilesMode, Tenant, TenantSt
use crate::{InitializationOrder, IGNORED_TENANT_FILE_NAME};

use utils::fs_ext::PathExt;
use utils::generation::Generation;
use utils::id::{TenantId, TimelineId};

use super::delete::DeleteTenantError;
Expand Down Expand Up @@ -202,6 +203,7 @@ pub(crate) fn schedule_local_tenant_processing(
match Tenant::spawn_attach(
conf,
tenant_id,
Generation::none(),
resources.broker_client,
tenants,
remote_storage,
Expand All @@ -224,7 +226,15 @@ pub(crate) fn schedule_local_tenant_processing(
} else {
info!("tenant {tenant_id} is assumed to be loadable, starting load operation");
// Start loading the tenant into memory. It will initially be in Loading state.
Tenant::spawn_load(conf, tenant_id, resources, init_order, tenants, ctx)
Tenant::spawn_load(
conf,
tenant_id,
Generation::none(),
resources,
init_order,
tenants,
ctx,
)
};
Ok(tenant)
}
Expand Down
Loading

1 comment on commit 83ae2bd

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests were run or test report is not available

The comment gets automatically updated with the latest test results
83ae2bd at 2023-08-31T08:24:56.381Z :recycle:

Please sign in to comment.