Skip to content

Commit

Permalink
Roll forward of bazelbuild@aa7a577: Take a shared lock on the install…
Browse files Browse the repository at this point in the history
… base from the client side.

These are the client-side changes required to implement garbage collection of stale install bases: since one Bazel server might attempt to collect the install base of another, we must ensure that a running Bazel server *or* client prevents collection of its own install base, which is achieved by acquiring an exclusive lock prior to collection (to be implemented in a followup).

Note that we keep the existing mechanism for handling concurrent attempts to create the same install base don't clash (atomic rename) because it's simpler than using the lock (which would require upgrading it from shared to exclusive and back).

Progress on bazelbuild#2109.

PiperOrigin-RevId: 702433270
Change-Id: I474f3d56ec126de5f975c543f7bf9b64f4f08124
  • Loading branch information
tjgq authored and copybara-github committed Dec 3, 2024
1 parent 2df486e commit 478209f
Show file tree
Hide file tree
Showing 8 changed files with 133 additions and 65 deletions.
2 changes: 1 addition & 1 deletion src/main/cpp/bazel_startup_options.cc
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@
namespace blaze {

BazelStartupOptions::BazelStartupOptions()
: StartupOptions("Bazel"),
: StartupOptions("Bazel", /* lock_install_base= */ true),
user_bazelrc_(""),
use_system_rc(true),
use_workspace_rc(true),
Expand Down
164 changes: 107 additions & 57 deletions src/main/cpp/blaze.cc
Original file line number Diff line number Diff line change
Expand Up @@ -94,16 +94,21 @@ using std::vector;
// The following is a treatise on how the interaction between the client and the
// server works.
//
// First, the client acquires a lock on $OUTPUT_BASE/lock (on Unix, we use
// fcntl(F_OFD_SETLK) if available or fcntl(F_SETLK) otherwise; on Windows, we
// use LockFileEx), blocking until it's available unless otherwise requested by
// the startup options. Then it verifies if it has already extracted itself by
// checking if the directory it extracts itself to (install base + a checksum)
// is present. If not, then it does the extraction. Care is taken that this
// process is atomic so that Blazen in multiple output bases do not clash.
// First the client acquires filesystem locks on the install and output bases
// (using fcntl() on Unix and LockFileEx() on Windows). The install base is
// locked in shared mode, because two simultaneous commands may run against the
// same install base. The output base is locked in exclusive mode, because two
// simultaneous commands may not run against the same output base. The client
// will busy-wait until both locks are available; however, if --noblock_for_lock
// is set, it will immediately exit if the output base lock cannot be obtained.
//
// Then the client tries to connect to the currently executing server and kills
// it if at least one of the following conditions is true:
// Then the client verifies whether it has already extracted itself into the
// install base and that it hasn't been tampered with. Otherwise, it extracts
// itself. Since the install base is not locked exclusively, an atomic rename is
// used to prevent simultaneous extraction attempts from clashing.
//
// Then the client tries to connect to the currently executing server and
// kills it if at least one of the following conditions is true:
//
// - The server is of the wrong version (as determined by the
// $OUTPUT_BASE/install symlink)
Expand All @@ -113,13 +118,13 @@ using std::vector;
// Then, if needed, the client adjusts the install link to indicate which
// version of the server it is running.
//
// In batch mode, the client then simply executes the server while taking care
// that the output base lock is kept until it finishes.
// In batch mode, the client then simply executes the server while taking
// care that the install and output base locks are held until it finishes.
//
// If in server mode, the client starts up a server if needed then sends the
// command to the client and streams back stdout and stderr. The output base
// lock is released after the command is sent to the server (the server
// implements its own locking mechanism).
// command to the client and streams back stdout and stderr. The client-side
// locks are released after the command has been sent; since the server may
// outlive the client, it must implement its own locking anyway.

// Synchronization between the client and the server is a little precarious
// because the client needs to know the PID of the server and it is not
Expand Down Expand Up @@ -196,9 +201,9 @@ class BlazeServer final {
public:
explicit BlazeServer(const StartupOptions &startup_options);

// Acquire a lock for the output base this server is running in.
// Returns the time spent waiting for the lock.
DurationMillis AcquireLock();
// Acquires locks for the install and output bases this server is running in.
// Returns the time spent waiting for locks.
DurationMillis AcquireLocks();

// Whether there is an active connection to a server.
bool Connected() const { return client_.get(); }
Expand Down Expand Up @@ -232,6 +237,7 @@ class BlazeServer final {
const ServerProcessInfo &ProcessInfo() const { return process_info_; }

private:
std::optional<LockHandle> install_base_lock_;
std::optional<LockHandle> output_base_lock_;

enum CancelThreadAction { NOTHING, JOIN, CANCEL, COMMAND_ID_RECEIVED };
Expand All @@ -250,7 +256,7 @@ class BlazeServer final {
// actions from.
std::unique_ptr<blaze_util::IPipe> pipe_;

void ReleaseLock();
void ReleaseLocks();
bool TryConnect(CommandServer::Stub *client);
void CancelThread();
void SendAction(CancelThreadAction action);
Expand All @@ -262,6 +268,8 @@ class BlazeServer final {
const bool block_for_lock_;
const bool quiet_;
const bool preemptible_;
const bool lock_install_base_;
const blaze_util::Path install_base_;
const blaze_util::Path output_base_;
};

Expand All @@ -274,38 +282,73 @@ static BlazeServer *blaze_server;
// _exit(2) (attributed with ATTRIBUTE_NORETURN) meaning we have to delete the
// objects before those.

DurationMillis BlazeServer::AcquireLock() {
DurationMillis BlazeServer::AcquireLocks() {
DurationMillis wait_time;

if (lock_install_base_) {
// Take a shared lock on the install base, because two simultaneous commands
// may run against the same install base. The lock file is a sibling of the
// install base directory to make it possible to synchronize concurrent
// creation through an atomic rename. An exclusive lock is only required to
// run automatic garbage collection of stale install bases.
// Always block waiting for the lock: in the event that the install base is
// currently being garbage collected, we want to recreate it.
if (install_base_lock_.has_value()) {
BAZEL_DIE(blaze_exit_code::INTERNAL_ERROR)
<< "AcquireLocks() called but the install base lock is already held.";
}
blaze_util::Path install_base_parent = install_base_.GetParent();
blaze_util::MakeDirectories(install_base_parent, 0777);
auto install_base_result = blaze::AcquireLock(
"install base",
install_base_parent.GetRelative(install_base_.GetBaseName() + ".lock"),
LockMode::kShared, batch_, /* block= */ true);
install_base_lock_ = install_base_result.first;
wait_time += install_base_result.second;
}

// Take an exclusive lock on the output base, because two simultaneous
// commands may not run against the same output base.
if (output_base_lock_.has_value()) {
BAZEL_DIE(blaze_exit_code::INTERNAL_ERROR)
<< "AcquireLock() called but the lock is already held.";
<< "AcquireLocks() called but the output base lock is already held.";
}
// Take an exclusive lock on the output base, because two simultaneous
// commands may not run against the same output base. Note that this lock will
// be released by ReleaseLock() once the server is running, as it can handle
// concurrent clients on its own.
auto lock_and_time =
auto output_base_result =
blaze::AcquireLock("output base", output_base_.GetRelative("lock"),
LockMode::kExclusive, batch_, block_for_lock_);
output_base_lock_ = lock_and_time.first;
return lock_and_time.second;
output_base_lock_ = output_base_result.first;
wait_time += output_base_result.second;

return wait_time;
}

void BlazeServer::ReleaseLock() {
void BlazeServer::ReleaseLocks() {
if (!output_base_lock_.has_value()) {
BAZEL_DIE(blaze_exit_code::INTERNAL_ERROR)
<< "ReleaseLock() called without a lock to release.";
<< "ReleaseLocks() called but the output base lock has already been "
"released.";
}
blaze::ReleaseLock(*output_base_lock_);
output_base_lock_ = std::nullopt;

if (lock_install_base_) {
if (!install_base_lock_.has_value()) {
BAZEL_DIE(blaze_exit_code::INTERNAL_ERROR)
<< "ReleaseLocks() called but the install base lock has already been "
"released.";
}
blaze::ReleaseLock(*install_base_lock_);
install_base_lock_ = std::nullopt;
}
}

////////////////////////////////////////////////////////////////////////
// Logic

static map<string, EnvVarValue> PrepareEnvironmentForJvm();

// Escapes colons by replacing them with '_C' and underscores by replacing them
// with '_U'. E.g. "name:foo_bar" becomes "name_Cfoo_Ubar"
// Escapes colons by replacing them with '_C' and underscores by replacing
// them with '_U'. E.g. "name:foo_bar" becomes "name_Cfoo_Ubar"
static string EscapeForOptionSource(const string &input) {
string result = input;
blaze_util::Replace("_", "_U", &result);
Expand All @@ -330,9 +373,9 @@ static vector<string> GetServerExeArgs(const blaze_util::Path &jvm_path,
startup_options.AddJVMArgumentPrefix(jvm_path.GetParent().GetParent(),
&result);

// com.google.devtools.build.lib.unsafe.StringUnsafe uses reflection to access
// private fields in java.lang.String. The Bazel server requires Java 11, so
// this option is known to be supported.
// com.google.devtools.build.lib.unsafe.StringUnsafe uses reflection to
// access private fields in java.lang.String. The Bazel server requires Java
// 11, so this option is known to be supported.
result.push_back("--add-opens=java.base/java.lang=ALL-UNNAMED");

result.push_back("-Xverify:none");
Expand Down Expand Up @@ -371,8 +414,8 @@ static vector<string> GetServerExeArgs(const blaze_util::Path &jvm_path,
}
result.push_back(java_library_path.str());

// TODO: Investigate whether this still has any effect. File name encoding is
// governed by sun.jnu.encoding in JDKs with JEP 400, which can't be
// TODO: Investigate whether this still has any effect. File name encoding
// is governed by sun.jnu.encoding in JDKs with JEP 400, which can't be
// influenced by setting a property.
result.push_back("-Dfile.encoding=ISO-8859-1");
// Force into the root locale to ensure consistent behavior of string
Expand All @@ -397,11 +440,11 @@ static vector<string> GetServerExeArgs(const blaze_util::Path &jvm_path,
&result);

// JVM arguments are complete. Now pass in Blaze startup options.
// Note that we always use the --flag=ARG form (instead of the --flag ARG one)
// so that BlazeRuntime#splitStartupOptions has an easy job.
// Note that we always use the --flag=ARG form (instead of the --flag ARG
// one) so that BlazeRuntime#splitStartupOptions has an easy job.

// TODO(b/152047869): Test that whatever the list constructed after this line
// is actually a list of parseable startup options.
// TODO(b/152047869): Test that whatever the list constructed after this
// line is actually a list of parseable startup options.
if (!startup_options.batch) {
result.push_back("--max_idle_secs=" +
blaze_util::ToString(startup_options.max_idle_secs));
Expand Down Expand Up @@ -480,11 +523,11 @@ static vector<string> GetServerExeArgs(const blaze_util::Path &jvm_path,
result.push_back("--nowindows_enable_symlinks");
}
// We use this syntax so that the logic in AreStartupOptionsDifferent() that
// decides whether the server needs killing is simpler. This is parsed by the
// Java code where --noclient_debug and --client_debug=false are equivalent.
// Note that --client_debug false (separated by space) won't work either,
// because the logic in AreStartupOptionsDifferent() assumes that every
// argument is in the --arg=value form.
// decides whether the server needs killing is simpler. This is parsed by
// the Java code where --noclient_debug and --client_debug=false are
// equivalent. Note that --client_debug false (separated by space) won't
// work either, because the logic in AreStartupOptionsDifferent() assumes
// that every argument is in the --arg=value form.
if (startup_options.client_debug) {
result.push_back("--client_debug=true");
} else {
Expand Down Expand Up @@ -653,7 +696,8 @@ static void RunBatchMode(
const OptionProcessor &option_processor,
const StartupOptions &startup_options, LoggingInfo *logging_info,
const std::optional<DurationMillis> extract_data_duration,
DurationMillis command_wait_duration, BlazeServer *server) {
const std::optional<DurationMillis> command_wait_duration,
BlazeServer *server) {
if (server->Connected()) {
server->KillRunningServer();
}
Expand Down Expand Up @@ -1054,8 +1098,8 @@ static ATTRIBUTE_NORETURN void RunClientServerMode(
const string &workspace, const OptionProcessor &option_processor,
const StartupOptions &startup_options, LoggingInfo *logging_info,
const std::optional<DurationMillis> extract_data_duration,
DurationMillis command_wait_duration, BlazeServer *server,
const string &build_label) {
const std::optional<DurationMillis> command_wait_duration,
BlazeServer *server, const string &build_label) {
while (true) {
if (!server->Connected()) {
StartServerAndConnect(server_exe, server_exe_args, server_dir,
Expand Down Expand Up @@ -1370,9 +1414,12 @@ static void RunLauncher(const string &self_path,
const string &workspace, LoggingInfo *logging_info) {
blaze_server = new BlazeServer(startup_options);

const DurationMillis command_wait_duration = blaze_server->AcquireLock();
BAZEL_LOG(INFO) << "Acquired the client lock, waited "
<< command_wait_duration.millis << " milliseconds";
const std::optional<DurationMillis> command_wait_duration =
blaze_server->AcquireLocks();
const uint64_t wait_ms =
command_wait_duration.has_value() ? command_wait_duration->millis : 0;
BAZEL_LOG(INFO) << "Acquired the client lock, waited " << wait_ms
<< " milliseconds";

WarnFilesystemType(startup_options.output_base);

Expand Down Expand Up @@ -1549,6 +1596,8 @@ BlazeServer::BlazeServer(const StartupOptions &startup_options)
block_for_lock_(startup_options.block_for_lock),
quiet_(startup_options.quiet),
preemptible_(startup_options.preemptible),
lock_install_base_(startup_options.lock_install_base),
install_base_(startup_options.install_base),
output_base_(startup_options.output_base) {
pipe_.reset(blaze_util::CreatePipe());
if (!pipe_) {
Expand Down Expand Up @@ -1840,12 +1889,13 @@ unsigned int BlazeServer::Communicate(
std::unique_ptr<grpc::ClientReader<command_server::RunResponse>> reader(
client_->Run(context.get(), request));

// Release the client lock because the gRPC server handles concurrent clients
// just fine. Note that this may result in two "waiting for other client"
// messages (one during server startup and one emitted by the server).
BAZEL_LOG(INFO) << "Releasing the client lock, as the server can manage "
"concurrent clients on its own.";
ReleaseLock();
// Release the client-side locks, as the server may outlive the client and
// must implement its own locking of the install and output bases.
// This may result in two "waiting for lock" messages, one emitted by client
// during server startup, and another emitted by the server. This is harmless.
BAZEL_LOG(INFO)
<< "Released the client-side locks on the install and output bases";
ReleaseLocks();

std::thread cancel_thread(&BlazeServer::CancelThread, this);
bool command_id_set = false;
Expand Down
9 changes: 7 additions & 2 deletions src/main/cpp/blaze_util.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,15 +19,15 @@
#ifndef BAZEL_SRC_MAIN_CPP_BLAZE_UTIL_H_
#define BAZEL_SRC_MAIN_CPP_BLAZE_UTIL_H_

#include <stdint.h>
#include <sys/types.h>

#include <map>
#include <sstream>
#include <string>
#include <vector>

#include "src/main/cpp/util/logging.h"
#include "src/main/cpp/util/path.h"
#include "src/main/cpp/util/path_platform.h"

namespace blaze {

Expand Down Expand Up @@ -140,6 +140,11 @@ struct DurationMillis {
DurationMillis(const uint64_t start, const uint64_t end)
: millis(ComputeDuration(start, end)) {}

DurationMillis& operator+=(DurationMillis& other) {
millis += other.millis;
return *this;
}

private:
static uint64_t ComputeDuration(const uint64_t start, const uint64_t end) {
if (end < start) {
Expand Down
4 changes: 3 additions & 1 deletion src/main/cpp/startup_options.cc
Original file line number Diff line number Diff line change
Expand Up @@ -64,8 +64,10 @@ void StartupOptions::OverrideOptionSourcesKey(const std::string &flag_name,
option_sources_key_override_[flag_name] = new_name;
}

StartupOptions::StartupOptions(const string &product_name)
StartupOptions::StartupOptions(const string &product_name,
bool lock_install_base)
: product_name(product_name),
lock_install_base(lock_install_base),
ignore_all_rc_files(false),
block_for_lock(true),
host_jvm_debug(false),
Expand Down
6 changes: 5 additions & 1 deletion src/main/cpp/startup_options.h
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,10 @@ class StartupOptions {
// See UpdateConfiguration().
blaze_util::Path install_base;

// Whether the install base should be locked before use.
// This is always true for Bazel, but overridden for Blaze at Google.
bool lock_install_base;

// Override more finegrained rc file flags and ignore them all.
bool ignore_all_rc_files;

Expand Down Expand Up @@ -291,7 +295,7 @@ class StartupOptions {
// Constructor for subclasses only so that site-specific extensions of this
// class can override the product name. The product_name must be capitalized,
// as in "Bazel".
StartupOptions(const std::string &product_name);
StartupOptions(const std::string &product_name, bool lock_install_base);

// Returns the default output root location.
virtual blaze_util::Path GetDefaultOutputRoot() const = 0;
Expand Down
4 changes: 4 additions & 0 deletions src/test/cpp/bazel_startup_options_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -347,4 +347,8 @@ TEST_F(BazelStartupOptionsTest, FinalValueOfIgnoreIsUsedForWarning) {
"");
}

TEST_F(BazelStartupOptionsTest, LockInstallBase) {
EXPECT_TRUE(startup_options_->lock_install_base);
}

} // namespace blaze
3 changes: 2 additions & 1 deletion src/test/cpp/startup_options_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,8 @@ namespace blaze {
// Minimal StartupOptions class for testing.
class FakeStartupOptions : public StartupOptions {
public:
FakeStartupOptions() : StartupOptions("Bazel") {}
FakeStartupOptions()
: StartupOptions("Bazel", /* lock_install_base= */ true) {}
blaze_exit_code::ExitCode ProcessArgExtra(
const char *arg, const char *next_arg, const std::string &rcfile,
const char **value, bool *is_processed, std::string *error) override {
Expand Down
Loading

0 comments on commit 478209f

Please sign in to comment.