Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initialize Tablet with super_read_only mode #12206

Merged
merged 74 commits into from
Mar 14, 2023

Conversation

rsajwani
Copy link
Contributor

@rsajwani rsajwani commented Feb 1, 2023

Description

As of today MySQL server which is controlled by VTTablet, starts in read-only mode. This is because Vitess controls the MySQL cluster topology and chooses one tablet to be the primary for each shard, and only that one MySQL server should be read-write. All other servers in that shard are replicas, and should not be taking any writes. However this doesn't prevent users like root and vt_dba who have SUPER privileges from writing to the database on replicas and causing errant GTIDs. We want to leverage GLOBAL super_read_only configuration in order to protect against these scenarios. This will make sure that apart from primary no other component or offline system can mutate DB resulting in errant GTIDs that are then lying in wait to cause later failures, as you can see in #9312 and #10094

Furthermore, it is possible that due to program bugs or unexpected failures during PRS/ERS, we end up with a replica that is not in read-only mode. With super-read-only change we will make sure that we cover those cases as well.

Tablet schema refactoring

This change is built on top of #11520, where instead of using withddl we use a declarative approach to initializing vttablet's sidecar schema. Using a declarative approach helps us condense all our schema changes to one place, which allows us to make changes to the code to apply super-read-only to the DB.
The previous approach of using withddl meant that schema changes were scattered all across the code, which made it extremely difficult to do the super-read-only change.

Details

This PR usese #11520 as a building block. I am listing major changes below to help you review the PR.

  • mysql57.cnf & mysql80.cnf contains super-read-only, to ensure all Mysql instances comes up in super-read-only mode.
  • During init_db.sql we switch super-read-only OFF temporarily in order to perform some mutations like creating necessary users and permissions.
  • Since VTTestServer doesn't need super-read-only, therefore we need to come up with a separate SQL initialization (init_testserver_db.sql) for it.
  • Adding # {{custom_sql}} inside init_db.sql file. This is because in our tests we modify this file and add custom SQL like (change password of DBA user etc). Now that we expect setting super-read-only ON to execute at the end, we need these custom sqls to be added before that last line. See example: go/test/endtoend/backup/vtbackup/main_test.go
  • During PRS & ERS, we make sure that only primary turns off super-read-only and rest of replica are strictly in super-read-only mode.
  • Adding 'SuperReadOnly' property, which is available through VtctldClient GetFullStatus.

For Reference

Some related work items done in the past

#11706
#10094
#9312
#10448

Unresolved Issues during code changes:

I have filed few issues which I will need to resolve once I check-in this PR.

Related Issue(s)

Fixes #10363
Fixes #12180
Fixes #12140

Test Results

vtctld command to show super_read_only state

 vtctldclient --server localhost:15999 GetFullStatus zone1-0000000101
{
  "server_id": 493124066,
  "server_uuid": "ed6d02e6-a35b-11ed-b467-9cf2e7b0ba93",
  "replication_status": null,
  "primary_status": {
    "position": "MySQL56/ed6d02e6-a35b-11ed-b467-9cf2e7b0ba93:1-26",
    "file_position": "FilePos/vt-0000000101-bin.000001:14540"
  },
  "gtid_purged": "MySQL56/",
  "version": "8.0.31",
  "version_comment": "Homebrew",
  "read_only": false,
  "gtid_mode": "ON",
  "binlog_format": "ROW",
  "binlog_row_image": "FULL",
  "log_bin_enabled": true,
  "log_replica_updates": true,
  "semi_sync_primary_enabled": true,
  "semi_sync_replica_enabled": true,
  "semi_sync_primary_status": true,
  "semi_sync_replica_status": false,
  "semi_sync_primary_clients": 1,
  "semi_sync_primary_timeout": "1000000000000000000",
  "semi_sync_wait_for_replica_count": 1,
  "super_read_only": false
}

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Copy link
Collaborator

@systay systay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in vtexplain are hard to understand, and seem unrelated to the rest of the changes.

@@ -112,7 +119,11 @@ func TestMain(m *testing.M) {
tablet.VttabletProcess.ExtraArgs = commonTabletArg
tablet.VttabletProcess.SupportsBackup = true

tablet.MysqlctlProcess = *cluster.MysqlCtlProcessInstance(tablet.TabletUID, tablet.MySQLPort, localCluster.TmpDirectory)
mysqlctlProcess, err := cluster.MysqlCtlProcessInstance(tablet.TabletUID, tablet.MySQLPort, localCluster.TmpDirectory)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is due to refactoring done in go/test/endtoend/cluster/mysqlctl_process.go

"vt_app": [
"password"
"VtAppPass"
Copy link
Contributor Author

@rsajwani rsajwani Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing these passwords so I can reuse GetPasswordUpdateSQL. This will make tests consistent as well ...

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there 😄
Approving provisionally, the remaining changes are pretty small, but should be addressed before we merge.

config/mycnf/default.cnf Show resolved Hide resolved
go/test/endtoend/cluster/mysqlctl_process.go Show resolved Hide resolved
config/init_db.sql Outdated Show resolved Hide resolved
go/vt/mysqlctl/builtinbackupengine.go Show resolved Hide resolved
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on this, @rsajwani ! I only had minor nits/comments/suggestions. We can discuss those and quickly get this thing merged already! 🙂

config/init_db.sql Outdated Show resolved Hide resolved
config/mycnf/default.cnf Show resolved Hide resolved
go/vt/dbconfigs/dbconfigs.go Outdated Show resolved Hide resolved
config/init_db.sql Outdated Show resolved Hide resolved
doc/releasenotes/17_0_0_summary.md Outdated Show resolved Hide resolved
go/test/endtoend/backup/vtbackup/backup_only_test.go Outdated Show resolved Hide resolved
go/test/endtoend/vault/vault_test.go Outdated Show resolved Hide resolved
Signed-off-by: Rameez Sajwani <[email protected]>
Signed-off-by: Rameez Sajwani <[email protected]>
Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! ❤️ I only had one insignificant nit. Up to you if you think it's worth changing.

@@ -57,6 +57,8 @@ func TestTabletInitialBackup(t *testing.T) {
// - list the backups, remove them
defer cluster.PanicHandler(t)

waitForReplicationToCatchup([]cluster.Vttablet{*replica1, *replica2})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder that if vtorc is used in these tests, which tablets are replicas is no longer fixed. That's a bigger issue though, not related to this PR.

Comment on lines +592 to +593
var timeoutDuration = time.Duration(5 * len(sqls))
ctx, cancel := context.WithTimeout(context.Background(), timeoutDuration*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking nit, but IMO it's more natural to do this all in the duration definition:

var timeoutDuration = time.Duration(5 * time.Second * len(sqls))
ctx, cancel := context.WithTimeout(context.Background(), timeoutDuration)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
6 participants