support deterministic failover schedule for placement rules #37251

Open
Tracked by #18030
morgo opened this issue Aug 21, 2022 · 18 comments
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@morgo
Contributor

morgo commented Aug 21, 2022

Enhancement

My deployment scenario involves two "primary" regions in AWS:

  • us-east-1
  • us-west-2

I have been experimenting with placement rules with a third region: us-east-2. This region should only be used for quorum, as there are no application servers hosted in it. So I define a placement policy as follows:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2";

Because the pd-server supports a weight concept, I can make pd-leader failover deterministic: when us-east-1 fails, us-west-2 becomes the pd-leader. However, there is no deterministic behavior for where the leaders of regions governed by defaultpolicy will go. They will likely balance across us-west-2 and us-east-2, which is not the desired behavior.
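
(For reference, the pd-leader part can be expressed with pd-ctl's member leader_priority command, where a higher value makes that PD member preferred as leader; the member names below are just placeholders:)

pd-ctl member leader_priority pd-us-east-1 3
pd-ctl member leader_priority pd-us-west-2 2
pd-ctl member leader_priority pd-us-east-2 1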

Ideally, I want the leader priority to follow the order of the region list. This means that us-west-2 would become the new leader for all regions. Perhaps this could be conveyed with syntax like:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2" SCHEDULE="DETERMINISTIC";

In fact, if this worked deterministically for both leader-scheduling and follower-scheduling, an extension of this is that I could create the following:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2,us-west-1" SCHEDULE="DETERMINISTIC";

Since the default number of followers is 2, it would mean that us-west-1 won't get regions scheduled to it unless one of the other regions fails, which suits me perfectly. It would also mean that commit latency is only bad initially, when failover to us-west-2 first occurs. Over time, as regions are migrated to us-west-1, performance should be roughly restored, since quorum can then be achieved on the west coast.

This is a really common deployment pattern in the continental USA, so I'm hoping it can be implemented :-)

@morgo morgo added the type/enhancement The issue or PR belongs to an enhancement. label Aug 21, 2022
@morgo
Contributor Author

morgo commented Sep 1, 2022

An alternative to this proposal is to use the leader-weight property that pd can set on stores, but it currently doesn't work as expected:

  1. Assume I have a placement group of PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2".
  2. I set the leader weight to zero on all stores in us-east-2 (see the example after this list).
  3. When us-east-1 fails, leaders are randomly scattered across us-west-2 and us-east-2.
  4. The leader-balance scheduler does not apply until the cluster is healthy again, so it cannot be used to steer leaders during the failover.
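
For reference, step 2 uses pd-ctl's store weight command, which takes the leader weight followed by the region weight; the store IDs below are placeholders for the us-east-2 stores:

pd-ctl store weight 4 0 1
pd-ctl store weight 5 0 1
pd-ctl store weight 6 0 1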

The reason is that in (3) the new leader is chosen by an election within the tikv raft group, which has no knowledge of (or concern for) leader-weight. What I would like to suggest is that if a heartbeat is received from a leader on a store with zero leader-weight, a forced leader transfer should occur.

I took a look at a quick hack to do this, but it didn't work :-) I'm hoping someone who knows pd better can help here.

@nolouch
Member

nolouch commented Sep 2, 2022

Hi @morgo, if you set the leader weight to zero, the score calculation becomes count/weight, with the weight clamped to a minimum of 1e-6. The balance-leader scheduler will therefore transfer leaders from us-east-2 to us-west-2 in step 3, but it may not be able to transfer out all of the leaders, because the balance-leader scheduler's goal is only to balance the scores.

An alternative method is to use label-property. Rather than relying on the leader score, it will always transfer leaders out of the reject-leader stores to other stores. The commands:

pd-ctl scheduler add label-scheduler // activate the label-scheduler
pd-ctl config set label-property reject-leader region us-east-2 // on failover, all tikv leaders will be transferred out of us-east-2 to the other regions (e.g. us-west-2)

you can check the implementation in: https://github.com/tikv/pd/blob/master/server/schedulers/label.go#L117-L124

@morgo
Contributor Author

morgo commented Sep 2, 2022

This is great! Thank you @nolouch

@nolouch
Member

nolouch commented Sep 2, 2022

I tested this scenario with the rule and this scheduler, and found that the label-scheduler does not work as well as we expect. The log:

[2022/09/02 11:45:59.171 +08:00] [DEBUG] [label.go:139] ["fail to create transfer label reject leader operator"] [error="cannot create operator: target leader is not allowed"]

It shows that it tries to create an operator but fails. This is because the placement rule explicitly specifies that this store should place followers, so the error is reasonable.

After I changed the policy, it works.

For failover, the placement policy should change from:

CREATE PLACEMENT POLICY primary_east PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-east-2,us-west-2";

to

CREATE PLACEMENT POLICY primary_east_2 LEADER_CONSTRAINTS="[+region=us-east-1]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}";

The difference between them can be seen in the raw rules in PD.

The raw rules in PD will change from:

 {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_0",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "create_timestamp": 1662089499
  },
  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_1",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "follower",
    "count": 2,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-2",
          "us-west-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
  },

to

  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_0",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "leader",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "version": 1,
    "create_timestamp": 1662089499
  },
  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_1",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "version": 1,
    "create_timestamp": 1662089499
  },
  {
    "group_id": "TiDB_DDL_71",
    "id": "table_rule_71_2",
    "index": 40,
    "start_key": "7480000000000000ff4700000000000000f8",
    "end_key": "7480000000000000ff4800000000000000f8",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-west-1"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ],
    "create_timestamp": 1662092753
  },

@nolouch
Member

nolouch commented Sep 2, 2022

Hi @morgo, an easier way is to just use one policy (it works across the 3 regions), like:

CREATE PLACEMENT POLICY  primary_east_backup_west LEADER_CONSTRAINTS="[+region=us-east-1,-region=us-east-2]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"
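
A policy like this can then be attached to a table or database in the usual way, for example (t1 here is just a placeholder table name):

ALTER TABLE t1 PLACEMENT POLICY=primary_east_backup_west;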

If you want to apply it at the cluster level, I think we can use a raw placement rule, for example:

[
 {
    "group_id": "cluster_rule",
    "id": "cluster_rule_0_primary",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "leader",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-1"
        ]
      },
      {
        "key": "region",
        "op": "notIn",
        "values": [
          "us-east-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_1_us_west_2",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-west-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_2_us_east_2",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "follower",
    "count": 1,
    "label_constraints": [
      {
        "key": "region",
        "op": "in",
        "values": [
          "us-east-2"
        ]
      },
      {
        "key": "engine",
        "op": "notIn",
        "values": [
          "tiflash"
        ]
      }
    ]
  }
]

@morgo
Contributor Author

morgo commented Sep 2, 2022

If you want to apply it at the cluster level, I think we can use a raw placement rule, for example:

I'd prefer to keep it in SQL rules, so it's easier for other users on my team to change them if needed. It's okay though; the only other schema I need to change is mysql. This is actually important because SHOW VARIABLES reads from the mysql.tidb table for the GC variables. Since various client libraries run SHOW VARIABLES LIKE 'x' on a new connection, if this table isn't in the primary region, it is going to cause performance problems.

It can be done with:

mysql -e "ALTER DATABASE mysql PLACEMENT POLICY=defaultpolicy;"
for TABLE in `mysql mysql -BNe "SHOW TABLES"`; do
  mysql mysql -e "ALTER TABLE $TABLE PLACEMENT POLICY=defaultpolicy;"
done;
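
The assignments can then be checked afterwards with something like (assuming the standard SHOW PLACEMENT statement):

SHOW PLACEMENT FOR DATABASE mysql;
SHOW PLACEMENT;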

@nolouch
Member

nolouch commented Sep 2, 2022

Well, do you think a system level is needed for placement rules in SQL? Would it be more friendly for your scenario?
Actually, if we supported the system level, there would be only one rule in PD. But the current approach creates many rules in PD, about 3 raw rules per table, and I worry about the burden of too many rules.
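
For example, a rough way to count how many raw rules the SQL policies have generated (assuming pd-ctl is pointed at the cluster and jq is installed):

pd-ctl config placement-rules show | jq 'length'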

@morgo
Contributor Author

morgo commented Sep 2, 2022

This is essentially this feature request: #29677

There are some strange behaviors that would need to be worked out, but yes: I think a system-level feature has merit.

@nolouch
Member

nolouch commented Sep 9, 2022

Hi @morgo, an easier way is to just use one policy (it works across the 3 regions), like:

CREATE PLACEMENT POLICY  primary_east_backup_west LEADER_CONSTRAINTS="[+region=us-east-1,-region=us-east-2]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"

If you want to apply it at the cluster level, I think we can use a raw placement rule, for example:

@morgo I confirmed that this placement policy cannot achieve the goal of automatic switching. The problem requires the placement rules to distinguish voter from follower: setting a follower raw rule for us-east-2 is what prevents it from becoming leader.
So if you want to use SQL, you still need to use the label scheduler as described in #37251 (comment).

@nolouch
Member

nolouch commented Sep 9, 2022

BTW, do you need to set the placement policy for the metadata? I think metadata access, such as reading schema information, may cause performance problems.

@nolouch
Member

nolouch commented Sep 14, 2022

BTW, do you need to set the placement policy for the metadata? I think metadata access, such as reading schema information, may cause performance problems.

I really suggest using a cluster-level setting with raw placement rules for this scenario now; in my test it hit fewer problems. Use pd-ctl to set rules for the key range from "" to "", like:

// setting rule group  https://docs.pingcap.com/tidb/dev/configure-placement-rules#use-pd-ctl-to-configure-rule-groups
>> pd-ctl config placement-rules rule-group set cluster_rule 2 true
>> cat cluster_rule_group.json
{
    "group_id": "cluster_rule",
    "group_index": 2,
    "group_override": true,
    "rules": [
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_0_primary_leader",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "leader",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-1"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_1_primary_voter",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "voter",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-1"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_3_us_east_2",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "voter",
            "count": 2,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-2"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_2_us_west_2",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "follower",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-west-2"
                    ]
                }
            ]
        }
    ]
}

// apply the rule for the group https://docs.pingcap.com/tidb/dev/configure-placement-rules#use-pd-ctl-to-batch-update-groups-and-rules-in-groups
>>  pd-ctl config placement-rules rule-bundle set cluster_rule --in="cluster_rule_group.json"
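
// optional: read the bundle back to confirm it was applied (assuming the rule-bundle get subcommand is available)
>> pd-ctl config placement-rules rule-bundle get cluster_rule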

@tonyxuqqi

I confirmed that this placement policy cannot achieve the goal of automatic switching. The problem requires the placement rules to distinguish voter from follower: setting a follower raw rule for us-east-2 is what prevents it from becoming leader.

@nolouch Is this follower rule enforced on the PD side or in the TiKV raft protocol?

@morgo
Contributor Author

morgo commented Sep 14, 2022

@nolouch Happy to try with one policy. I'm getting an error with what you pasted above though :(

$ ./bin/pd-ctl config placement-rules rule-bundle set cluster_rule --in="cluster_rule_group.json"
json: cannot unmarshal array into Go value of type struct { GroupID string "json:\"group_id\"" }

Using pd-ctl from v6.2.0.

@nolouch
Member

nolouch commented Sep 14, 2022

@morgo Sorry, I updated the comment in #37251 (comment). You can try again.

@kolbe
Contributor

kolbe commented Sep 14, 2022

To be used in a Kubernetes environment (until pingcap/tidb-operator#4678 is implemented), "region" should be changed to "topology.kubernetes.io/region".
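
For example, the leader rule's region constraint from the cluster-level example above would become something like this (a sketch, assuming the TiKV stores carry the standard Kubernetes topology labels):

  {
    "key": "topology.kubernetes.io/region",
    "op": "in",
    "values": [
      "us-east-1"
    ]
  }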

@SunRunAway
Contributor

@morgo
Since you never read from us-east-2, would it be better to set us-east-2 as a witness, if possible?

@kolbe
Contributor

kolbe commented Sep 16, 2022

@SunRunAway what is "witness"? This is not mentioned anywhere in our documentation.

@SunRunAway
Contributor

@kolbe I'm discussing a developing feature, see tikv/tikv#12876
