Support customizing the rebalance threshold #422

cenkalti · 2023-03-14T04:33:57Z

Lines 420 to 424 in 838586a

    
           // Ignore if node is above threshold (25%) and has enough keys. 
        
           var threshold = n.bucket.tx.db.pageSize / 4 
        
           if n.size() > threshold && len(n.inodes) > n.minKeys() { 
        
           	return 
        
           }

How this number was chosen?

What happens if we make it configurable and increase or decrease it?

ahrtr · 2023-03-15T07:53:19Z

The intention might be setting it as 50% * DefaultFillPercent (50% * 0.5 = 25%)? cc @benbjohnson

ahrtr · 2023-04-26T00:08:47Z

I think we can add a field FillPercent (defaults to 0.5) into both Options and DB. It's the default value for Bucket.FillPercent, users can also set a different value after opening a bucket if they want.

// FillPercent sets the threshold for filling nodes when they split. By default,
// the bucket will fill to 50%, but it can be useful to increase this
// amount if you know that your write workloads are mostly append-only.
FillPercent float64

The rebalance threshold should be 50% * FillPercent,

bbolt/node.go

Line 375 in bd7d6e9

var threshold = n.bucket.tx.db.pageSize / 4

cenkalti · 2023-04-26T06:20:22Z

Can we make rebalance threshold configurable that defaults to 25% (to keep it backwards compatible)?

It can be set on the Bucket similar to FillPercent field.

type Bucket struct {

	// Sets the threshold for filling nodes when they split. By default,
	// the bucket will fill to 50% but it can be useful to increase this
	// amount if you know that your write workloads are mostly append-only.
	//
	// This is non-persisted across transactions so it must be set in every Tx.
	FillPercent float64

	// new field
	RebalancePercent float64
}

I think this is simpler change to make.

ahrtr · 2023-04-26T08:13:27Z

Can we make rebalance threshold configurable that defaults to 25% (to keep it backwards compatible)?

More flexibility doesn't mean better. I think we would better not to expose the RebalancePercent parameter. Usually a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the d-key minimum for non-root nodes, and the d is half the maximum number of keys.

Similarly, I think it makes sense to set the rebalancePercent to 0.5 * FillPercent.

cenkalti · 2023-05-08T22:53:47Z

I am trying to maximize page utilization for etcd's use case.

Compaction deletes revisions from "keys" bucket but ~~pages~~ nodes do not get rebalanced until data amount in the ~~page~~ node goes below 25%.

Compaction deletes revisions up to a given revision number but it skips the revision if it is the only revision of a key.

Because the revision numbers are incremental, there is no point of having free space in pages other than the latest page at the end of b+ tree.

If a page is empty, it gets removed from the tree and added to the freelist. However, these half-filled pages could stay forever and waste disk and memory space.

If my understanding is correct and we can solve this issue we can eliminate the need to defrag completely.

cenkalti · 2023-05-11T22:40:22Z

Write keys to etcd, fill roughly 3.5 pages. For the KV size below, one pages takes up to ~60 items.

import etcd3

client = etcd3.client()

key = "mykey"
val = "." * 10

# put few unique keys
for i in range(30):
    client.put(key=key+str(i), value=val)

# overwrite same key
for i in range(180):
    client.put(key=key, value=val)

After writing data, pages are:

% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        leaf       62
3        leaf       61
4        leaf       63
5        leaf       24
6        branch     4
7        leaf       10
8        free
9        free
10       free

Pages of “key” bucket:

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=7>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11>
Txn ID:     12
Checksum:   6e26f9e5b4162263

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=10>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=11>
Txn ID:     11
Checksum:   03270b449e68cf59

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 7
Page ID:    7
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 368
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=6,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID:    6
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 4

00000000000000025f0000000000000000: <pgid=2>
00000000000000405f0000000000000000: <pgid=4>
000000000000007f5f0000000000000000: <pgid=3>
00000000000000bc5f0000000000000000: <pgid=5>

Pages 2, 4, 3, 5 has 62, 63, 61, 24 items. The first 30 keys are unique, the rest (180 keys) are revisions of the same key.

Get latest revision:

% bin/etcdctl get '/' -w json | jq .header.revision
211

Compact to the middle of the 3rd page:

% etcdctl compact 150
compacted revision 150

After compaction “key” bucket contains these pages:

% bbolt pages ~/projects/etcd/default.etcd/member/snap/db
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0
1        meta       0
2        free
3        free
4        free
5        leaf       24
6        leaf       10
7        leaf       30
8        free
9        leaf       38
10       branch     3
11       free
% bbolt page ~/projects/etcd/default.etcd/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=11>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=12>
Txn ID:     18
Checksum:   13f6eecf472396c6

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=6>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=12>
Txn ID:     19
Checksum:   1ef4a4bd5af5d6fa

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 6
Page ID:    6
Page Type:  leaf
Total Size: 4096 bytes
Overflow pages: 0
Consumed Size: 369
Item Count: 10

"alarm": <pgid=0,seq=0>
"auth": <pgid=0,seq=0>
"authRoles": <pgid=0,seq=0>
"authUsers": <pgid=0,seq=0>
"cluster": <pgid=0,seq=0>
"key": <pgid=10,seq=0>
"lease": <pgid=0,seq=0>
"members": <pgid=0,seq=0>
"members_removed": <pgid=0,seq=0>
"meta": <pgid=0,seq=0>

% bbolt page ~/projects/etcd/default.etcd/member/snap/db 10
Page ID:    10
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 3

00000000000000025f0000000000000000: <pgid=7>
00000000000000965f0000000000000000: <pgid=9>
00000000000000bc5f0000000000000000: <pgid=5>

Pages 7, 9, 5 has 30, 38, 24 items.

1st page	2nd page	3rd page	4th page
62	63	61	24
30	deleted	38	24

The 30 unique keys in the 1st page will never get deleted if they don’t get a new revision because etcd never deletes the latest revision of the key.

Since the revision number are incremental, the items in the first page will take up full page space forever unless that page's utilization goes below 25% which is not a configurable value.

ahrtr · 2023-05-20T04:46:33Z

The compacted/removed key (revisions in etcd) may be scattered everywhere in the db pages, but the key point is etcd always adds new k/v at the end of the last page. So for etcd, we should set a high value for both FillPercent (e.g 1.0) and RebalancePercent (e.g 0.9); In this way, we can eventually get rid of defragmentation in etcd 5.0+. Note we still need defragmentation in the transition stage, because some middle pages will never be rebalanced if there is no deletion in them.

The downside is there will be more rebalance operation and accordingly more page writing & syncing in each etcd compaction operation. So we need to evaluate the performance impact.

The other concern is that there are other buckets (e.g. lease, auth*) which are not append-only. Since K8s doesn't use auth, so it isn't a problem. I am not sure the exact usage of lease in K8s, e.g. how often it's created & deleted. Please also get this clarified.

Note: It only makes sense to set a high value for FillPercent and RebalancePercent for append (specifically, only add K/V at the tail) only use case, e.g. etcd.

ahrtr · 2023-05-20T05:22:28Z

Note it still makes sense to set the rebalancePercent to 0.5 * FillPercent by default just as my previous comment #422 (comment) mentioned. We just need to support customizing it so that users (e.g etcd) can set a different value based on their scenarios.

cenkalti · 2023-05-31T19:39:58Z

After #518 is merged, users are able to set rebalance percent to values between 0.0 and 0.5.

However, it's still not possible to set it to values larger than 0.5.

Can we consider again introducing a new field that is set to 0.25 by default?

type Bucket struct {
	FillPercent float64

	// new field
	RebalancePercent float64
}

ahrtr · 2023-05-31T21:41:25Z

If you read my comment above and the new title of this issue "Support customizing the rebalance threshold", you will get that I am not against the proposal. But please think about the concern in above comment.

cenkalti · 2023-06-01T16:58:23Z

Cool. I'll try to come up with a way to evaluate the performance impact.

ahrtr · 2024-05-09T08:46:45Z

@cenkalti are you still working on this? Please feel free to let me know if you don't have bandwidth, then I am happy to take over.

cenkalti · 2024-05-09T13:27:53Z

No, I am not working on this anymore.

ahrtr · 2024-05-20T16:08:25Z

It seems that it isn't that simple. Setting a bigger RebalancePercent can trigger rebalance more easily, and accordingly a node may be merged with next or prev node more easily. But the merged node may be split again if we don't set a reasonable RebalancePercent and FillPercent,

bbolt/node.go

Lines 230 to 246 in 7eb39a6

    
           // Ignore the split if the page doesn't have at least enough nodes for 
        
           // two pages or if the nodes can fit in a single page. 
        
           if len(n.inodes) <= (common.MinKeysPerPage*2) || n.sizeLessThan(pageSize) { 
        
           	return n, nil 
        
           } 
        
           // Determine the threshold before starting a new node. 
        
           var fillPercent = n.bucket.FillPercent 
        
           if fillPercent < minFillPercent { 
        
           	fillPercent = minFillPercent 
        
           } else if fillPercent > maxFillPercent { 
        
           	fillPercent = maxFillPercent 
        
           } 
        
           threshold := int(float64(pageSize) * fillPercent) 
        
           // Determine split position and sizes of the two pages. 
        
           splitIndex, _ := n.splitIndex(threshold)

The default values for FillPercent and RebalancePercent are is 50% and 25% respectively, so a merged node/page has at most 75% pageSize, so a merged node will never be split again in current transaction by default. But If we set a bigger RebalancePercent, it may cause a merged node to be split again.

tjungblu · 2024-05-21T14:23:15Z

what do you guys want to optimize for? tighter packing of pages? less memory allocations?

ahrtr · 2024-05-21T14:26:28Z

eventually get rid of defragmentation

The goal was to eventually get rid of defragmentation, see #422 (comment).

tjungblu · 2024-05-21T14:33:43Z

Ah I see, so we have different thresholds and then we wanted to configure them and now you've found a new parameter that needs tweaking alongside it :)

cenkalti mentioned this issue Apr 11, 2023

Script for defragmentation etcd-io/etcd#15477

Closed

This was referenced Apr 26, 2023

server: update defrag logic to use bolt.Compact() etcd-io/etcd#15470

Open

backend: add experimental defrag txn limit flag etcd-io/etcd#15511

Draft

cenkalti added the type/question label May 9, 2023

cenkalti changed the title ~~Question: Why rebalance page threshold is 25%?~~ Why rebalance page threshold is 25%? May 9, 2023

cenkalti added the area/bbolt-core label May 19, 2023

cenkalti self-assigned this May 19, 2023

ahrtr added the type/feature label May 20, 2023

ahrtr changed the title ~~Why rebalance page threshold is 25%?~~ Support customizing the rebalance threshold May 20, 2023

ahrtr removed the type/question label May 20, 2023

cenkalti mentioned this issue May 29, 2023

Adjust rebalance threshold with Bucket.FillPercent #518

Merged

ahrtr mentioned this issue Jun 9, 2023

Restoring etcd without breaking API guarantees, aka etcd restore for Kubernetes etcd-io/etcd#16028

Closed

5 tasks

This was referenced Jul 21, 2023

Documentation: add roadmap etcd-io/etcd#16279

Merged

Introduce grpc health check in etcd client etcd-io/etcd#16276

Open

ahrtr mentioned this issue Aug 17, 2023

feature: add new compactor based revision count etcd-io/etcd#16426

Open

github-actions bot added the stale label Apr 17, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 9, 2024

ahrtr reopened this May 9, 2024

ahrtr removed the stale label May 9, 2024

cenkalti removed their assignment May 9, 2024

ahrtr self-assigned this May 9, 2024

github-actions bot added the stale label Aug 20, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 10, 2024

ahrtr reopened this Sep 10, 2024

github-actions bot removed the stale label Sep 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support customizing the rebalance threshold #422

Support customizing the rebalance threshold #422

cenkalti commented Mar 14, 2023

ahrtr commented Mar 15, 2023

ahrtr commented Apr 26, 2023

cenkalti commented Apr 26, 2023

ahrtr commented Apr 26, 2023

cenkalti commented May 8, 2023 •

edited

Loading

cenkalti commented May 11, 2023

ahrtr commented May 20, 2023

ahrtr commented May 20, 2023

cenkalti commented May 31, 2023

ahrtr commented May 31, 2023

cenkalti commented Jun 1, 2023

ahrtr commented May 9, 2024

cenkalti commented May 9, 2024

ahrtr commented May 20, 2024

tjungblu commented May 21, 2024

ahrtr commented May 21, 2024

tjungblu commented May 21, 2024

Support customizing the rebalance threshold #422

Support customizing the rebalance threshold #422

Comments

cenkalti commented Mar 14, 2023

ahrtr commented Mar 15, 2023

ahrtr commented Apr 26, 2023

cenkalti commented Apr 26, 2023

ahrtr commented Apr 26, 2023

cenkalti commented May 8, 2023 • edited Loading

cenkalti commented May 11, 2023

ahrtr commented May 20, 2023

ahrtr commented May 20, 2023

cenkalti commented May 31, 2023

ahrtr commented May 31, 2023

cenkalti commented Jun 1, 2023

ahrtr commented May 9, 2024

cenkalti commented May 9, 2024

ahrtr commented May 20, 2024

tjungblu commented May 21, 2024

ahrtr commented May 21, 2024

tjungblu commented May 21, 2024

cenkalti commented May 8, 2023 •

edited

Loading