Fix disk computation when initializing unassigned shards in desired balance computation #102207

Merged
merged 25 commits into from
Jan 4, 2024

idegtiarenko
Contributor

This change fixes a bug in the node disk usage computation when starting an unassigned shard that already has a desired node assignment. Previously such a shard was assumed to use no space, which is not true when the shard is being restored from a snapshot.
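To make the bug concrete, here is a minimal sketch of how a simulated per-node free-space figure changes when a shard starts initializing. The `SimulatedDiskUsage` class and its methods are hypothetical and are not the Elasticsearch `ClusterInfoSimulator` API; the sketch only assumes that initializing a shard subtracts its expected size from the target node, clamped at zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, heavily simplified per-node disk usage simulator (not the Elasticsearch API). */
public class SimulatedDiskUsage {

    private final Map<String, Long> freeBytesByNode = new HashMap<>();

    public void addNode(String nodeId, long freeBytes) {
        freeBytesByNode.put(nodeId, freeBytes);
    }

    /**
     * Accounts for a shard that starts initializing on nodeId.
     * For a brand-new empty shard the expected size is 0, but for a shard
     * restored from a snapshot it is the size of the data that will be copied in.
     */
    public void initializeShard(String nodeId, long expectedShardSize) {
        Long free = freeBytesByNode.get(nodeId);
        if (free == null) {
            throw new IllegalArgumentException("unknown node: " + nodeId);
        }
        // Clamp at zero so the simulated free space never goes negative.
        freeBytesByNode.put(nodeId, Math.max(0L, free - expectedShardSize));
    }

    public long freeBytes(String nodeId) {
        return freeBytesByNode.getOrDefault(nodeId, 0L);
    }

    public static void main(String[] args) {
        SimulatedDiskUsage sim = new SimulatedDiskUsage();
        sim.addNode("node-1", 1000L);
        sim.addNode("node-2", 1000L);

        // Old behaviour: the restored shard was assumed to use no space.
        sim.initializeShard("node-1", 0L);
        // Fixed behaviour: the restored shard is charged its snapshot size.
        sim.initializeShard("node-2", 300L);

        System.out.println(sim.freeBytes("node-1")); // 1000
        System.out.println(sim.freeBytes("node-2")); // 700
    }
}
```

With the old assumption, `node-1` still looks completely empty even though a restore is about to fill it, so later allocation decisions in the same computation would overcommit that node.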

Related to: #91386

@idegtiarenko idegtiarenko added >bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.12.0 labels Nov 15, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @idegtiarenko, I've created a changelog YAML for you.

# Conflicts:
#	server/src/test/java/org/elasticsearch/cluster/routing/allocation/allocator/DesiredBalanceComputerTests.java
@kingherc
Contributor

@idegtiarenko is this ready for review now?

@arteam arteam self-requested a review December 21, 2023 15:52
@arteam arteam removed their request for review January 2, 2024 09:26
Contributor

@DaveCTurner DaveCTurner left a comment

Makes sense to me, though I think there's a little bit of missing test coverage (see inline comments).

}

/**
* Must be called once all shards that are in the process of initializing have been processed
Contributor

I don't totally follow this comment, although I think I see what's going on. Could you add some more detail about how/why we only use the reservedSpace when setting up the desired-balance computation and not while it's iterating?

@@ -254,6 +257,7 @@ public DesiredBalance compute(
final long computationStartedTime = threadPool.relativeTimeInMillis();
long nextReportTime = computationStartedTime + timeWarningInterval;

clusterInfoSimulator.discardReservedSpace();
Contributor

The DesiredBalanceComputerTests all pass if this line is removed, which seems suspicious. Could we have a test that exercises this?

Comment on lines 228 to 229
final long expectedShardSize = getExpectedShardSize(shardRouting, 0L, routingAllocation);
final var shardToInitialize = unassignedReplicaIterator.initialize(nodeId, null, expectedShardSize, changes);
Contributor

Likewise, reverting this change doesn't make any DesiredBalanceComputerTests fail.

Contributor Author

Okay, this and the change below are no longer necessary. We now rely on getExpectedShardSize rather than on the expected shard size set on the shard itself.
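A minimal sketch of the kind of fallback lookup this relies on, loosely modeled on the `getExpectedShardSize(shardRouting, 0L, routingAllocation)` call in the diff. The `ExpectedShardSize` class, the `Shard` record, and the lookup order (snapshot size for restores, then the size from cluster info, then the caller's default) are assumptions for illustration, not the actual Elasticsearch signature or logic.

```java
import java.util.Map;

/** Hypothetical, simplified stand-in for an expected-shard-size lookup (not the Elasticsearch API). */
public final class ExpectedShardSize {

    /** Minimal shard description; the field names are illustrative only. */
    public record Shard(String shardKey, boolean restoringFromSnapshot) {}

    private ExpectedShardSize() {}

    /**
     * Returns the number of bytes the shard is expected to occupy once initialized.
     * Assumed order: snapshot size for restores, then cluster info, then defaultValue.
     */
    public static long get(
        Shard shard,
        long defaultValue,
        Map<String, Long> snapshotShardSizes,
        Map<String, Long> clusterInfoShardSizes
    ) {
        if (shard.restoringFromSnapshot()) {
            Long snapshotSize = snapshotShardSizes.get(shard.shardKey());
            if (snapshotSize != null) {
                return snapshotSize;
            }
        }
        return clusterInfoShardSizes.getOrDefault(shard.shardKey(), defaultValue);
    }
}
```

Passing `0L` as the default keeps the old "uses no space" behaviour only for shards with no size information at all, while restores pick up their snapshot size.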

Comment on lines 207 to 208
final long expectedShardSize = getExpectedShardSize(shardRouting, 0L, routingAllocation);
final var shardToInitialize = unassignedPrimaryIterator.initialize(nodeId, null, expectedShardSize, changes);
Contributor

Likewise, reverting this change doesn't make any DesiredBalanceComputerTests fail.

// ensure new value is within bounds
leastAvailableSpaceUsage.put(nodeId, updateWithFreeBytes(leastUsage, delta));
private void updateDiskUsage(Map<String, DiskUsage> availableSpaceUsage, String nodeId, String path, ShardId shardId, long delta) {
if (reservedSpace.getOrDefault(new NodeAndPath(nodeId, path), ReservedSpace.EMPTY).containsShardId(shardId)) {
Contributor

Could you clarify which shards have reserved space already?

);
}

public void testInitializeShardFromSearchableSnapshot() {
Contributor

nit

Suggested change
public void testInitializeShardFromSearchableSnapshot() {
public void testInitializeShardFromPartialSearchableSnapshot() {

since the simple (non-partial) searchable snapshot case is tested in the test above with a randomBool.

);
}

public void testRelocateSearchableSnapshotShard() {
Contributor

nit

Suggested change
public void testRelocateSearchableSnapshotShard() {
public void testRelocatePartialSearchableSnapshotShard() {

new ClusterInfoTestBuilder() //
.withNode(fromNodeId, new DiskUsageBuilder(1000, 1000))
.withNode(toNodeId, new DiskUsageBuilder(1000, 1000))
.withShard(shard, 0)
Contributor

nit: could also have the comment

// partial searchable snapshot uses DiskThresholdDecider.SETTING_IGNORE_DISK_WATERMARKS resulting
// in a 0 size reported

here as well
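Assuming, as the suggested comment says, that a partial searchable snapshot shard is reported with a size of 0, a simulated relocation of such a shard should leave both nodes' free space unchanged. The `RelocationSketch` class below is hypothetical and only illustrates that expectation; it does not reproduce `ClusterInfoTestBuilder` or the real simulator.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a simulated relocation moves a shard's reported size between nodes. */
public class RelocationSketch {

    private final Map<String, Long> freeBytes = new HashMap<>();

    public RelocationSketch withNode(String nodeId, long free) {
        freeBytes.put(nodeId, free);
        return this;
    }

    /** Frees the shard's space on the source node and consumes it on the target node. */
    public void relocate(String fromNode, String toNode, long shardSize) {
        freeBytes.merge(fromNode, shardSize, Long::sum);
        freeBytes.merge(toNode, -shardSize, Long::sum);
    }

    public long freeBytes(String nodeId) {
        return freeBytes.getOrDefault(nodeId, 0L);
    }

    /** Assumption taken from the review comment: partial searchable snapshots report size 0. */
    public static long reportedSize(boolean partialSearchableSnapshot, long onDiskSize) {
        return partialSearchableSnapshot ? 0L : onDiskSize;
    }
}
```

Relocating a partial searchable snapshot shard with `reportedSize(true, ...)` leaves both nodes at their starting free space, which is the behaviour a test like `testRelocatePartialSearchableSnapshotShard` would be expected to assert.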


var snapshot = new Snapshot("repository", new SnapshotId("snapshot", randomUUID()));

var shardSizeInfo = Maps.<String, Long>newHashMapWithExpectedSize(5);
Contributor

nit: Why 5 here and below? I only see 3 potential .put() operations below.

shardSizeInfo.put(shardIdentifierFromRouting(shardIdFrom(indexMetadata1, 0), true), ByteSizeValue.ofGb(8).getBytes());
shardSizeInfo.put(shardIdentifierFromRouting(shardIdFrom(indexMetadata1, 1), true), ByteSizeValue.ofGb(8).getBytes());

// index-2 is restored earlier but not allocated according to the desired balance
Contributor

but not allocated according to the desired balance

How is that signified? E.g., in case 1 below it says it's initializing on the desired node.

Contributor Author

Correct, it is initializing but has not started yet. I'll clarify that.

@idegtiarenko
Contributor Author

I am going to split the reserved-space handling into a separate PR to keep this one simpler, as that aspect turned out to be more complex than originally anticipated.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

Contributor

@kingherc kingherc left a comment

LGTM2 as long as CI is happy

Labels
>bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.13.0

5 participants