This repository has been archived by the owner on Dec 16, 2021. It is now read-only.

Add simple logic locality-awareness reassignments #116

Merged

Conversation

kabochya
Contributor

This is the first part of locality-aware reassignments using rackId/availability-zone info. Currently it will not fail over to other localities, nor will it balance replicas across localities. These issues will be addressed in follow-up PRs.

@kabochya kabochya requested a review from yuyang08 April 1, 2019 18:56
@ambud
Contributor

ambud commented Apr 2, 2019

@kabochya could you please squash the commits?

}
}

// Second pass to fill in replacements without target brokers in locality
Contributor

is this still work-in-progress?

Contributor Author

@kabochya kabochya Apr 2, 2019

This PR is ready; it implements the trivial logic without handling cases such as failing over to other localities when no instances are available in the current locality, or spreading the reassignments across different localities when there are multiple replicas of the same topic partition in the same locality as the dead broker. These issues will be addressed in subsequent PRs.
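A minimal sketch of the first-pass behavior described here, with hypothetical names (pickReplacement, brokerToRack) rather than the PR's actual code: a replacement broker is chosen only from the dead replica's own rack, and null is returned instead of failing over to another locality.

```java
import java.util.Map;
import java.util.Set;

// Illustrative sketch only, not DoctorKafka's implementation.
class LocalityAwarePicker {
  /** Returns a broker id in the same rack with spare capacity, or null if none. */
  static Integer pickReplacement(String deadReplicaRack,
                                 Map<Integer, String> brokerToRack,
                                 Set<Integer> brokersWithCapacity) {
    for (Integer brokerId : brokersWithCapacity) {
      if (deadReplicaRack != null && deadReplicaRack.equals(brokerToRack.get(brokerId))) {
        return brokerId;  // first broker in the same locality with spare capacity
      }
    }
    return null;  // no candidate in this locality; no cross-locality failover here
  }
}
```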

@@ -12,34 +12,37 @@
import org.apache.kafka.common.Node;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.junit.jupiter.api.AfterAll;
Contributor

remove this unused import ?

import org.junit.jupiter.api.Test;

import java.sql.Time;
Contributor

remove this unused import ?

ReplicaStatsManager.bytesInStats = oldBytesInStats;
ReplicaStatsManager.bytesOutStats = oldBytesOutStats;
@Test
void testLocalityAwareReassignments() throws Exception {
Contributor

could we add another test for the scenario where there is not enough capacity in the rack to assign the partition?

}
// push the brokers back to brokerQueue to keep invariant true
brokerQueue.addAll(unusableBrokers);
return success ? result : null;
}

/**
* Similar to getAlternativeBrokers, but locality aware
Contributor

a general question: when broker.rack is not set on the Kafka cluster side, kafkastats will have null for rackId. In that case, if the user sets locality_awareness.enabled to true in the settings, what is the expected behavior? Can we add an explanation about this in the comment?

@kabochya kabochya force-pushed the feature/locality-aware-reassignment branch 3 times, most recently from 10a8edf to f7a67ce Compare April 15, 2019 18:42
@kabochya
Contributor Author

@yuyang08
Added notifications for failures on locality-aware reassignments. However, the partial reassignment strategy (still assigning oosReplicas even if some other oosReplicas fail to get reassigned) is inconsistent with the original behavior, which drops the whole reassignment if even one oosReplica fails to get reassigned. Should we unify the behavior of these two methods? I have documented the differences in the comments of the generateReassignmentPlanForDeadBrokers method.
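For comparison, a rough sketch of the two strategies discussed above, under assumed names (neither class nor methods exist in DoctorKafka as written):

```java
import java.util.*;

// Illustrative only: contrast partial vs. all-or-nothing reassignment planning.
class ReassignmentStrategies {
  /** Partial strategy: keep what can be reassigned, record the rest as failures. */
  static Map<String, Integer> partialPlan(List<String> oosReplicas,
                                          Map<String, Integer> replacements,
                                          List<String> failures) {
    Map<String, Integer> plan = new HashMap<>();
    for (String replica : oosReplicas) {
      Integer broker = replacements.get(replica);
      if (broker != null) {
        plan.put(replica, broker);
      } else {
        failures.add(replica);   // remembered and alerted on later
      }
    }
    return plan;
  }

  /** Original strategy: drop the whole plan if any replica cannot be placed. */
  static Map<String, Integer> allOrNothingPlan(List<String> oosReplicas,
                                               Map<String, Integer> replacements) {
    Map<String, Integer> plan = new HashMap<>();
    for (String replica : oosReplicas) {
      Integer broker = replacements.get(replica);
      if (broker == null) {
        return null;             // a single failure abandons the reassignment
      }
      plan.put(replica, broker);
    }
    return plan;
  }
}
```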

@kabochya kabochya force-pushed the feature/locality-aware-reassignment branch from f7a67ce to 2aa0ff4 Compare April 15, 2019 20:13
@kabochya kabochya force-pushed the feature/locality-aware-reassignment branch from 2aa0ff4 to 308725c Compare April 24, 2019 20:38
if (replacedNodes == null) {
if (isLocalityAware) {
continue;
Contributor

with this, will we exit the loop with success == true ?

in that case, will we get an email alert if doctorkafka fails to find a locality-aware assignment?

Contributor Author

@yuyang08 Yes, we will get a different email alert if the assignment partially failed, and these failures will be captured in the hashmap reassignmentToLocalityFailures. L672-680 will then check whether there are partial failures during locality-aware reassignments and alert by email.
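Schematically, that check amounts to something like this (reassignmentToLocalityFailures is named above; the println is only a stand-in for DoctorKafka's email alert):

```java
import java.util.List;
import java.util.Map;

// Illustrative only: alert if any locality-aware reassignments failed.
class PartialFailureCheck {
  static void alertIfPartialFailures(Map<String, List<Integer>> reassignmentToLocalityFailures) {
    if (!reassignmentToLocalityFailures.isEmpty()) {
      System.err.println("Locality-aware reassignment partially failed for: "
          + reassignmentToLocalityFailures.keySet());
    }
  }
}
```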

Contributor Author

Added a guard to set success to false if no locality-aware reassignments were made.
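Roughly, that guard amounts to the following (names are assumptions, not the actual change):

```java
// Illustrative guard only: treat "no locality-aware reassignment made" as failure.
class LocalityReassignmentGuard {
  static boolean evaluateSuccess(int localityAwareReassignmentsMade) {
    return localityAwareReassignmentsMade > 0;
  }
}
```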

@kabochya kabochya force-pushed the feature/locality-aware-reassignment branch from 308725c to 941b142 Compare April 26, 2019 17:02
Unit test for testing SINGLE oosReplica
@kabochya kabochya force-pushed the feature/locality-aware-reassignment branch from 941b142 to 34a4665 Compare April 30, 2019 17:28
Contributor

@yuyang08 yuyang08 left a comment

thanks for making the fix!!

@kabochya
Contributor Author

kabochya commented Apr 30, 2019

Removed partial reassignments

@kabochya kabochya merged commit e1e7139 into pinterest:master Apr 30, 2019