-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Flesh out v2 LoadBalancer #203
Conversation
WIP effort at filling out a handler for the ApplicationLoadBalancer to match the interface from the old (v1/classic) load balancer within the BaragonService. Should allow us to handle BaragonAgent connections / lgoad balancer connections using the new API, once it gets finished.
Move away from using `String.format(...)` in logging calls, in favor of slf4j's native interpolation with `{}`. This has no real impact and was just a quick win in the process of working on other parts of the application.
Removing a target (=== Instance) should (probably) work now, both individually and as part of the sync cycle. Adding new instances is going to take a little more work because ApplicationLoadBalancers have a little bit more infrastructural overhead than Classic load balancers. This commit also finishes making the change-over from using a string to represent the load balancer (TrafficSource). There was a datastore that should needed to move over to the new interface, which meant that there was a resource that also needed to move over.
Currently have sync ~ working in the ApplicationLoadBalancer. At the moment, syncing will 1. Find a target group with the same name as a baragon group, if possible 2. Guarantee that the targets (== instances) that baragon agents think that they're on are members of the target group 3. Guarantee that the load balancers that they baragon agents think they're talking to can reach all the az's that the baragon agents think that they're on 4. De-register targets that don't appear to have a known baragon agent associated with them, or which appear to have an uncommunicative baragon agent associated with them. It is limited in that, among other things, it requires that the target group be named the same as the baragon group. At the moment, the application load balancer will not support explicit requests to add or remove targets and requires a certain amount of infrastructure to already be in place.
cc @ssalinas Mind taking a look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bunch of smaller comments for efficiency/etc. Nice first draft of this!
* For this class I am working under the assumption that the ID's of Targets can | ||
* be interchanged with the ID's of instances. That is, if you ask this class | ||
* for the health of instance 12345, then you will actually expect the health | ||
* of target 12345. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be accurate. The docs allude to the ability to register something besides an ec2 instance as a target (maybe future functionality?) but we won't have to worry about that. For our purposes it's accurate that instance id == target id 👍
} else { | ||
LOG.debug(String.format("Agent %s is already registered", agent)); | ||
LOG.debug("Agent {} is already registered", agent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for updating all of these
@@ -68,7 +70,8 @@ public LoadBalancerResource(BaragonLoadBalancerDatastore loadBalancerDatastore, | |||
@POST | |||
@Path("/{clusterName}/sources") | |||
public BaragonGroup addSource(@PathParam("clusterName") String clusterName, @QueryParam("source") String source) { | |||
return loadBalancerDatastore.addSourceToGroup(clusterName, source); | |||
TrafficSource trafficSource = new TrafficSource(source, TrafficSourceType.CLASSIC); | |||
return loadBalancerDatastore.addSourceToGroup(clusterName, trafficSource); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll need an additional endpoint that can take a TrafficSource
json body. Maybe add that as one with @Path("/{clusterName}/traffic-sources")
and we can mark the old one as @Deprecated
if (baragonGroup.getSources().isEmpty()) { | ||
LOG.debug("No traffic sources present for group {}", baragonGroup.getName()); | ||
} else { | ||
Collection<LoadBalancer> elbsForBaragonGroup = getLoadBalancersByBaragonGroup(baragonGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we've had problems in the past with hitting aws' api rate limits. Rather than making an individual call for each (even though it is filtered down), it might be wise to move this outside of the loop. We can describe all ALBs in one call, then grab what we need from the object. It's likely faster to loop through the object in memory than to make a new call for new data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that I got most of this, but it is still making calls in a pretty tight loop in isLastHeathlyInstance
during de-registration. I can try and patch that up too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a tough one, especially since we really want to be sure that info is up to date for that check. If we can lessen the api calls in other areas at least i think that is fine. We are doing something similar for classic load balancers with isLastHealthyInstance
as well
|
||
LOG.debug("Registering TargetGroup for baragon group {}", baragonGroup.getName()); | ||
Optional<TargetGroup> maybeTargetGroup = guaranteeTargetGroupFor(baragonGroup); | ||
if (maybeTargetGroup.isPresent()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this isn't present let's log, maybe at WARN
level, so that we have an indicator of what's going on
} | ||
} else { | ||
/* Is this really an appropriate thing to do here? */ | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good here since we catch it right below. We want to continue trying to sync the remaining agents (which we do), but this is a rather blocking error for that one agent, so logging at error level + the notify is good to do here.
} | ||
|
||
guaranteeHasAllTargets(targetGroup, baragonAgents); | ||
guaranteeAzEnabled(baragonAgents, loadBalancers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The az enable will need to go before the target addition most likely. In the case of the classic load balancers, if the az/subnet were not enabled, any registration in that az/subnet would fail. I would suspect it is likely the case here as well, although I can't find any docs that say so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swapped. I think it mattered more when the instances were registered against the load balancers directly, but I couldn't find anything to say that TargetGroup
s cared about subnets or az's.
} | ||
|
||
try { | ||
SetSubnetsRequest subnetsRequest = new SetSubnetsRequest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does SetSubnets override all current ones? If so we'd want to make sure this list also includes ones currently active so we aren't removing any that might still be needed (i.e. subnetsOnLoadBalancer + subnetIds here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thank you!
} | ||
} | ||
|
||
private void guaranteeAzEnabled(Collection<BaragonAgentMetadata> agents, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Read over the docs here again. In the old classic the az/subnet operations were a bit more separated. I think here we can get away with gathering the list of subnets from the agents and feeding that to guaranteeHasAllSubnets
, rather than doing the az check and than having to check subnets after
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is that if there were agents that were assigned to the same az, but different subnets then we might accidentally wind up clobbering access to one of the subnets. I think load balancers are still on a one-subnet-per-az policy, and I didn't see anything that said what would happen if two subnets in the same az were added at the same time.
|
||
Collection<TargetDescription> removableTargets = new HashSet<>(); | ||
for (TargetDescription targetDescription : targetsOnGroup) { | ||
if (! knownAgentIds.contains(targetDescription.getId()) && canDeregisterAgent(baragonGroup, targetDescription.getId())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should be able to copy some of the logic from here for this. If you find yourself rewriting bits of those methods, feel free to move them to the ElasticLoadBalancer
class so they can be shared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the data you are looking at is slightly different than you are expecting here. Your agentIds
method below is returning Baragon's agent id (which in this case is the url on which the agent can be hit by BaragonService). What you probably want in the knwonAgentIds
variable is the instance IDs which should be equal to the target id. Instance ID should also be available in the BaraongAgentMetadata. Maybe rename that list as well to better show what it contains once you update it.
Added a number of comments around efficiency of the api calls we are making/etc @PtrTeixeira , nice first draft of this! |
Patch up (most) of the issues raised during review. Still to do: 1. Migrate duplicated logic in `knownAgents` &c into `ElasticLoadBalancer` 2. Add an additional endpoint that offers a little bit more control over creating traffic sources to `LoadBalancerResource`
I had duplicated some of the functionality in `ElasticLoadBalancer` w/r/t checking whether an instance / target could be de-registered in order to keep the scope of the changes that I was making smaller. This DRYs it a little bit; makes `ElasticLoadBalancer` take an ID rather than a model object from the elasticloadbalancingv1, so it can be used from both the `ApplicationLoadBalancer` and the `ClassicLoadBalancer`.
String nextPage = result.getNextMarker(); | ||
loadBalancers.addAll(result.getLoadBalancers()); | ||
|
||
while (! nextPage.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, will the aws API ever return a null for this, or is it just an empty string when there are no more pages? If it's possible there might be a null, I'd opt for !Strings.isNullOrEmpty()
here
deregisterIfOld(baragonGroup, targetGroup, baragonAgents, targets); | ||
} | ||
} else { | ||
LOG.debug("Not TargetGroup for Baragon Group {}", baragonGroup); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not -> No
public void syncAll(Collection<BaragonGroup> baragonGroups) { | ||
Collection<LoadBalancer> allLoadBalancers = getAllLoadBalancers(baragonGroups); | ||
try { | ||
for (BaragonGroup baragonGroup : baragonGroups) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since a few of the methods in this for loop can possibly throw exceptions, we should probably move the try/catch inside the loop. This way we are still syncing the remaining groups, even if one should fail
return targetDescriptions; | ||
} | ||
|
||
private Collection<TargetDescription> listOldTargets(BaragonGroup baragonGroup, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the sake of clarify I'd rename this method to something like listRemovableTargets
. Old
is just a bit generic in terms of what we are actually finding here.
|
||
private Collection<TargetDescription> listOldTargets(BaragonGroup baragonGroup, | ||
Collection<TargetDescription> targetsOnGroup, | ||
Collection<BaragonAgentMetadata> knownAgents) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful what you call 'known' for agents. There is a concept of BaragonKnownAgentMetadata
which is the agent metadata along with a 'list seen' timestamp (stored in a separate zk node). However, what you are passing here is actually the regular agent metadata for the active agent (as stored in the leader-latch zk node)
|
||
Collection<TargetDescription> removableTargets = new HashSet<>(); | ||
for (TargetDescription targetDescription : targetsOnGroup) { | ||
if (! knownAgentIds.contains(targetDescription.getId()) && canDeregisterAgent(baragonGroup, targetDescription.getId())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the data you are looking at is slightly different than you are expecting here. Your agentIds
method below is returning Baragon's agent id (which in this case is the url on which the agent can be hit by BaragonService). What you probably want in the knwonAgentIds
variable is the instance IDs which should be equal to the target id. Instance ID should also be available in the BaraongAgentMetadata. Maybe rename that list as well to better show what it contains once you update it.
Once the comments above are addressed we should just need to implement the |
Involved a handful of changes, including a nullity check, swapping the order of a loop and try block, and clarifying some conceptual differences related to agent metadata.
} | ||
|
||
LOG.debug("ELB sync complete for group {}", baragonGroup); | ||
} catch (AmazonClientException acexn) { | ||
LOG.error("Could not retrieve elb information due to ELB client error {}", acexn); | ||
exceptionNotifier.notify(acexn, ImmutableMap.of("groups", baragonGroups == null ? "" : baragonGroups.toString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We're already accessing baragonGroups
above by entering the for loop, probably don't need an additional null check
This was just a copy-paste relic; when I swapped the order of the try block and the loop, the error message still checked for the nullity of the _collection_ and had the _collection_ in the error message. The nullity check was removed as being unnecessary, and the error message now refers to the particular group that failed, rather than the collection of groups.
Implement methods to register and remove instances (ie, targets) in the context of an application load balancer. This works a little bit differently, semantically, than the equivalent methods for the classic load balancers. In particular, it treats the `TargetGroup` as the `TrafficSource`, and registers and deregisters from the `TargetGroup` rather than from load balancers individually. This means that whereas before it was possible to register and deregister a single instances from individual load balancers, now it is only possible to register and deregister from load balancers on the scope of the target group. For example, suppose that LB1 and LB2 were talking to Instance. It would have been possible to remove Instance from LB2 and leave it talking to LB1. With ApplicationLoadBalancers, LB1 and LB2 talk to a target group, so they can't be individually talked to at the instance level. It winds up working out the same way, if you consider ClassicLoadBalancers and TargetGroups to be the equivalent TrafficSources, rather than ClassicLoadBalancers and ApplicationLoadBalancers. You just have the target belong to multiple TargetGroups.
Going to merge this into the other ALB pr, will be simpler to rebase off of the basepom upgrade that way |
WIP effort at filling out a handler for the ApplicationLoadBalancer to
match the interface from the old (v1/classic) load balancer within the
BaragonService. Should allow us to handle BaragonAgent connections /
lgoad balancer connections using the new API, once it gets finished.