
Refactor EIP Operator #348

Closed · jubrad wants to merge 4 commits from the refactor-eip-operator branch

Conversation

@jubrad (Contributor) commented Jul 21, 2023

    - transition eip/pod watchers to mostly adjust tags
    - update logic of eip controller to:
      1. validate current claim or remove it
      2. detach if attached and unclaimed
      3. find a pod/node to claim
      4. attach to the node or pod's node
      5. update status to reflect the current state

Refactor EIP Operator to better handle multiple selector matches

I believe this change set represents the changes discussed here:

Node:
reconciliation
  - search through EIPs, find ones that match labels and are unclaimed
  - if no matches, do nothing
  - if a match, claim it with the node UUID

cleanup
  - remove claim from EIP


EIP:
reconciliation:
  - if claimed
    - check node/pod conditions; disconnect from terminating resources
    - if still claimed
      - associate with the claimed node/pod
      - return

  - if unclaimed
    - detach if attached to something
    - search through nodes/pods
    - find ones that match and are not terminating
    - attach and update status
    - return


Pods:
basically the same as nodes (except for handling target annotations and auto-created EIPs)
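
To make the intended ordering concrete, here's a minimal, self-contained sketch of the EIP reconciliation flow described above. The types and helpers (`EipStatus`, `Candidate`, plain `uid` strings) are illustrative stand-ins, not the operator's actual kube-rs/AWS types:

```rust
#[derive(Debug, Clone, Default)]
struct EipStatus {
    claim: Option<String>,       // uid of the node/pod that has claimed this EIP
    attached_to: Option<String>, // uid of the resource the address is attached to
}

#[derive(Debug, Clone)]
struct Candidate {
    uid: String,
    matches_selector: bool,
    terminating: bool,
}

fn reconcile_eip(status: &mut EipStatus, candidates: &[Candidate]) {
    // 1. Validate the current claim; drop it if the claimant is gone or terminating.
    let claim_valid = match &status.claim {
        Some(uid) => candidates.iter().any(|c| c.uid == *uid && !c.terminating),
        None => false,
    };
    if status.claim.is_some() && !claim_valid {
        status.claim = None;
    }

    // 2. Detach if attached while unclaimed (or attached to the wrong resource).
    if status.attached_to.is_some() && status.attached_to != status.claim {
        status.attached_to = None;
    }

    // 3. If unclaimed, find a matching, non-terminating resource to claim.
    if status.claim.is_none() {
        if let Some(c) = candidates
            .iter()
            .find(|c| c.matches_selector && !c.terminating)
        {
            status.claim = Some(c.uid.clone());
        }
    }

    // 4./5. Attach to the claimed resource; the mutated status is what the
    // controller would patch back at the end of reconciliation.
    if status.attached_to.is_none() {
        status.attached_to = status.claim.clone();
    }
}

fn main() {
    let mut status = EipStatus::default();
    let candidates = vec![Candidate {
        uid: "node-a".into(),
        matches_selector: true,
        terminating: false,
    }];
    reconcile_eip(&mut status, &candidates);
    println!("{status:?}");
}
```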

Concerns:

  • I'm not sure I fully understand how this interacts with the pod controller's dns_target_annotations.
  • I believe we've walked through all of the scenarios we could think of; there are potentially some we hadn't thought of. Hopefully those get spotted during test/review.
  • In the eip controller, I'm using a mutable copy of the status to track state as I operate, and I patch it at the end. I think this works, but I have some concerns that some action will be performed and an error down the line will leave an unpatched status (see the sketch below). Just wanted to get some thoughts on this.
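
The sketch below illustrates that third concern with stand-in types (a hypothetical `EipStatus` and helpers, not the operator's real code): the status copy records side effects as they happen, but nothing is written back unless the whole reconcile succeeds.

```rust
#[derive(Debug, Clone, Default)]
struct EipStatus {
    association_id: Option<String>,
    claim: Option<String>,
}

// Stand-in for "search for a matching node/pod and claim it"; imagine it can fail.
fn find_and_claim(status: &mut EipStatus) -> Result<(), String> {
    status.claim = Some("node-abc".to_string());
    Ok(())
}

fn reconcile(stored: &EipStatus) -> Result<EipStatus, String> {
    // Work on a mutable copy of the stored status.
    let mut status = stored.clone();

    // Imagine we just disassociated the address in AWS: the side effect is
    // real, but so far it is recorded only in the local copy.
    status.association_id = None;

    // If this (or any later step) errors, we return early, the copy is
    // dropped, and the stored status still shows the old association.
    find_and_claim(&mut status)?;

    // Only on full success does the caller patch the stored status with this.
    Ok(status)
}

fn main() {
    let stored = EipStatus {
        association_id: Some("eipassoc-123".to_string()),
        claim: None,
    };
    match reconcile(&stored) {
        Ok(new_status) => println!("patch status to {new_status:?}"),
        Err(e) => println!("status left unpatched: {e}"),
    }
}
```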

Testing:
This passed tests on the first run, which is either very surprising or an indication that the things I broke are outside our local test scope. Either way, I'd like to test this on a personal MZ stack; I'll reach out for some pointers and update this with results.

**Update**
I've tested this on a personal stack.

  • For pods, I deleted envd pods and added/removed environments; these all seem to work great.
  • For nodes, I deleted egress nodes and upped the replica count on the egress no-op replica set so two nodes were competing for the same EIP. This seems to work great. In the two-node case, we won't swap over until the node with the existing claim removes its claim, likely by termination, although failing to match the EIP is another possible reason. In both cases the failover happens super quick and appears stable.

I believe I've also handled the transition case to this new method.
To test this I patched the eip resource status to set claims to None, and ensured reconciliation of the EIP correctly maintained the association and only added the new claim value to the status.

Happy to do any other recommended testing.

@jubrad jubrad force-pushed the refactor-eip-operator branch 2 times, most recently from a69caab to f659285 Compare July 21, 2023 15:34
@jubrad jubrad force-pushed the refactor-eip-operator branch from f659285 to c8a9a95 Compare March 8, 2024 21:55
@jubrad jubrad marked this pull request as ready for review March 8, 2024 21:55
@jubrad jubrad requested a review from a team as a code owner March 8, 2024 21:55
@jubrad jubrad force-pushed the refactor-eip-operator branch 2 times, most recently from a555827 to e13d38a Compare March 11, 2024 19:00
@alex-hunt-materialize (Collaborator) left a comment


1. Get current status of EIP.
2. Get matching pods or nodes.
3. Sort list for consistency.
4. Choose single matching pod or node.
5. If associated, check whether the EIP is associated with the matched pod or node. If not, disassociate.
6. If not associated, associate with the matched pod or node.
7. Update claim.
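
As an illustration of the "sort for consistency, choose a single match" steps (a hypothetical helper, not code from this PR): sorting the matched resources by a stable key makes the selection deterministic across reconcile loops.

```rust
// Sorting matched resource names before picking one makes the choice
// deterministic, so repeated reconciles converge on the same target.
fn choose_match(mut names: Vec<String>) -> Option<String> {
    names.sort();
    names.into_iter().next()
}

fn main() {
    let matched = vec!["node-b".to_string(), "node-a".to_string()];
    assert_eq!(choose_match(matched), Some("node-a".to_string()));
}
```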

.map(|pod| { pod.metadata.name.clone().unwrap_or_default() })
.collect::<Vec<String>>()
.join(",")
);
Collaborator

This should be unreachable, as kubernetes does not allow multiple pods with the same name.

@jubrad jubrad force-pushed the refactor-eip-operator branch 3 times, most recently from 9e95f6c to 34e5be5 Compare March 12, 2024 16:10
@@ -20,9 +18,10 @@ skip = [
{ name = "hashbrown", version = "0.12.3" },
{ name = "hashbrown", version = "0.14.0" },
{ name = "nix", version = "0.26.4" },
{ name = "nix", version = "0.27.1" },
Contributor

can we continue to add all versions of duplicate deps here? it makes it easier to track when things change

Contributor Author

we can but cargo deny was yelling at me, maybe I need to adjust my configuration?

{ name = "ordered-float", version = "3.4.0" },
{ name = "fastrand", version = "2.0.1" },
{ name = "regex-automata", version = "0.4.6" },
{ name = "regex-syntax", version = "0.6.29" },
Contributor

i'm fairly sure we can get rid of these duplicates with some pretty easy selective downgrades, which i think would be preferable if possible

Contributor Author

It's definitely possible, I spent like 2 hours fighting with versions and updating code after updates. I don't really care to hop down that rabbit hole again.

// get potential attachments
let matched_pods: Vec<Pod> = if eip.selects_with_pod() {
pod_api
.list(&ListParams::default())
Contributor

i don't think this is true - manage=true means that the eip operator's pod operator should auto-create the eip, but if we just manually create an Eip resource, it should be able to attach to any pod, even without that label

}

// Find new claim, it's possible it'll match the existing
// association, in which case, we won't disassociate
Contributor

is this comment still true? it looks like the disassociation logic is already handled separately above

let mut associatiate_with_node: Option<Node> = None;
if status.claim.is_none() {
match eip.spec.selector {
EipSelector::Node { selector: _ } => match matched_nodes.len() {
Contributor

if matched_nodes.is_empty() { maybe?

@jubrad jubrad force-pushed the refactor-eip-operator branch 2 times, most recently from dfe117d to c77400b Compare March 12, 2024 22:32
);
}
(None, None) => {
info!("Eip {} is correctly associated!", eip.name_unchecked());
Collaborator

Suggested change
info!("Eip {} is correctly associated!", eip.name_unchecked());
info!("Eip {} is correctly not associated!", eip.name_unchecked());

Contributor Author

If the node and ip are none then we don't want to modify our association or create a new association.

Collaborator

Yes, it is correct that it is not associated. The phrasing of the log message makes it seem like it is associated and that that is correct, which isn't what we are trying to indicate. We are trying to indicate both that it is not associated, and that that is the correct state.

}
}
false
}
Collaborator

Generally speaking, AWS is the source of truth for where things are associated, not the status field. The status field should only be a nice reference for the humans when looking at k9s, not part of the core decision logic.

This is especially true when you are making a clone of the status and making changes along the way, as it is easy to get confused about what the real state is vs the one stored on the EIP object, which hasn't been updated.

I'm pretty sure everywhere you use these functions, you have already described allocations, so you would not need to make any additional network calls to use the AWS values, rather than the status.

Contributor Author

matches_pod and matches_node are not used in the eip controller, where we're using a cloned status. In the pod/node controller cases we don't actually have the AWS resource (except in the pod case, to do the DNS annotation in a very specific scenario). associated_with_node is used, but only during disassociate, which makes no sense to come after associate and should always use the Eip object's status, never a cloned status.

If we want to use AWS as the source of truth in the pod/node case we'll need to fetch every eip that matches the reconciled resource to check if we're associated with it. I don't believe that is a required change.
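
For reference, a check that treats AWS as the source of truth might look roughly like this sketch, assuming the aws-sdk-ec2 crate; `associated_with_instance` and its parameters are hypothetical, not functions from this repo:

```rust
use aws_sdk_ec2::Client;

/// Hypothetical helper: ask EC2 whether the allocation is currently
/// associated with the given instance, rather than trusting the status field.
async fn associated_with_instance(
    ec2: &Client,
    allocation_id: &str,
    instance_id: &str,
) -> Result<bool, aws_sdk_ec2::Error> {
    let out = ec2
        .describe_addresses()
        .allocation_ids(allocation_id)
        .send()
        .await?;
    Ok(out
        .addresses
        .unwrap_or_default()
        .iter()
        .any(|addr| addr.instance_id.as_deref() == Some(instance_id)))
}
```

As the reply above notes, in the pod/node controllers this would mean describing every matching EIP, which is the trade-off being discussed.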

add_dns_target_annotation(&pod_api, name, &public_ip, allocation_id).await?;
return Ok(None);
}
} else if let Some(eip) = eips.iter().find(|eip| eip.matches_pod_name(&pod.name_unchecked()) && !eip.attached()) {
Collaborator

Why && !eip.attached()? Don't we want it to re-reconcile if it is attached to the wrong pod?

Contributor Author

I don't think we want to steal EIPs; in fact, that's the entire reason we're in the predicament of needing this rewrite. There's probably an argument for pods that if one is matched and the EIP is attached to something else we should take it, but that really only works because we only match on name for pods; if we matched on labels we wouldn't be able to make that assumption.

.ok_or(Error::MissingAddresses)?
.swap_remove(0);
let public_ip = eip_description.public_ip.ok_or(Error::MissingPublicIp)?;
add_dns_target_annotation(&pod_api, name, &public_ip, allocation_id).await?;
Contributor Author

@alex-hunt-materialize
There is a previously unhandled case here: add_dns_target_annotation only occurs when the EIP is added to a pod, so if the EIP no longer matches the pod it is associated with, we did not remove the annotation. My inclination is to let this one slip in this PR. I don't think it's likely we encounter this, since we're not modifying EIPs this way and likely don't need to support pods much longer.

@jubrad jubrad force-pushed the refactor-eip-operator branch 3 times, most recently from 229c6b2 to 88560a2 Compare March 13, 2024 13:53
@jubrad (Contributor Author) commented Mar 13, 2024

Side note here:
There's a larger issue with the autocreate_eip label. If an EIP is created and matches a pod, we'll remove it; since pods are unique this is probably fine, but if there are two EIPs that match a pod, it'll just delete one kind of at random. Along with this, if an EIP is auto-created and then modified to point at a different pod, it will not be deleted. That one is probably fine, especially if we only ever allow matching by pod name, but I'm not sure these edge cases are well documented or entirely thought through. I don't plan on handling either of these cases.

 - remove claims
@jubrad jubrad force-pushed the refactor-eip-operator branch from 88560a2 to e7b9df0 Compare March 13, 2024 15:57
.ok_or(Error::MissingAddresses)?
.swap_remove(0);
let public_ip = eip_description.public_ip.ok_or(Error::MissingPublicIp)?;
add_dns_target_annotation(&pod_api, name, &public_ip, allocation_id).await?;
Collaborator

Would it make sense to add this target annotation when we associate the EIP with the Pod? We could then check for the annotation when disassociating, and remove it if it matches the EIP address.

// we are not associated
if let Some(eip) = eips
.iter()
.find(|eip| eip.matches_pod_name(&pod.name_unchecked()) && !eip.attached())
Collaborator

I wonder about the !eip.attached() part. EIPs with pod matchers may only match a single pod, so if it is attached, but not associated with this pod, that means it is incorrectly associated and should be re-reconciled.
