-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dynamic timeout for Azure operations in machinery module #2233
base: master
Are you sure you want to change the base?
Conversation
This pull request introduces a dynamic timeout mechanism for Azure operations in the CAPE machinery module. The new feature enhances the reliability and efficiency of Azure-related tasks, particularly during scale set operations and long-running processes.
Thank you for this work. I am going to run this in my setup now. I use the scaling feature, so I will be able to test that portion for you. |
Perfect, let me know if you have any issues i will be more than happy to help you |
@leoiancu21 I had hoped that your new timeout design would help with the issue I am dealing with in the auto-scaling, but it has the same end behavior of locking up the analysis machines. When I force CAPE to remain at a single instance, your timeouts work well for me. I have found the where the issues are with auto-scaling and am working on a clean fix. When I get that done, I will revisit these changes to give better feedback. Sorry I could not help out more. |
I was having the same issue today (I also noticed that sometimes when the instances have the Update (Running) status the analysis start breaking the azsniffer), check this parameter on your VMSS |
The scheduler attempts to give a task to the machines before they have the agent up and running. This is an azure specific issue, I believe, because of how we have to start our agent on each reimaging. You will see that the same thing happens with new instances spun up. The I think the scaling options there may be misleading, too. We are technically using a "manual scale" as far as Azure is concerned. We send a signal to update the |
@leoiancu21 Another thing to note, as it is not in any documentation.. If you do not append your Bad naming: I am working on a backwards compatible PR for this now that will work for everyone. |
Thank you very much for that, on a previous issue this problem was discussed but i never dared to touch that part of the code |
@ChrisThibodeaux gave a look at azure logs and i can confirm that after a VMSS instance delete (when cape deltes it from the db too) the instance SKU count gets edited too, from what i've seen it works in reverse too so maybe using the same function to rise the instance count with more machines could be our answer, seems like the overprovisioning declared in the configs with this build is not totally respected and since I didn't see this behaviour before my changes I think that it has something to do with what I've added. What do you think about it, did this issue start after using my code or you had previous experiences with it ? Does it still happen when using the right tag naming convention ? Any feedback would be more than helpful |
thank you for PR, can you plz fix
|
Always a pleasure, after i fix the scale set capacity issue described in the previous comments i will commit the code removing those unused variables too |
@leoiancu21 For the increase/decrease in size of the scale set, look at the It is my take that this is performing perfectly, and as expected. If you haven't already, you can change this line to be this: The real issue with all of this is the freezing that happens with added instances (beyond the first one) and reimaged instances. They are handed tasks before their agent is ready. I am hitting issues with my reworking of that part, but I may have a solution using The hard part is going to be getting that concept to work when reimaging. |
@ChrisThibodeaux Love it, thanks, I saw that there is a procedure for scaling down the scaleset that checks for the SKU cores if usage_to_look_for:
usage = next((item for item in usages if item.name.value == usage_to_look_for), None)
if usage:
number_of_new_cpus_required = self.instance_type_cpus * (
number_of_relevant_machines_required - number_of_machines
)
# Leaving at least five spaces in the usage quota for a spot VM, let's not push it!
number_of_new_cpus_available = int(usage.limit) - usage.current_value - int(self.instance_type_cpus * 5)
if number_of_new_cpus_available < 0:
number_of_relevant_machines_required = machine_pools[vmss_name]["size"]
elif number_of_new_cpus_required > number_of_new_cpus_available:
old_number_of_relevant_machines_required = number_of_relevant_machines_required
number_of_relevant_machines_required = (
number_of_relevant_machines + number_of_new_cpus_available / self.instance_type_cpus
)
log.debug(
f"Quota could be exceeded with projected number of machines ({old_number_of_relevant_machines_required}). Setting new limit to {number_of_relevant_machines_required}"
) This could also be related to our issue since |
@leoiancu21 Are you using spot instances? I am not currently, as I am too unfamiliar with what they are. That would be the only way to enter that conditional. |
@ChrisThibodeaux Yep, the only difference between spot instances and normal ones (leaving out the price) is that microsoft can kill your VM if they need some more computing resources, this logic also applies to the time needed to generate a new VM this is why I started working on a dynamic timeout, the price difference is too much to not use it. I'll find a way to make it wait for the updated SKU cores value that Azure provides, I just have to make some tests before |
@leoiancu21 Don't make the change to Edit: My initial comment here was wrong. There is nothing breaking anymore, I believe it had to have been other changes I made. |
@leoiancu21 Update from my end. My assumptions about the agent not being in a ready state before a job is handed to it's machine was wrong. In reality, the ScalingBoundedSemaphore was not being released soon enough after a VMSS instance was reimaged. Once it got to the machine lock, it was getting permanently stuck. I am putting together a PR today with, hopefully, the fix to this. I have been able to run 30+ tasks in a row with 5 machines up, so there is progress. |
@leoiancu21 Apologies for taking so long on figuring out the last of the bugs. I have a PR with the changes that allow me to use auto-scaling without issue now, here. I placed in a fix for the |
Thank you so much, I'll try this in the next few days, I'll proceed to close this pull request as soon as it works. Thanks again for your work |
btw can you fix this 2?
|
so is this ready? |
@leoiancu21 I can pull this branch and test it out with the new changes I made in the other PR. Do you want me to send you a diff of any updates I make? |
Sorry for my absence, I was out of office this week, I'll pull the new changes and fix my code asap |
@ChrisThibodeaux |
@leoiancu21 Sorry, I won't be able to test that for a few more days. Our little one came earlier than expected, so I am out of office for a while. I will try to find time to test this out though. Is there any way you could push a commit with the merges? |
@ChrisThibodeaux I'd wait for it, for what I've seen at the moment when the scaleset is created it loads the machines inside the DB, assigns tasks and processes them, at the moment I also forced the scaleset sku capacity with Azure rules : The main issue now concerns tasks in pending that are assigned to old machines ids (when an analysis is completed a reimage is triggered that causes the machine to change it's id), apparently Azure treats a reimaged machine as a new one and our DB is not updated at that specific point :
As you can see it proceeds to delete machines that don't match the current scaleset machine ids only after it fails binding the machine ID to the new task, this causes Cape to go into a loop where every 5 minutes checks Machines that, as I said and showed inside the debug logs before, don't match with the currently available hosts running in the scaleset (I belive there is something missinig in the update db function when machines are deleted after an analysis and when they are deleted after a failed bind for a new task) : This causes new tasks to be assigned wrongly and either fail the analysis or get stuck inside the pending queue. At the moment I'm still investigating which function is not working properly, I'll publish a commit inside this thread as soon as I'm able to make it work, at the moment I don't think that publishing the broken merge would be useful in any way. So take your time and congrats on your newborn baby |
@leoiancu21 Compare these two sections from your branch and master's
I am nearly positive you are experiencing the same issue I was and what lead to me creating that PR in the first place. I can't remember the exact details, but I had instances being deleted after jobs just like you describe. Rebasing your branch to master (or atleast this commit) will almost certainly fix the issue you are seeing. |
Sorry for the long wait, got assigned to other projects after the summer holidays, I'm going to try rebasing az.py right now, I'll let you know what comes out of it |
any progres here? |
At the moment i switched to normal VMs instead of spot due to too many inconsistencies, and i'm working on fixing other small issues, once i'm back on my test environment i will provide some updates |
Add dynamic timeout and VM status check for Azure operations in machinery module
Overview
This pull request introduces two main improvements to the Azure machinery module in CAPE:
These enhancements aim to improve the reliability and efficiency of Azure-related tasks, particularly during scale set operations and long-running processes.
Key Changes
1. Dynamic Timeout Implementation
_handle_poller_result
method that dynamically adjusts timeout durations based on the state of Azure operations._are_machines_still_upgrading
method to check the status of machines in a scale set.AZURE_TIMEOUT
constant for other Azure operations where dynamic timing was not observed to be necessary.2. VM Status Check During Initialization
Benefits
Implementation Details
AZURE_TIMEOUT
.Azure._azure_api_call()
for consistency._wait_for_vms_running
,_are_all_vms_running
, and_is_vm_running
have been added to manage VM status checks._add_machines_to_db
method has been updated to only add machines in the 'Running' state.Testing
Reviewer Notes
Reviewers, please pay particular attention to:
_handle_poller_result
and_are_machines_still_upgrading
methods, especially their application to VMSS creation and reimaging.AZURE_TIMEOUT
for non-VMSS operations and whether this approach is appropriate.Conclusion
These changes aim to enhance the robustness of CAPE's Azure integration without introducing significant complexity or changing the overall architecture of the machinery module. They address important issues with timeout handling for specific VMSS operations and VM initialization, leading to a more stable and reliable Azure integration for CAPE, while maintaining existing behavior for other operations.
Your feedback and suggestions are greatly appreciated!