Bring Azure machinery auto-scaling up to date. #2243

Merged · 9 commits · Jul 23, 2024

Conversation

@ChrisThibodeaux (Contributor) commented on Jul 21, 2024:

This PR brings the Azure machinery, specifically its auto-scaling, to a usable out-of-the-box state.

Updates include:

  • Config key to allow selection of the ephemeral OS disk type, allowing a wider range of Azure VM types to be used.
  • Config key to allow changing the timer used by the monitor thread to scale the VMSS up or down (a sketch illustrating how these two options might be consumed follows this list).
  • Fix for the missing machines key in the config, which was leading to errors.
  • Change the method of finding "relevant" machines for submitted tasks, allowing more flexibility in naming your VMSS in the config. This removes the need to append the pool_tag to the end of the VMSS name.
  • Workaround to semaphore deadlocking issues.
  • Removal of machine deletion in the stop method of modules/machinery/az.py when the scale set is marked is_scaling_down, which was leading to a StaleDataError.
  • Tiny, non-critical update to a log message in lib/cuckoo/core/guest.py, correcting the order of variables.
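
Purely for illustration, here is a minimal sketch of how a machinery module might consume the two options described above. The option names (ephemeral_os_disk, scaling_timer), their defaults, and the helper functions are assumptions made for this example, not necessarily the names introduced by this PR.

```python
import threading

# Hedged, illustrative-only sketch: the option names "ephemeral_os_disk" and
# "scaling_timer" and their defaults are assumptions for this example.
def read_scaling_options(options: dict) -> tuple[bool, int]:
    ephemeral_os_disk = bool(options.get("ephemeral_os_disk", False))
    scaling_timer = int(options.get("scaling_timer", 300))  # seconds
    return ephemeral_os_disk, scaling_timer

def monitor_vmss(machinery, scaling_timer: int, stop_event: threading.Event) -> None:
    """Wake up every `scaling_timer` seconds and scale the VMSS up or down."""
    while not stop_event.is_set():
        try:
            machinery.scale_vmss()  # scale based on relevant queued tasks
        except Exception as exc:
            print(f"VMSS scaling failed: {exc}")
        # A configurable interval replaces a hard-coded sleep.
        stop_event.wait(scaling_timer)
```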

Concerns:
The fix for the StaleDataError is not ideal. It means that scaling down when no relevant tasks are available must wait both for all current tasks to finish and for the monitor thread to wake up and run. A much better fix would be to signal the machine for deletion in az.py immediately after the AnalysisManager releases it. Any tips on how that might be done are greatly appreciated. (Edit: this no longer applies; machines are now correctly deleted in such cases.)
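
A rough sketch of the general idea behind the edit above, i.e. deleting a machine as soon as it is released when its scale set is being scaled down, rather than waiting for the monitor thread. The class and the names machine_pools, is_scaling_down, and delete_machine are assumptions for illustration, not taken verbatim from az.py.

```python
# Illustrative-only sketch; the names machine_pools, is_scaling_down, and
# delete_machine are assumed, not taken from the actual az.py.
class AzureMachinerySketch:
    def __init__(self):
        # vmss_name -> {"is_scaling_down": bool, ...}
        self.machine_pools = {}

    def delete_machine(self, label: str) -> None:
        print(f"deleting {label} from its scale set and the database")

    def release(self, machine) -> None:
        """On release, delete the machine immediately if its scale set is
        flagged for scale-down, instead of waiting for the monitor thread."""
        vmss_name = machine.label.rsplit("-", 1)[0]  # assumed "<vmss>-<n>" labels
        pool = self.machine_pools.get(vmss_name, {})
        if pool.get("is_scaling_down"):
            self.delete_machine(machine.label)
```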

If there are any ideas on how to better handle the semaphore issue, I would love to apply a better solution there; I am not entirely comfortable with what I have in place here. The main issues leading to deadlocking that I have noticed are:

  1. In update_limit -- The _limit_value is only updated when there are machines and the number of machines is less than the _upper_limit. So no update of _limit_value occurs when we are at the upper limit.
  2. In check_for_starvation -- The available_count is the number of unlocked machines. If machines are waiting at the semaphore, they are already locked. This makes updating the _value here impossible.

As a check for releasing the semaphore, I used the condition of locked machines <= configured machine limit, but I think that should maybe be total machines (locked or unlocked) <= configured machine limit.
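
To make the two failure modes concrete, below is a minimal, self-contained sketch of a scaling bounded semaphore using the attribute names mentioned above (_value, _limit_value, _upper_limit). It is not the actual implementation; it only illustrates why a limit that is never refreshed at the upper bound, together with a starvation check that counts only unlocked machines, can leave waiters blocked.

```python
import threading

# Minimal illustrative sketch (not the real implementation) of a bounded
# semaphore whose limit tracks the machine count, showing the two
# deadlock-prone spots described above.
class ScalingSemaphoreSketch:
    def __init__(self, value: int, upper_limit: int):
        self._cond = threading.Condition(threading.Lock())
        self._value = value          # currently available slots
        self._limit_value = value    # target slot count (number of machines)
        self._upper_limit = upper_limit

    def acquire(self) -> None:
        with self._cond:
            while self._value <= 0:
                self._cond.wait()    # waiters block here while machines are locked
            self._value -= 1

    def release(self) -> None:
        with self._cond:
            self._value += 1
            self._cond.notify()

    def update_limit(self, machine_count: int) -> None:
        # Issue 1: with a strict "< upper_limit" guard, the limit is never
        # refreshed once the pool is sitting at the upper limit.
        with self._cond:
            if 0 < machine_count < self._upper_limit:
                self._limit_value = machine_count

    def check_for_starvation(self, available_count: int) -> None:
        # Issue 2: available_count counts only unlocked machines. If every
        # machine is locked while threads are waiting in acquire(),
        # available_count is 0 and _value can never be raised here.
        with self._cond:
            if self._value < available_count <= self._limit_value:
                self._value = available_count
                self._cond.notify_all()
```

With these rules, a pool pinned at its upper limit while every machine is locked never has its limit or value corrected, which matches the deadlock scenarios described.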

Thoughts?

@doomedraven (Collaborator) commented:
Let me know when it's ready to merge.

@ChrisThibodeaux (Contributor, Author) commented on Jul 22, 2024:

@doomedraven I've tried to force the semaphore to lock up in various situations, and it looks good now. Should be good to merge.

@ChrisThibodeaux force-pushed the master branch 2 times, most recently from d3d751d to 86c5e0a on July 22, 2024 at 15:28.
@ChrisThibodeaux (Contributor, Author) commented:
@doomedraven I believe I have found a better way of managing machine deletion. Please hold off on the merge until I can get the commits together.

@ChrisThibodeaux (Contributor, Author) commented:
@doomedraven Okay, the fix is in. It's now operating as expected.

@doomedraven (Collaborator) commented:
Thank you for the update.

@doomedraven merged commit 3b4dfd7 into kevoreilly:master on Jul 23, 2024.
5 checks passed