Bring Azure machinery auto-scaling up to date. #2243

Merged · 9 commits · Jul 23, 2024

Conversation

@ChrisThibodeaux (Contributor) commented on Jul 21, 2024:

This PR brings the Azure machinery, specifically its auto-scaling, to a usable out-of-the-box state.

Updates include:

  • Config key to allow selection of the ephemeral OS disk type, allowing a wider range of Azure VM types to be used.
  • Config key to allow changing the timer used by the monitor thread to scale the VMSS up or down (a sketch illustrating how these two options might be consumed follows this list).
  • Fix for the missing machines key in the config, which was leading to errors.
  • Change the method of finding "relevant" machines for submitted tasks, allowing more flexibility in naming your VMSS in the config. This removes the need to append the pool_tag to the end of the VMSS name.
  • Workaround to semaphore deadlocking issues.
  • Removal of machine deletion in the stop method of modules/machinery/az.py when the scale set is marked is_scaling_down, which was leading to a StaleDataError.
  • Tiny, non-critical update to a log message in lib/cuckoo/core/guest.py, correcting the order of variables.
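
Purely for illustration, here is a minimal sketch of how a machinery module might consume the two options described above. The option names (ephemeral_os_disk, scaling_timer), their defaults, and the helper functions are assumptions made for this example, not necessarily the names introduced by this PR.

```python
import threading

# Hedged, illustrative-only sketch: the option names "ephemeral_os_disk" and
# "scaling_timer" and their defaults are assumptions for this example.
def read_scaling_options(options: dict) -> tuple[bool, int]:
    ephemeral_os_disk = bool(options.get("ephemeral_os_disk", False))
    scaling_timer = int(options.get("scaling_timer", 300))  # seconds
    return ephemeral_os_disk, scaling_timer

def monitor_vmss(machinery, scaling_timer: int, stop_event: threading.Event) -> None:
    """Wake up every `scaling_timer` seconds and scale the VMSS up or down."""
    while not stop_event.is_set():
        try:
            machinery.scale_vmss()  # scale based on relevant queued tasks
        except Exception as exc:
            print(f"VMSS scaling failed: {exc}")
        # A configurable interval replaces a hard-coded sleep.
        stop_event.wait(scaling_timer)
```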

Concerns:
The fix for the StaleDataError is not ideal. It means that scaling down when no relevant tasks are available must wait both for all current tasks to finish and for the monitor thread to wake up and run. A much better fix would be to signal the machine for deletion in az.py immediately after the AnalysisManager releases it. Any tips on how that might be done are greatly appreciated. (Edit: this no longer applies; machines are now correctly deleted in such cases.)
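
A rough sketch of the general idea behind the edit above, i.e. deleting a machine as soon as it is released when its scale set is being scaled down, rather than waiting for the monitor thread. The class and the names machine_pools, is_scaling_down, and delete_machine are assumptions for illustration, not taken verbatim from az.py.

```python
# Illustrative-only sketch; the names machine_pools, is_scaling_down, and
# delete_machine are assumed, not taken from the actual az.py.
class AzureMachinerySketch:
    def __init__(self):
        # vmss_name -> {"is_scaling_down": bool, ...}
        self.machine_pools = {}

    def delete_machine(self, label: str) -> None:
        print(f"deleting {label} from its scale set and the database")

    def release(self, machine) -> None:
        """On release, delete the machine immediately if its scale set is
        flagged for scale-down, instead of waiting for the monitor thread."""
        vmss_name = machine.label.rsplit("-", 1)[0]  # assumed "<vmss>-<n>" labels
        pool = self.machine_pools.get(vmss_name, {})
        if pool.get("is_scaling_down"):
            self.delete_machine(machine.label)
```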

If there are any ideas on how to better handle the semaphore issue, I would love to apply a better solution there; I am not entirely comfortable with what I have in place here. The main issues leading to deadlocking that I have noticed are:

  1. In update_limit -- The _limit_value is only updated when there are machines and the number of machines is less than the _upper_limit. So no update of _limit_value occurs when we are at the upper limit.
  2. In check_for_starvation -- The available_count is the number of unlocked machines. If machines are waiting at the semaphore, they are already locked. This makes updating the _value here impossible.

As a check for releasing the semaphore, I used the condition of locked machines <= configured machine limit, but I think that should maybe be total machines (locked or unlocked) <= configured machine limit.
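
To make the two failure modes concrete, below is a minimal, self-contained sketch of a scaling bounded semaphore using the attribute names mentioned above (_value, _limit_value, _upper_limit). It is not the actual implementation; it only illustrates why a limit that is never refreshed at the upper bound, together with a starvation check that counts only unlocked machines, can leave waiters blocked.

```python
import threading

# Minimal illustrative sketch (not the real implementation) of a bounded
# semaphore whose limit tracks the machine count, showing the two
# deadlock-prone spots described above.
class ScalingSemaphoreSketch:
    def __init__(self, value: int, upper_limit: int):
        self._cond = threading.Condition(threading.Lock())
        self._value = value          # currently available slots
        self._limit_value = value    # target slot count (number of machines)
        self._upper_limit = upper_limit

    def acquire(self) -> None:
        with self._cond:
            while self._value <= 0:
                self._cond.wait()    # waiters block here while machines are locked
            self._value -= 1

    def release(self) -> None:
        with self._cond:
            self._value += 1
            self._cond.notify()

    def update_limit(self, machine_count: int) -> None:
        # Issue 1: with a strict "< upper_limit" guard, the limit is never
        # refreshed once the pool is sitting at the upper limit.
        with self._cond:
            if 0 < machine_count < self._upper_limit:
                self._limit_value = machine_count

    def check_for_starvation(self, available_count: int) -> None:
        # Issue 2: available_count counts only unlocked machines. If every
        # machine is locked while threads are waiting in acquire(),
        # available_count is 0 and _value can never be raised here.
        with self._cond:
            if self._value < available_count <= self._limit_value:
                self._value = available_count
                self._cond.notify_all()
```

With these rules, a pool pinned at its upper limit while every machine is locked never has its limit or value corrected, which matches the deadlock scenarios described.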

Thoughts?

@doomedraven (Collaborator) commented:
Let me know when it's ready to merge.

@ChrisThibodeaux (Contributor, Author) commented on Jul 22, 2024:

@doomedraven I've tried to force the semaphore to lock up in various situations, and it looks good now. Should be good to merge.

@ChrisThibodeaux force-pushed the master branch 2 times, most recently from d3d751d to 86c5e0a on July 22, 2024 at 15:28.
@ChrisThibodeaux (Contributor, Author) commented:
@doomedraven I believe I have found a better way of managing machine deletion. Please hold off on the merge until I can get the commits together.

@ChrisThibodeaux (Contributor, Author) commented:
@doomedraven Okay, the fix is in. It's now operating as expected.

@doomedraven (Collaborator) commented:
Thank you for the update.

@doomedraven merged commit 3b4dfd7 into kevoreilly:master on Jul 23, 2024.
5 checks passed