
sqlalchemy StaleDataError with Azure machinery #2242

Closed
6 tasks done
ChrisThibodeaux opened this issue Jul 20, 2024 · 27 comments

@ChrisThibodeaux
Contributor

ChrisThibodeaux commented Jul 20, 2024

About accounts on capesandbox.com

  • Issues are not the way to ask for account activation. Ping capesandbox on Twitter with your username.

This is open source and you are getting free support, so be friendly!

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest version
  • I did read the README!
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed
  • I'm reporting the issue to the correct repository (for multi-repository projects)
  • I have read and checked all configs (with all optional parts)

Expected Behavior

When using Azure machinery with a VMSS, analysis should complete without database errors.

Current Behavior

When the last available task has been assigned to a machine, any other analysis that finishes causes a sqlalchemy.orm.exc.StaleDataError. Afterwards, the remaining tasks stay in the running state indefinitely.

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue.

  1. Allow the VMSS max size to be greater than one.
  2. Submit more than one task, so that at least two machines are running analyses.
  3. When the last task is assigned, whichever running task finishes next produces a StaleDataError.

Context

SQLAlchemy version = 1.4.50

Using Azure machinery with auto-scaling. The ScalingBoundSemaphore usage in the master branch has a bug when used with Azure auto-scaling that consistently leads to deadlocking. I have implemented a workaround that updates the semaphore in analysis_manager immediately before the machine is locked. If that is relevant, I can share the fix I have for that as well.

Question Answer
Git commit 303bdcc
OS version Ubuntu 22.04

Failure Logs

Jul 20 17:33:16 hostname python3[251990]: 2024-07-20 17:33:16,107 [lib.cuckoo.core.guest] DEBUG: Task #8: Analysis is still running (id=cape-guest-vmss-win10x64_0, ip=10.2.2.5)
Jul 20 17:33:16 hostname python3[251990]: 2024-07-20 17:33:16,923 [lib.cuckoo.core.guest] DEBUG: Task #6: Analysis is still running (id=cape-guest-vmss-win10x64_2, ip=10.2.2.6)
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,099 [lib.cuckoo.core.resultserver] DEBUG: Task #6 had connection reset by peer for <Context for b'LOG'>
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,100 [lib.cuckoo.core.resultserver] DEBUG: Task #6 had connection reset by peer for <Context for None>
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,100 [lib.cuckoo.core.resultserver] DEBUG: Task #6 had connection reset by peer for <Context for b'BSON'>
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,939 [lib.cuckoo.core.guest] INFO: Task #6: Analysis completed successfully (id=cape-guest-vmss-win10x64_2, ip=10.2.2.6)
Jul 20 17:33:18 hostname sudo[252717]: pam_unix(sudo:session): session closed for user root
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,004 [lib.cuckoo.core.plugins] DEBUG: Stopped auxiliary module: Sniffer
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,125 [lib.cuckoo.core.resultserver] DEBUG: Task #6: Stopped tracking machine 10.2.2.6
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,125 [modules.machinery.az] DEBUG: Stopping machine 'cape-guest-vmss-win10x64_2'
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,138 [lib.cuckoo.core.analysis_manager] ERROR: Task #6: failure in AnalysisManager.run
Jul 20 17:33:18 hostname python3[251990]: Traceback (most recent call last):
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 519, in run
Jul 20 17:33:18 hostname python3[251990]:     self.launch_analysis()
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 481, in launch_analysis
Jul 20 17:33:18 hostname python3[251990]:     success = self.perform_analysis()
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 455, in perform_analysis
Jul 20 17:33:18 hostname python3[251990]:     with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
Jul 20 17:33:18 hostname python3[251990]:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     next(self.gen)
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 346, in machine_running
Jul 20 17:33:18 hostname python3[251990]:     with self.db.session.begin():
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 235, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     with util.safe_reraise():
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     compat.raise_(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
Jul 20 17:33:18 hostname python3[251990]:     raise exception
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 233, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     self.commit()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 832, in commit
Jul 20 17:33:18 hostname python3[251990]:     self._prepare_impl()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 811, in _prepare_impl
Jul 20 17:33:18 hostname python3[251990]:     self.session.flush()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3449, in flush
Jul 20 17:33:18 hostname python3[251990]:     self._flush(objects)
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3588, in _flush
Jul 20 17:33:18 hostname python3[251990]:     with util.safe_reraise():
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     compat.raise_(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
Jul 20 17:33:18 hostname python3[251990]:     raise exception
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3549, in _flush
Jul 20 17:33:18 hostname python3[251990]:     flush_context.execute()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
Jul 20 17:33:18 hostname python3[251990]:     rec.execute(self)
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
Jul 20 17:33:18 hostname python3[251990]:     util.preloaded.orm_persistence.save_obj(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 237, in save_obj
Jul 20 17:33:18 hostname python3[251990]:     _emit_update_statements(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1035, in _emit_update_statements
Jul 20 17:33:18 hostname python3[251990]:     raise orm_exc.StaleDataError(
Jul 20 17:33:18 hostname python3[251990]: sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched.

Task 8 was the last task. It began running while task 6 was running. When task 6 finished, I got the error above. This is after successfully reporting on tasks 1-5 and 7.

Thoughts on where to begin looking for the cause?

@ChrisThibodeaux
Contributor Author

ChrisThibodeaux commented Jul 20, 2024

Correction to the setup steps in my post. This error is also happening when using a single VMSS instance. All other info applies.

@ChrisThibodeaux
Contributor Author

Found the issue. In def stop(self) in az.py, this line was deleting the machine from the database before AnalysisManager could release it here. I suppose this only happened when the last task was taken, because the Azure machinery was not attempting to scale down until there were no relevant tasks remaining.

I will have a fix for this in the PR bringing azure machinery up to date.
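
For anyone skimming, here is the ordering problem in miniature, as schematic pseudocode distilled from this thread (not the verbatim CAPEv2 source):

    # Buggy flow: stop() deletes the Machine row, then release() tries to
    # UPDATE that same row.
    machinery.stop(machine.label)            # az.py removed the DB row here
    machinery.release(machine)               # UPDATE machines ... matches 0 rows
                                             # -> sqlalchemy.orm.exc.StaleDataError

    # Fixed flow: keep the row alive through release(), drop it afterwards.
    machinery.stop(machine.label)            # power off / terminate only
    machinery.release(machine)               # row still present; UPDATE succeeds
    machinery.delete_machine(machine.name)   # now safe to remove the row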

@dor15

dor15 commented Aug 27, 2024

Hey Chris, what's up?
We are facing the same issue but with aws.py.
This is the error we got:
"sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched"

"Traceback (most recent call last):
File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 498, in run
self.launch_analysis()
File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 460, in launch_analysis
success = self.perform_analysis()
File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 444, in perform_analysis
with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
File "/usr/lib/python3.10/contextlib.py", line 142, in exit
next(self.gen)
File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 334, in machine_running
self.machinery_manager.machinery.release(self.machine)
File "/opt/CAPEv2/modules/machinery/aws.py", line 339, in release
self._start_or_create_machines()
File "/opt/CAPEv2/modules/machinery/aws.py", line 231, in _start_or_create_machines
current_available_machines = self.db.count_machines_available()
File "/opt/CAPEv2/lib/cuckoo/core/database.py", line 1002, in count_machines_available
return machines.count()
"

Do you think it's the same issue you were faced with? if so, can you share the solution you made?

@ChrisThibodeaux
Contributor Author

@dor15 I think it is exactly like the issue I faced. You are going to want to look at aws.py and review the two methods stop and release. In essence, if you are removing the machine in stop, you are going to have a bad time. That must occur in release. Should this be the actual issue, the fix is to move machine deletion into the release method and out of stop.

This is where stop and release are happening in the flow of a sandbox run: https://github.com/kevoreilly/CAPEv2/blob/master/lib/cuckoo/core/analysis_manager.py#L292-L341

Notice that the stop occurs first by calling self.machinery_manager.stop_machine(self.machine) which is just this:

def stop_machine(self, machine: Machine) -> None:
    self.machinery.stop(machine.label)

You can see the fix that I made for the azure machinery:
before
after
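
To make the exit-path ordering concrete, this is a rough reconstruction of machine_running pieced together from the tracebacks in this thread (names and structure are approximate, not the verbatim source):

    from contextlib import contextmanager

    @contextmanager
    def machine_running(self):
        try:
            yield  # the analysis runs while the machine is up
        finally:
            # stop() runs first; if the machinery deletes the Machine row
            # here, the release() below has nothing left to UPDATE.
            self.machinery_manager.stop_machine(self.machine)
            with self.db.session.begin():
                self.machinery_manager.machinery.release(self.machine)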

@ChrisThibodeaux
Contributor Author

Honestly, I'm surprised you are running into that now and no one else has reported it before. If you do happen to find the problem and the above fixes it for you, please create a pull request with the changes. It goes a long way!

@doomedraven
Collaborator

maybe it's only related to cloud providers? I don't have those with KVM

@ChrisThibodeaux
Contributor Author

ChrisThibodeaux commented Aug 27, 2024

@doomedraven I think it is most likely how az/aws.py have chosen to deal with autoscaling. The az module was having issues because it was deleting the machine from the database in the stop call. Looking at aws.py, I see that only happens when self._is_autoscaled is true for the machine it's called on. When the AnalysisManager tells the MachineryManager to release the machine, it has already been removed from the database, so the call using that machine's ID results in stale data errors.

I don't think any other machinery module explicitly removes machines from the database as part of the normal flow in stop. Thus, when the MachineryManager attempts to release the machine, it is still present in the database.
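
The failure mechanism, reduced to a minimal self-contained SQLAlchemy 1.4 snippet (a toy model for illustration, not CAPEv2 code):

    from sqlalchemy import Column, Integer, String, create_engine, delete
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Machine(Base):
        __tablename__ = "machines"
        id = Column(Integer, primary_key=True)
        status = Column(String)

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine, expire_on_commit=False) as session:
        machine = Machine(status="running")
        session.add(machine)
        session.commit()

        # What stop() effectively did: delete the row behind the ORM's back.
        session.execute(delete(Machine).where(Machine.id == machine.id))

        # What release() then does: update the object the session still tracks.
        machine.status = "available"
        session.commit()  # StaleDataError: expected to update 1 row(s); 0 were matched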

@dor15 I would be very interested in knowing a bit more about your setup. Have I correctly guessed that you are using autoscaling?

@doomedraven
Collaborator

so maybe we need to backport that aws feature to az?

doomedraven reopened this Aug 28, 2024
@doomedraven
Collaborator

I'm gonna keep this open for some time till we see what we're gonna do, as it's summer and I don't have a lot of time to properly concentrate on big tasks here

@dor15

dor15 commented Aug 28, 2024

Honestly, I'm surprised you are running into that now and no one else has reported it before. If you do happen to find the problem and the above fixes it for you, please create a pull request with the changes. It goes a long way!

more than that, in the original code, in def _delete_machine_form_db(self, label), it tries to create a new session:

    session = db.Session()

and it got the following exception:

"/opt/CAPEv2/modules/machinery/aws.py" 435L, 16769B 107,18 21%
        with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
      File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
        next(self.gen)
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 325, in machine_running
        self.machinery_manager.stop_machine(self.machine)
      File "/opt/CAPEv2/lib/cuckoo/core/machinery_manager.py", line 308, in stop_machine
        self.machinery.stop(machine.label)
      File "/opt/CAPEv2/modules/machinery/aws.py", line 300, in stop
        self._delete_machine_form_db(label)
      File "/opt/CAPEv2/modules/machinery/aws.py", line 107, in _delete_machine_form_db
        session = db.Session()
    NameError: name 'db' is not defined

So I used the existing session, self.db.session, to proceed with running the code, and then I got the error I described above.
I agree with you; it is a bit weird that no one has reported this issue before.

@dor15

dor15 commented Aug 28, 2024

@dor15 I would be very interested in knowing a bit more about your setup. Have I correctly guessed that you are using autoscaling?

yes -

from aws.conf:

    [autoscale]
    # Enable auto-scale in cuckoo (by setting autoscale = yes). Recommended for better performance.
    autoscale = yes

@ChrisThibodeaux
Contributor Author

ChrisThibodeaux commented Aug 28, 2024

@dor15 I don't see session = db.session in the file at this line. I would make sure you are on the most current commit before going any further.

Anyway, try using this and see if it fixes the problem.

    def stop(self, label):
        """
        Stops a virtual machine.
        If the machine has initialized from autoscaled component, then terminate it.
        @param label: virtual machine label.
        @raise CuckooMachineError: if unable to stop.
        """
        log.debug("Stopping vm %s" % label)

        status = self._status(label)

        if status == AWS.POWEROFF:
            raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)

        if self._is_autoscaled(self.ec2_machines[label]):
            self.ec2_machines[label].terminate()
        else:
            self.ec2_machines[label].stop(Force=True)
            self._wait_status(label, AWS.POWEROFF)
            self._restore(label)

    """override Machinery method"""

    def release(self, machine: Machine) -> Machine:
        """
        we override it to have the ability to run start_or_create_machines() after unlocking the last machine
        Release a machine.
        @param label: machine label.
        """
        retval = super(AWS, self).release(machine)

        if self._is_autoscaled(self.ec2_machines[machine.label]):
            self._delete_machine_form_db(machine.label)
            self.dynamic_machines_count -= 1

        self._start_or_create_machines()
        return retval

I don't know what self.ec2_machines[label].terminate() does, so I left it alone. I moved the machine deletion and count decrement down to immediately after the machine's release.

@dor15

dor15 commented Aug 28, 2024

yes, I meant this line: https://github.com/kevoreilly/CAPEv2/blob/d0732ee9d4b01184f86136336ee4ed5af98c34ca/modules/machinery/aws.py#L108

    session = self.db.Session()

instead of this I use self.db.session before any session action, for example:

    machine = session.query(Machine).filter_by(label=label).first()
    -->
    machine = self.db.session.query(Machine).filter_by(label=label).first()

I'll try to use your fix and update. thanks!

@dor15

dor15 commented Aug 28, 2024

trying to apply your fix and now getting a new error:
    2024-08-28 09:02:48,881 [lib.cuckoo.core.analysis_manager] ERROR: Task #100: failure in AnalysisManager.run
    Traceback (most recent call last):
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 498, in run
        self.launch_analysis()
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 460, in launch_analysis
        success = self.perform_analysis()
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 444, in perform_analysis
        with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
      File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
        next(self.gen)
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 334, in machine_running
        self.machinery_manager.machinery.release(self.machine)
      File "/opt/CAPEv2/modules/machinery/aws.py", line 322, in release
        self._start_or_create_machines()
      File "/opt/CAPEv2/modules/machinery/aws.py", line 207, in _start_or_create_machines
        current_available_machines = self.db.count_machines_available()
      File "/opt/CAPEv2/lib/cuckoo/core/database.py", line 1001, in count_machines_available
        return machines.count()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 3176, in count
        return self._from_self(col).enable_eagerloads(False).scalar()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2893, in scalar
        ret = self.one()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2870, in one
        return self._iter().one()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
        result = self.session.execute(
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
        conn = self._connection_for_bind(bind)
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
        TransactionalContext._trans_ctx_check(self)
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
        raise exc.InvalidRequestError(
    sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands.

it's probably related to the way I create/use the current session in "_delete_machine_form_db" -

    def _delete_machine_form_db(self, label):
        """
        cuckoo's DB class does not implement machine deletion, so we made one here
        :param label: the machine label
        """
        try:
            self.db.session.begin_nested()
            from lib.cuckoo.core.database import Machine

            machine = self.db.session.query(Machine).filter_by(label=label).first()
            if machine:
                self.db.session.delete(machine)
                self.db.session.commit()
        except SQLAlchemyError as e:
            log.debug("Database error removing machine: {0}".format(e))
            self.db.session.rollback()
            return
        finally:
            self.db.session.close()

@dor15

dor15 commented Aug 28, 2024


the issue was solved after I commented out the last line, self.db.session.close(); now it deletes the "old" machine and creates a new one, and the whole process ends successfully.

But I'm not sure how it will affect the rest of the code (if it will affect anything).

These are the final changes I made in the code:

def stop(self, label):
    log.info("Stopping vm %s" % label)

    status = self._status(label)

    if status == AWS.POWEROFF:
        raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)

    if self._is_autoscaled(self.ec2_machines[label]):
        self.ec2_machines[label].terminate()
    else:
        self.ec2_machines[label].stop(Force=True)
        self._wait_status(label, AWS.POWEROFF)
        self._restore(label)

"""override Machinery method"""

def release(self, machine: Machine) -> Machine:
    """
    we override it to have the ability to run start_or_create_machines() after unlocking the last machine
    Release a machine.
    @param label: machine label.
    """

    retval = super(AWS, self).release(machine)
    if self._is_autoscaled(self.ec2_machines[machine.label]):
        log.info("release - start delete_machine_from_db")
        self._delete_machine_form_db(machine.label)
        self.dynamic_machines_count -= 1

    self._start_or_create_machines()
    return retval

def _delete_machine_form_db(self, label):
    """
    cuckoo's DB class does not implement machine deletion, so we made one here
    :param label: the machine label
    """
    try:
        self.db.session.begin_nested()
        from lib.cuckoo.core.database import Machine
        
        machine = self.db.session.query(Machine).filter_by(label=label).first()
        if machine:
            self.db.session.delete(machine)
            self.db.session.commit()
    except SQLAlchemyError as e:
        log.debug("Database error removing machine: {0}".format(e))
        self.db.session.rollback()
        return
    #finally:
    #    self.db.session.close()
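
For what it's worth, the close() failure matches the traceback above: release() is invoked inside AnalysisManager's with self.db.session.begin(): block, so closing the shared session inside _delete_machine_form_db tears down the very transaction the context manager still expects to commit. Schematically (assuming one shared scoped session, as the tracebacks suggest; not the verbatim source):

    with db.session.begin():                   # opened in machine_running
        machinery.release(machine)             # calls _delete_machine_form_db
        # db.session.close() here would close the outer transaction too,
        # so the very next query...
        machinery._start_or_create_machines()  # ...hits a closed transaction
                                               # -> InvalidRequestError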

@ChrisThibodeaux
Contributor Author

ChrisThibodeaux commented Aug 28, 2024

@dor15 In stop, can you replace self._delete_machine_form_db(label) with super(AWS, self).delete_machine(label)?

Total changes would now look like this:

    def stop(self, label):
        """
        Stops a virtual machine.
        If the machine has initialized from autoscaled component, then terminate it.
        @param label: virtual machine label.
        @raise CuckooMachineError: if unable to stop.
        """
        log.debug("Stopping vm %s" % label)

        status = self._status(label)

        if status == AWS.POWEROFF:
            raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)

        if self._is_autoscaled(self.ec2_machines[label]):
            self.ec2_machines[label].terminate()
        else:
            self.ec2_machines[label].stop(Force=True)
            self._wait_status(label, AWS.POWEROFF)
            self._restore(label)

    """override Machinery method"""

    def release(self, machine: Machine) -> Machine:
        """
        we override it to have the ability to run start_or_create_machines() after unlocking the last machine
        Release a machine.
        @param label: machine label.
        """
        retval = super(AWS, self).release(machine)

        if self._is_autoscaled(self.ec2_machines[machine.label]):
            super(AWS, self).delete_machine(machine.label)
            self.dynamic_machines_count -= 1

        self._start_or_create_machines()
        return retval

I think the machinery file should do as little session spawning as possible, but that's just me. Let me know if this fixes the last part of the problem.

@doomedraven
Collaborator

but are we speaking about an Azure issue here, not AWS?

@ChrisThibodeaux
Contributor Author

@doomedraven The Azure issue was solved with #2243. This is a separate issue with AWS. I have a PR draft waiting for confirmation that these changes fix it.

@doomedraven
Collaborator

ah ok thank you

@dor15

dor15 commented Aug 29, 2024

@dor15 In stop, can you replace self._delete_machine_form_db(label) with super(AWS, self).delete_machine(label).


Hey,
I believe you meant to replace self._delete_machine_form_db(label) with super(AWS, self).delete_machine(label) in the release func, not in stop.
I tried to apply the changes you suggested, but the process doesn't create a new machine to replace the one that finished the job -

2024-08-29 09:17:58,140 [modules.machinery.aws] INFO: Stopping vm i-0eef625a9c17d557b
2024-08-29 09:17:58,339 [modules.machinery.aws] INFO: instance state: running
2024-08-29 09:17:58,671 [lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database.

@ChrisThibodeaux
Contributor Author

@dor15 You're right, I said that wrong; the code was what I had intended, though. I don't understand where you are getting that log line from: [lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database. This would be easier if I could see your code. Do you have a fork that I can view?

@ethhart

ethhart commented Aug 30, 2024

You are passing the label (instance id), not the Machine.name.

@ChrisThibodeaux
Contributor Author

You are passing the label (instance id), not the Machine.name.

@ethhart Great catch, that is definitely the problem.
@dor15 The line should be super(AWS, self).delete_machine(machine.name)

I'll update the PR draft.
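
(For context on the warning dor15 saw: the base-class helper evidently looks machines up by name, so passing the EC2 label finds nothing. A hypothetical sketch of its shape, inferred from the log output rather than taken from the CAPEv2 source:)

    # Hypothetical sketch, not the actual CAPEv2 implementation.
    def delete_machine(self, name):
        machine = self.session.query(Machine).filter_by(name=name).first()
        if machine:
            self.session.delete(machine)
            self.session.commit()
        else:
            log.warning("%s does not exist in the database.", name)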

@dor15

dor15 commented Sep 1, 2024

@dor15 You're right, I said that wrong. The code was what I had intended, though. I don't understand where you are getting that log line from [lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database.. This would be easier if I could see your code. Do you have a fork that I can view?

no, I don't have a fork, but the line is from the original database code

You are passing the label (instance id), not the Machine.name.

@ethhart Great catch, that is definitely the problem. @dor15 The line should be super(AWS, self).delete_machine(machine.name)

I'll update the PR draft.

@ethhart, you are right; now it's working.

so the changes that @ChrisThibodeaux suggested work and solve the issue.

@doomedraven
Collaborator

so, now that we merged the fix, is this issue still relevant? As it says Azure, not AWS

@ChrisThibodeaux
Contributor Author

@doomedraven I should have asked dor15 to split this off into a separate ticket sooner. I think we can close now.

@doomedraven
Collaborator

Thanks
