LXC spawner proof of concept #4158

pevogam · 2020-08-27T21:52:04Z

This is just a very early implementation of an LXC spawner for post 82 LTS,
submitted to gather any opinions and possibly clarify some questions.

pevogam · 2020-08-27T21:56:39Z

Hi @clebergnu, @beraldoleal, @willianrampazzo, this draft PR is not meant to attract any of your resources. I wanted to push early however in order to also debug possible race conditions. In particular, in its current form if I do

AVOCADO_LOG_EARLY=1 AVOCADO_LOG_DEBUG=1 avocado run --test-runner nrunner --nrunner-status-server-uri 0.0.0.0:8888 --nrunner-spawner lxc /bin/true /bin/true /bin/true

this will lead to an interesting race condition where the status server is not yet started and I get some

avocado.test: Command exited with code 256 (Exception ignored in: <function _after_at_fork_child_reinit_locks at 0x7f2a5a4e1940>
Traceback (most recent call last):
  File "/usr/lib64/python3.8/logging/__init__.py", line 260, in _after_at_fork_child_reinit_locks
  File "/usr/lib64/python3.8/logging/__init__.py", line 228, in _releaseLock
RuntimeError: cannot release un-acquired lock
Traceback (most recent call last):
  File "/root/avocado-runner", line 942, in <module>
    main()
  File "/root/avocado-runner", line 938, in main
    app.run()
  File "/root/avocado-runner", line 825, in run
    return kallable(args)
  File "/root/avocado-runner", line 914, in command_task_run
    for status in task.run():
  File "/root/avocado-runner", line 714, in run
    status_service.post(status)
  File "/root/avocado-runner", line 604, in post
    self.connection = socket.create_connection((host, port))
  File "/usr/lib64/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/usr/lib64/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)

failing tests as a result. I do notice the status server eventually starting though but I think it giving control back to the main loop in

            self._server_task = await asyncio.start_server(
                self.cb,
                host=host,
                port=port)

is the reason the other coroutines progress too far before control is returned to the initialization. Do you think I might be doing the avocado-runner call too early? I have used the podman example for multiple comparison and this seems appropriate time to call it.

clebergnu · 2020-08-27T22:03:27Z

Hi @clebergnu, @beraldoleal, @willianrampazzo, this draft PR is not meant to attract any of your resources. I wanted to push early however in order to also debug possible race conditions. In particular, in its current form if I do
AVOCADO_LOG_EARLY=1 AVOCADO_LOG_DEBUG=1 avocado run --test-runner nrunner --nrunner-status-server-uri 0.0.0.0:8888 --nrunner-spawner lxc /bin/true /bin/true /bin/true
this will lead to an interesting race condition where the status server is not yet started and I get some
avocado.test: Command exited with code 256 (Exception ignored in: <function _after_at_fork_child_reinit_locks at 0x7f2a5a4e1940>
Traceback (most recent call last):
  File "/usr/lib64/python3.8/logging/__init__.py", line 260, in _after_at_fork_child_reinit_locks
  File "/usr/lib64/python3.8/logging/__init__.py", line 228, in _releaseLock
RuntimeError: cannot release un-acquired lock
Traceback (most recent call last):
  File "/root/avocado-runner", line 942, in <module>
    main()
  File "/root/avocado-runner", line 938, in main
    app.run()
  File "/root/avocado-runner", line 825, in run
    return kallable(args)
  File "/root/avocado-runner", line 914, in command_task_run
    for status in task.run():
  File "/root/avocado-runner", line 714, in run
    status_service.post(status)
  File "/root/avocado-runner", line 604, in post
    self.connection = socket.create_connection((host, port))
  File "/usr/lib64/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/usr/lib64/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)
failing tests as a result. I do notice the status server eventually starting though but I think it giving control back to the main loop in
            self._server_task = await asyncio.start_server(
                self.cb,
                host=host,
                port=port)
is the reason the other coroutines progress too far before control is returned to the initialization. Do you think I might be doing the avocado-runner call too early? I have used the podman example for multiple comparison and this seems appropriate time to call it.

Hard to tell, and ensure, most things within a thread running multiple asyncio coroutines. The correct solution is to ensure that the server is up before going further. I think we may need to introduce a ping command/response to the server, given that some advanced scenarios may put a status server in a different place / machine altogether.

What do you think?

pevogam · 2020-08-27T22:07:54Z

is the reason the other coroutines progress too far before control is returned to the initialization. Do you think I might be doing the avocado-runner call too early? I have used the podman example for multiple comparison and this seems appropriate time to call it.

Hard to tell, and ensure, most things within a thread running multiple asyncio coroutines. The correct solution is to ensure that the server is up before going further. I think we may need to introduce a ping command/response to the server, given that some advanced scenarios may put a status server in a different place / machine altogether.

What do you think?

I guess it all boils down to whether we are using a simple standard control loop for all coroutines or something more elaborate to support some staging. If certain co-routines remain preconditions for others we will definitely have to ensure those preconditions are met before we give control back to the main loop. A ping in this particular case is a good way to add extra assertions and fail early with a clearer error message.

beraldoleal · 2020-08-27T22:12:37Z

This is awesome @pevogam , I will take a closer look as soon as possible.

pevogam · 2021-05-13T16:09:43Z

Hi all, considering all the changes Avocado sees in this direction I assume by now this code is extremely out of date?

beraldoleal · 2021-05-13T16:30:55Z

Hi @pevogam IMO this is very welcome. I didn't reviewed because of the draft state. But if you still have interest in pushing this, I will review it during the next days.

clebergnu · 2021-05-13T16:53:09Z

Hi @pevogam IMO this is very welcome. I didn't reviewed because of the draft state. But if you still have interest in pushing this, I will review it during the next days.

Absolutely, this is very welcome! Sorry @pevogam we got distracted and did not gave it the proper level of attention. TBH, I need to revisit my LXC-foo, and properly review/test this. I have assigned myself too to this issue to do it ASAP.

pevogam · 2021-05-13T18:00:27Z

No worries, I was just wondering what is its status.

It is draft yes because is a very initial proposal from somebody not directly involved in core avocado development (thus possibly full of potential problems and needing some early reviews to even be considered something to continue working on). Also despite the draft state I pinged you for any feedback before just to avoid confusion from it being a draft. But again, this is a low prio and I am aware core features must be developed first so I just wanted to see if you think it is still legitimate to remain a PR or it should be discarded.

pevogam · 2021-05-21T14:00:04Z

I pushed on top of 88.1 to catch up with the various versions, will run some functional testing on my side in the coming week.

lgtm-com · 2021-05-28T17:41:42Z

This pull request introduces 1 alert when merging 9b624c9 into a82c5eb - view on LGTM.com

new alerts:

1 for Unused import

codeclimate · 2021-05-28T20:57:28Z

Code Climate has analyzed commit b0db4bb and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 36.7% (50% is the threshold).

This pull request will bring the total coverage in the repository to 69.9% (-0.1% change).

View more on Code Climate.

pevogam · 2021-05-31T13:15:26Z

I only have some spellcheck failures failed which can be easily amended. I also made sure to refresh the implementation a bit with recent changes I could spot on the podman spawner.

What I am wondering now is what is your opinion about some of our use cases for these that also relate a lot to potential Avocado VT's use cases. In particular, we current do not recreate containers for each test but reuse them across tests. This means that the typical workflow like "create a brand new container for the current test" has potential disadvantages we should either dispel or solve with extended functionality for the LXC spawner and I need you opinion about it, i.e. whether to implement it or we have other ideas. Here are some of the points against recreating LXC containers for each test:

Isolating VT tests in a container spawner (not a mere process spawner) means using vms within the container (already works well on our side) but also potentially more elaborate preparation of these containers. This means that merely installing a few packages is not enough for us as we have to create some bridges and internal networking configuration, as well as some SSH accessibility and other customization. This is well solved with container recipes in docker so I wonder if we could later define a script or recipe requirement for tests that could invoke such custom scripts. @willianrampazzo I wonder what you think about such a possibility and if you foresee some immediate problems with it.
Our plugin also focuses on greater reusability together with Avocado VT (reusable vm environment) so being able to rerun a test on an old container is not supposed to be an issue despite the good practice of requiring clean environment for each test. This is easier if one only runs a few vms of the same type or something more elaborate (we use have various vms with different roles in a single VT test but we still don't have serious problems reusing containers like this).
Some types of setup could be particularly heavy, meaning both IO intensive and time consuming. For instance even copying typically large size vms for each test and spawned container could significantly increase the test duration and multiplying this by a factor of a few thousand already means a large portion (hours) of time is spent just on copying alone.

Of course, we should still propose the standard new-container-on-new-test approach but I wonder if some of these arguments imply that a given list of "container slots" where the state machine workers could operate on would also be welcome here. What is your opinion about this? I do have a working prototype of this with the current code here and our own scheduler (our container setup is currently too complex to migrate to requirements) so some testing was already done but I wonder if this is the best solution.

willianrampazzo · 2021-05-31T18:00:52Z

Isolating VT tests in a container spawner (not a mere process spawner) means using vms within the container (already works well on our side) but also potentially more elaborate preparation of these containers. This means that merely installing a few packages is not enough for us as we have to create some bridges and internal networking configuration, as well as some SSH accessibility and other customization. This is well solved with container recipes in docker so I wonder if we could later define a script or recipe requirement for tests that could invoke such custom scripts. @willianrampazzo I wonder what you think about such a possibility and if you foresee some immediate problems with it.

Your comment reminded me about this comment from Requirement Resolver v1: https://github.com/avocado-framework/avocado/pull/4423/files#r585816355

So, I made an effort to shape the Requirement Resolver in a way that, if a runner exists for that kind of requirement, the Requirement Resolver will create a task and will try to fulfill it. This means that as long as there is a runner capable of handling container recipes, this should work without major problems.

Our plugin also focuses on greater reusability together with Avocado VT (reusable vm environment) so being able to rerun a test on an old container is not supposed to be an issue despite the good practice of requiring clean environment for each test. This is easier if one only runs a few vms of the same type or something more elaborate (we use have various vms with different roles in a single VT test but we still don't have serious problems reusing containers like this).

This comment also reminds me about this issue: #4458

We should have a new cache mechanism in the future, which may help to decrease the setup time of containers.

pevogam · 2021-05-31T23:12:31Z

Isolating VT tests in a container spawner (not a mere process spawner) means using vms within the container (already works well on our side) but also potentially more elaborate preparation of these containers. This means that merely installing a few packages is not enough for us as we have to create some bridges and internal networking configuration, as well as some SSH accessibility and other customization. This is well solved with container recipes in docker so I wonder if we could later define a script or recipe requirement for tests that could invoke such custom scripts. @willianrampazzo I wonder what you think about such a possibility and if you foresee some immediate problems with it.

Your comment reminded me about this comment from Requirement Resolver v1: https://github.com/avocado-framework/avocado/pull/4423/files#r585816355

So, I made an effort to shape the Requirement Resolver in a way that, if a runner exists for that kind of requirement, the Requirement Resolver will create a task and will try to fulfill it. This means that as long as there is a runner capable of handling container recipes, this should work without major problems.

Sounds good and portable enough. I assume for instance we could use a bash script and point to it in the way you describe. Thanks for the hints!

Our plugin also focuses on greater reusability together with Avocado VT (reusable vm environment) so being able to rerun a test on an old container is not supposed to be an issue despite the good practice of requiring clean environment for each test. This is easier if one only runs a few vms of the same type or something more elaborate (we use have various vms with different roles in a single VT test but we still don't have serious problems reusing containers like this).

This comment also reminds me about this issue: #4458

We should have a new cache mechanism in the future, which may help to decrease the setup time of containers.

I guess whether this can help really depends on the way one understands cache here. As I said above, we have a network of vms inside each container and some vms (e.g. Windows) could be 50GB or more so I am not sure how caching this across containers would help reduce the big IO unless by caching one really means reusing the container itself.

willianrampazzo · 2021-06-01T18:32:31Z

Isolating VT tests in a container spawner (not a mere process spawner) means using vms within the container (already works well on our side) but also potentially more elaborate preparation of these containers. This means that merely installing a few packages is not enough for us as we have to create some bridges and internal networking configuration, as well as some SSH accessibility and other customization. This is well solved with container recipes in docker so I wonder if we could later define a script or recipe requirement for tests that could invoke such custom scripts. @willianrampazzo I wonder what you think about such a possibility and if you foresee some immediate problems with it.

Your comment reminded me about this comment from Requirement Resolver v1: https://github.com/avocado-framework/avocado/pull/4423/files#r585816355
So, I made an effort to shape the Requirement Resolver in a way that, if a runner exists for that kind of requirement, the Requirement Resolver will create a task and will try to fulfill it. This means that as long as there is a runner capable of handling container recipes, this should work without major problems.

Sounds good and portable enough. I assume for instance we could use a bash script and point to it in the way you describe. Thanks for the hints!

Our plugin also focuses on greater reusability together with Avocado VT (reusable vm environment) so being able to rerun a test on an old container is not supposed to be an issue despite the good practice of requiring clean environment for each test. This is easier if one only runs a few vms of the same type or something more elaborate (we use have various vms with different roles in a single VT test but we still don't have serious problems reusing containers like this).

This comment also reminds me about this issue: #4458
We should have a new cache mechanism in the future, which may help to decrease the setup time of containers.

I guess whether this can help really depends on the way one understands cache here. As I said above, we have a network of vms inside each container and some vms (e.g. Windows) could be 50GB or more so I am not sure how caching this across containers would help reduce the big IO unless by caching one really means reusing the container itself.

I remember @clebergnu gave the example of a base container and new containers with required packages installed on top of the base one and then some metadata cached referring to this new container. So, basically, it would be the reuse of a container image.

I can imagine it working with virtual machines and snapshots, but I think we need some brainstorming first.

pevogam · 2021-06-01T22:16:19Z

I guess whether this can help really depends on the way one understands cache here. As I said above, we have a network of vms inside each container and some vms (e.g. Windows) could be 50GB or more so I am not sure how caching this across containers would help reduce the big IO unless by caching one really means reusing the container itself.

I remember @clebergnu gave the example of a base container and new containers with required packages installed on top of the base one and then some metadata cached referring to this new container. So, basically, it would be the reuse of a container image.

I can imagine it working with virtual machines and snapshots, but I think we need some brainstorming first.

This sounds like multiple write layers on top of the same container which could provide it with a clean starting point but also store additional setup on top that is separately accessible. This reminds me further of LXC states/snapshots and features I proposed in #4375. It is a good place for further brainstorming regarding this.

For the time being however, I still propose we allow container reuse in the current PR until we have a better solution fully implemented.

beraldoleal

Hi @pevogam, sorry for the delay reviewing this. This looks promising, and I'm looking forward for the non-draft version. Nice work here.

I just left a few comments for you.

beraldoleal · 2021-06-07T12:37:41Z