
LXC spawner proof of concept #4158

Merged: 4 commits from the lxc-spawner branch into avocado-framework:master on May 2, 2023
Conversation

pevogam (Contributor) commented Aug 27, 2020

This is just a very early implementation of an LXC spawner for the post-82 LTS releases, submitted to gather opinions and possibly clarify some questions.

pevogam (Contributor, Author) commented Aug 27, 2020

Hi @clebergnu, @beraldoleal, @willianrampazzo, this draft PR is not meant to take up any of your resources yet. However, I wanted to push it early in order to also debug possible race conditions. In particular, in its current form, if I do

AVOCADO_LOG_EARLY=1 AVOCADO_LOG_DEBUG=1 avocado run --test-runner nrunner --nrunner-status-server-uri 0.0.0.0:8888 --nrunner-spawner lxc /bin/true /bin/true /bin/true

this leads to an interesting race condition where the status server is not yet started and I get errors like

avocado.test: Command exited with code 256 (Exception ignored in: <function _after_at_fork_child_reinit_locks at 0x7f2a5a4e1940>
Traceback (most recent call last):
  File "/usr/lib64/python3.8/logging/__init__.py", line 260, in _after_at_fork_child_reinit_locks
  File "/usr/lib64/python3.8/logging/__init__.py", line 228, in _releaseLock
RuntimeError: cannot release un-acquired lock
Traceback (most recent call last):
  File "/root/avocado-runner", line 942, in <module>
    main()
  File "/root/avocado-runner", line 938, in main
    app.run()
  File "/root/avocado-runner", line 825, in run
    return kallable(args)
  File "/root/avocado-runner", line 914, in command_task_run
    for status in task.run():
  File "/root/avocado-runner", line 714, in run
    status_service.post(status)
  File "/root/avocado-runner", line 604, in post
    self.connection = socket.create_connection((host, port))
  File "/usr/lib64/python3.7/socket.py", line 728, in create_connection
    raise err
  File "/usr/lib64/python3.7/socket.py", line 716, in create_connection
    sock.connect(sa)

with tests failing as a result. I do notice the status server eventually starting, but I think the fact that it gives control back to the main loop in

            self._server_task = await asyncio.start_server(
                self.cb,
                host=host,
                port=port)

is the reason the other coroutines progress too far before control is returned to the initialization. Do you think I might be doing the avocado-runner call too early? I have used the podman spawner as an example for comparison, and this seems like the appropriate time to call it.

clebergnu (Contributor) commented:

> […] is the reason the other coroutines progress too far before control is returned to the initialization. Do you think I might be doing the avocado-runner call too early? I have used the podman spawner as an example for comparison, and this seems like the appropriate time to call it.

It is hard to tell, and to ensure, how most things are ordered within a thread running multiple asyncio coroutines. The correct solution is to ensure that the server is up before going further. I think we may need to introduce a ping command/response to the server, given that some advanced scenarios may put the status server in a different place / machine altogether.

What do you think?

pevogam (Contributor, Author) commented Aug 27, 2020

> The correct solution is to ensure that the server is up before going further. I think we may need to introduce a ping command/response to the server, given that some advanced scenarios may put the status server in a different place / machine altogether.
>
> What do you think?

I guess it all boils down to whether we are using a simple standard control loop for all coroutines or something more elaborate that supports some staging. If certain coroutines remain preconditions for others, we will definitely have to ensure those preconditions are met before we give control back to the main loop. A ping in this particular case is a good way to add extra assertions and fail early with a clearer error message.
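For illustration, a minimal sketch of such a readiness check, assuming a plain TCP status server (the helper name and its placement are mine, not part of this PR):

    import asyncio

    async def wait_for_status_server(host, port, timeout=30.0):
        """Poll until the status server accepts TCP connections (a crude ping)."""
        loop = asyncio.get_running_loop()
        deadline = loop.time() + timeout
        while True:
            try:
                _, writer = await asyncio.open_connection(host, port)
                writer.close()
                await writer.wait_closed()
                return
            except OSError:
                if loop.time() > deadline:
                    raise RuntimeError(f"status server at {host}:{port} never came up")
                await asyncio.sleep(0.1)

Awaiting such a check right after asyncio.start_server() returns would keep the task-spawning coroutines from racing ahead of the server initialization.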

beraldoleal (Member) commented:

This is awesome @pevogam, I will take a closer look as soon as possible.

@beraldoleal beraldoleal self-requested a review August 27, 2020 22:13
pevogam (Contributor, Author) commented May 13, 2021

Hi all, considering all the changes Avocado has seen in this direction, I assume by now this code is extremely out of date?

beraldoleal (Member) commented:

Hi @pevogam, IMO this is very welcome. I didn't review it because of the draft state, but if you still have interest in pushing this, I will review it in the next few days.

@clebergnu clebergnu self-assigned this May 13, 2021
clebergnu (Contributor) commented:

> Hi @pevogam, IMO this is very welcome. […] if you still have interest in pushing this, I will review it in the next few days.

Absolutely, this is very welcome! Sorry @pevogam, we got distracted and did not give it the proper level of attention. TBH, I need to revisit my LXC-foo and properly review/test this. I have also assigned myself to this to do it ASAP.

pevogam (Contributor, Author) commented May 13, 2021

No worries, I was just wondering what its status is.

It is a draft, yes, because it is a very initial proposal from somebody not directly involved in core Avocado development (thus possibly full of potential problems and needing some early reviews to even be considered something worth continuing). Despite the draft state, I pinged you for feedback early just to avoid confusion about it being a draft. But again, this is low priority and I am aware core features must be developed first, so I just wanted to see whether you think it is still legitimate as a PR or should be discarded.

pevogam (Contributor, Author) commented May 21, 2021

I pushed on top of 88.1 to catch up with the various versions; I will run some functional testing on my side in the coming week.

@pevogam pevogam force-pushed the lxc-spawner branch 2 times, most recently from 7a90e22 to 9b624c9 Compare May 28, 2021 17:32
lgtm-com bot commented May 28, 2021

This pull request introduces 1 alert when merging 9b624c9 into a82c5eb - view on LGTM.com

new alerts:

  • 1 for Unused import

@pevogam pevogam force-pushed the lxc-spawner branch 2 times, most recently from 3c94ee4 to b0db4bb Compare May 28, 2021 20:23
codeclimate bot commented May 28, 2021

Code Climate has analyzed commit b0db4bb and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 36.7% (50% is the threshold).

This pull request will bring the total coverage in the repository to 69.9% (-0.1% change).

View more on Code Climate.

pevogam (Contributor, Author) commented May 31, 2021

I only have some spellcheck failures left, which can easily be amended. I also made sure to refresh the implementation a bit with recent changes I could spot in the podman spawner.

What I am wondering now is your opinion about some of our use cases, which also relate a lot to Avocado VT's potential use cases. In particular, we currently do not recreate containers for each test but reuse them across tests. This means that the typical workflow of "create a brand new container for the current test" has potential disadvantages we should either dispel or solve with extended functionality for the LXC spawner, and I need your opinion on whether to implement that or whether we have other ideas. Here are some of the points against recreating LXC containers for each test:

  1. Isolating VT tests in a container spawner (not a mere process spawner) means using vms within the container (which already works well on our side) but also potentially more elaborate preparation of these containers. Merely installing a few packages is not enough for us, as we have to create bridges and internal networking configuration, as well as SSH accessibility and other customization. This is well solved with container recipes in Docker, so I wonder if we could later define a script or recipe requirement for tests that could invoke such custom scripts. @willianrampazzo I wonder what you think about such a possibility and if you foresee some immediate problems with it.

  2. Our plugin also focuses on greater reusability together with Avocado VT (a reusable vm environment), so being able to rerun a test on an old container is not supposed to be an issue, despite the good practice of requiring a clean environment for each test. This is easier if one only runs a few vms of the same type or something more elaborate (we have various vms with different roles in a single VT test but still don't have serious problems reusing containers like this).

  3. Some types of setup could be particularly heavy, meaning both IO intensive and time consuming. For instance, even copying the typically large vms for each test and spawned container could significantly increase the test duration, and multiplying this by a factor of a few thousand already means a large portion (hours) of time is spent on copying alone.

Of course, we should still propose the standard new-container-per-test approach, but I wonder if some of these arguments imply that a given list of "container slots" that the state machine workers could operate on would also be welcome here. What is your opinion on this? I do have a working prototype of this with the current code here and our own scheduler (our container setup is currently too complex to migrate to requirements), so some testing has already been done, but I wonder if this is the best solution.

willianrampazzo (Contributor) commented:

>   1. Isolating VT tests in a container spawner (not a mere process spawner) means using vms within the container […] I wonder if we could later define a script or recipe requirement for tests that could invoke such custom scripts. @willianrampazzo I wonder what you think about such a possibility and if you foresee some immediate problems with it.

Your comment reminded me about this comment from Requirement Resolver v1: https://github.com/avocado-framework/avocado/pull/4423/files#r585816355

So, I made an effort to shape the Requirement Resolver in a way that, if a runner exists for that kind of requirement, the Requirement Resolver will create a task and will try to fulfill it. This means that as long as there is a runner capable of handling container recipes, this should work without major problems.
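As a rough illustration of how a test could declare such requirements (the docstring dependency directive below follows later Avocado releases; the "script" type for invoking a container recipe is a hypothetical runner, not an existing one):

    from avocado import Test

    class BridgedVMTest(Test):
        """
        Test whose container preparation is delegated to dependency runners.

        :avocado: dependency={"type": "package", "name": "bridge-utils"}
        :avocado: dependency={"type": "script", "path": "container.sh"}
        """

        def test(self):
            # by the time this runs, the declared dependencies are fulfilled
            pass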

>   2. Our plugin also focuses on greater reusability together with Avocado VT (a reusable vm environment), so being able to rerun a test on an old container is not supposed to be an issue […]

This comment also reminds me about this issue: #4458

We should have a new cache mechanism in the future, which may help to decrease the setup time of containers.

pevogam (Contributor, Author) commented May 31, 2021

> So, I made an effort to shape the Requirement Resolver in a way that, if a runner exists for that kind of requirement, the Requirement Resolver will create a task and will try to fulfill it. This means that as long as there is a runner capable of handling container recipes, this should work without major problems.

Sounds good and portable enough. I assume for instance we could use a bash script and point to it in the way you describe. Thanks for the hints!

> This comment also reminds me about this issue: #4458
>
> We should have a new cache mechanism in the future, which may help to decrease the setup time of containers.

I guess whether this can help really depends on how one understands cache here. As I said above, we have a network of vms inside each container, and some vms (e.g. Windows) could be 50GB or more, so I am not sure how caching this across containers would help reduce the big IO, unless by caching one really means reusing the container itself.

willianrampazzo (Contributor) commented:

> I guess whether this can help really depends on how one understands cache here. […] unless by caching one really means reusing the container itself.

I remember @clebergnu gave the example of a base container and new containers with required packages installed on top of the base one and then some metadata cached referring to this new container. So, basically, it would be the reuse of a container image.

I can imagine it working with virtual machines and snapshots, but I think we need some brainstorming first.

pevogam (Contributor, Author) commented Jun 1, 2021

> I remember @clebergnu gave the example of a base container and new containers with required packages installed on top of the base one and then some metadata cached referring to this new container. So, basically, it would be the reuse of a container image.
>
> I can imagine it working with virtual machines and snapshots, but I think we need some brainstorming first.

This sounds like multiple write layers on top of the same container, which could provide it with a clean starting point but also store additional setup on top that is separately accessible. It reminds me further of the LXC states/snapshots features I proposed in #4375, which would be a good place for further brainstorming on this.

For the time being however, I still propose we allow container reuse in the current PR until we have a better solution fully implemented.
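For what it's worth, the LXC Python bindings already expose a snapshot-clone primitive that could back such write layers; a rough sketch, assuming a pre-provisioned base container named avocado-base and overlayfs support on the host:

    import lxc

    # Hypothetical fully set up base container (the 10+ min provisioning done once)
    base = lxc.Container("avocado-base")

    # Copy-on-write clone: shares the base image and keeps its own writable
    # overlay layer, so creation stays cheap even for heavy setups
    clone = base.clone("avocado-slot-1", flags=lxc.LXC_CLONE_SNAPSHOT,
                       bdevtype="overlayfs")
    if clone:
        clone.start()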

beraldoleal (Member) left a review:

Hi @pevogam, sorry for the delay in reviewing this. This looks promising, and I'm looking forward to the non-draft version. Nice work here.

I just left a few comments for you.

import tempfile

try:
    import lxc
    LXC_AVAILABLE = True
except ImportError:
    LXC_AVAILABLE = False
beraldoleal (Member) commented on the diff:

Unless I'm missing something, IIUC, lxc is necessary for your code; without it, we can't run it. So, following the same logic as the other imports, I would recommend removing this try..except and adding lxc to the plugin's requirements.txt, with instructions to install it in the plugin's README.

pevogam (Author) replied:

Hi Beraldo, and thank you for taking the time to validate the implementation! Now about this:

This is a good point, but don't you think LXC should be an optional requirement for people that won't use the LXC spawner? For somebody who might want a basic Avocado install with just the process spawner, it might be preferable to have a lightweight installation without building or downloading Podman/LXC dependencies. So this was an approach to keep things optional. What do you think about this idea overall? Of course, it is easier to just import the module and require the pip dependency.

pevogam (Author) replied:

I looked around but didn't see an easy way to specify dependencies for the LXC spawner, as it is just a module and not a plugin with its own setup.py, RPM subpackage, and so on. Currently there are only process and podman listed under the spawners folder, so I thought it more reasonable to simply also provide lxc there. Or is podman optional in some sense, with its own plugin dependencies specified somewhere?

A contributor replied:

Pretty much all other plugins that have Python deps other than setuptools go into optional_plugins. The podman spawner is a different beast because it requires a system binary (to put it simply). The LXC plugin is a bit different because it requires the Python bindings, but not only that. Until we have a better way of describing dependencies such as system binaries, this is OK IMO.

msg = 'LXC python bindings not available on the system'
runtime_task.status = msg
# we shouldn't reach this point
raise RuntimeError("LXC dependency is missing")
beraldoleal (Member) commented on the diff:

Based on my previous comment, this if becomes unnecessary. Users will get a ModuleNotFoundError: No module named 'lxc' at the beginning of the load.

if not LXC_AVAILABLE:
msg = 'LXC python bindings not available on the system'
runtime_task.status = msg
return False
beraldoleal (Member) commented on the diff:
Same here.

runtime_task.status = msg
return False

container_id = self.cid
beraldoleal (Member) commented on the diff:

I'm missing where self.cid was defined.

pevogam (Author) replied:

This is exactly the point: it is not defined anywhere, since I was discussing reusing containers in the main thread of this PR. You can check there for more details, but essentially the problem is that we want to reuse containers, since some could have 10+ minutes of setup time that cannot be spent on each individual test. What we do at the moment is create the containers as "run slots" (e.g. 3 containers for 3x faster running) and reuse them whenever a worker picks up a new task. This roughly translates to an upper limit of three processes running at the same time.

I can easily move this to the constructor, where it belongs as a container name, but I wanted to first discuss the more global problem with all of you in order to know whether I should just generate a uid here or can actually pass such an argument and control the number of containers from the outside.

pevogam (Author) replied:

Turns out using a constructor is not as easy as I thought, since the spawner is instantiated outside of our runner's scope and an instance is accessed via a dispatcher:

spawner = SpawnerDispatcher(job.config, job)[spawner_name].obj

Then I started thinking about doing something like this

diff --git a/avocado/plugins/spawners/lxc.py b/avocado/plugins/spawners/lxc.py
index 09f71b35..200d9950 100644
--- a/avocado/plugins/spawners/lxc.py
+++ b/avocado/plugins/spawners/lxc.py
@@ -122,6 +122,9 @@ class LXCSpawner(Spawner, SpawnerMixin):
         distro = self.config.get('spawner.lxc.distro')
         release = self.config.get('spawner.lxc.release')
         arch = self.config.get('spawner.lxc.arch')
+        # LXC differs from Podman in the fact that we can reuse heavier complete subsystem
+        # and not just lightweight application-centric containers if necessary
+        container_id = self.config.get('spawner.lxc.cid')
 
         if not LXC_AVAILABLE:
             msg = 'LXC python bindings not available on the system'

but this still doesn't make much sense, since the container ID can differ per test and doesn't sound at all like something that belongs in the overall job config. Essentially what I need is to provide a reusable container ID (container slot) via a config or some task attribute. The way this is done with Podman right now is not reusable, since the container ID is returned from a freshly created container each time and thus retrieved within the scope of spawning a task. @richtja @willianrampazzo do you have a good suggestion on the best approach here to provide the spawner with some test-specific configuration?
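For illustration, a self-contained sketch of such slot bookkeeping (the SlotPool name and the spawner.lxc.slots key mentioned below are assumptions, not the final implementation):

    import asyncio

    class SlotPool:
        """Bookkeeping for reusable container slots borrowed by workers."""

        def __init__(self, names):
            self._free = asyncio.Queue()
            for name in names:
                self._free.put_nowait(name)

        async def acquire(self):
            # Blocks until a slot is returned, capping parallelism at len(names)
            return await self._free.get()

        def release(self, name):
            self._free.put_nowait(name)

The spawner could fill such a pool once from a list-valued setting (say, spawner.lxc.slots) and acquire a slot per spawned task instead of creating a fresh container each time.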

pevogam (Contributor, Author) commented Mar 2, 2023

I also adapted all static checks with the exception of the lint ones like:

diff --git a/avocado/plugins/spawners/lxc.py b/avocado/plugins/spawners/lxc.py
index 233e4ba0..b5d59975 100644
--- a/avocado/plugins/spawners/lxc.py
+++ b/avocado/plugins/spawners/lxc.py
@@ -295,11 +295,11 @@ class LXCSpawner(Spawner, SpawnerMixin):
         # right now, limit the check to the LXC availability
         return LXC_AVAILABLE
 
-    async def is_requirement_in_cache(self, runtime_task):
+    async def is_requirement_in_cache(self, runtime_task):  # pylint: disable=W0221
         return False
 
-    async def save_requirement_in_cache(self, runtime_task):
+    async def save_requirement_in_cache(self, runtime_task):  # pylint: disable=W0221
         pass
 
-    async def update_requirement_cache(self, runtime_task, result):
+    async def update_requirement_cache(self, runtime_task, result):  # pylint: disable=W0221
         pass

I wonder why podman violated the default spawner interface and disregarded the linter errors - is the interface meant to change soon (in which case the above diff will make more sense here too)? Or is this just a temporary change on the side of podman (in which case I'd best revert to the static methods)?

pevogam (Contributor, Author) commented Mar 2, 2023

Finally, if you are interested in a bash script to create your own LXC containers (like the custom hook above), you can find a more or less minimal sample at https://github.com/intra2net/avocado-i2n/blob/master-ci/selftests/integration/container.sh. You could use it for your own experimentation with this; it is an even smaller script if you drop the local Avocado installation part, and it still gives you a workable container with functioning networking.

richtja (Contributor) commented Mar 6, 2023

> So the unit tests pass now also in the CI, but I get this strange "check.py didn't clean test results" behavior; is this still caused by the new unit tests?
>
>     RESULTS    : PASS 1134 | ERROR 0 | FAIL 0 | SKIP 53 | WARN 0 | INTERRUPT 0 | CANCEL 6
>     JOB HTML   : /home/runner/avocado/job-results/job-2023-03-02T04.29-0d4c2f0/results.html
>     JOB TIME   : 628.15 s
>     check.py didn't clean test results.
>     uncleaned directories:
>     {'job-2023-03-02T04.38-a1a64fa', 'job-2023-03-02T04.38-9054000', 'job-2023-03-02T04.38-f9d1511', 'job-2023-03-02T04.38-9644f5c'}
>     Error: Process completed with exit code 1.

Hi @pevogam, yes, this is caused by your new unit tests. The check verifies that avocado selftests don't affect the ~/avocado/job-results directory; you can fix it by changing the test result directory via run.results_dir in the test config. As an example, you can look into our check.py.
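For illustration, a minimal sketch of the suggested fix, assuming a job-API style unit test (the exact structure of the tests in this PR may differ):

    import tempfile
    import unittest

    from avocado.core.job import Job

    class LXCSpawnerTest(unittest.TestCase):
        def test_job_runs(self):
            # Redirect results away from ~/avocado/job-results so that check.py
            # does not flag uncleaned directories after the selftests run
            with tempfile.TemporaryDirectory() as tmp_dir:
                config = {
                    "resolver.references": ["/bin/true"],
                    "run.results_dir": tmp_dir,
                }
                with Job.from_config(job_config=config) as job:
                    job.run()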

richtja (Contributor) commented Mar 6, 2023

> I wonder why podman violated the default spawner interface and disregarded the linter errors - is the interface meant to change soon (in which case the above diff will make more sense here too)? Or is this just a temporary change on the side of podman (in which case I'd best revert to the static methods)?

We do not plan to change the interface, so I would use the static methods here.
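For reference, reverting to the interface-compliant static form removes the need for the W0221 suppressions altogether; an illustrative fragment (imports and the rest of the class omitted):

    class LXCSpawner(Spawner, SpawnerMixin):

        @staticmethod
        async def is_requirement_in_cache(runtime_task):
            return False

        @staticmethod
        async def save_requirement_in_cache(runtime_task):
            pass

        @staticmethod
        async def update_requirement_cache(runtime_task, result):
            pass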

pevogam (Contributor, Author) commented Mar 7, 2023

@richtja Thanks for the hints! The CI tests and static checks seem to be passing now, with a few problems in Python 3.7 and 3.10.0 that seem to come from other tests not related to the changes here, but feel free to verify I didn't miss something.

pevogam (Contributor, Author) commented Mar 13, 2023

Hi @clebergnu @richtja, could you confirm whether these CI failures are caused by the unit tests I introduced (so I can investigate) or there is another reason for them? Not sure, but it seems to me that the current CI has a lot of intricacies that an outsider has to know before writing more (self)tests for it, which could indicate some excessive complexity in the way it is currently structured.

I also rebased on 101.0 now, which I assume has all its checks passing and should therefore rule out any outdated sources of problems.

pevogam (Contributor, Author) commented Mar 23, 2023

Hi @richtja and @clebergnu, I pushed a change that uses avocado.SkipIf to skip the older Python versions, and all major testing works, but I still get a failure in the coverage and rpmbuild checks due to an inability to skip the tests (?), like

======================================================================
ERROR: 0-LXCSpawnerTest
Checks if free slots could be used from cache.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.2/x64/lib/python3.11/unittest/mock.py", line 1369, in patched
    return func(*newargs, **newkeywargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/avocado/avocado/avocado/core/decorators.py", line 94, in wrapper
    raise core_exceptions.TestSkipError(message)
avocado.core.exceptions.TestSkipError: Not compatible with Python under 3.7.0

as well as an outright insistence on python3.7 resulting from the above in some cases. So how can I skip python 3.6 in the few remaining cases? Why do the coverage and rpmbuild jobs also run the unit tests if this is already done in the other CI workflows above (it seems a bit like GH Actions overuse to me), and in particular, why not use avocado for all runs, including coverage and rpm builds? When do we plan to drop extremely old Python versions like 3.6 and 3.7?

Or should we just disable the unit tests I added? I thought they would be useful to keep the coverage in check, and now they have turned out to be the one thing keeping the pull request from getting merged, because of all the rather unrelated CI rigidity and fragility.

clebergnu (Contributor) commented:

Hi @pevogam ,

I've spent some time with this version, and I have a few questions / requests:

  1. Can we add documentation with references on how to get a system configured for this spawner? This will of course include basic LXC set up, but also the specifics about the Avocado spawner.

  2. Can you share your customization hook (or a simplified version of it) so that we can let users re-use it?

  3. Do you think users will always need to use a customization hook, or deploying Avocado should be considered as part of the task spawning (like in the podman spawner)?

  4. Are you able to use this spawner with the "auto" status server? I've only been able to get jobs running with something like --status-server-disable-auto --status-server-listen=192.168.122.1:9999 --status-server-uri=192.168.122.1:9999.

  5. Are you running tasks that have and use an output_dir? When running a simple exec-test I get:

# avocado run --status-server-disable-auto --status-server-listen=192.168.122.1:9999 --status-server-uri=192.168.122.1:9999 --spawner=lxc -- /bin/true 
JOB ID     : be2ec7cada127d0a969d646d1b7fe7365463421f
JOB LOG    : /root/avocado/job-results/job-2023-04-18T22.54-be2ec7c/job.log
 (1/1) /bin/true: STARTED
 (1/1) /bin/true: ERROR: [Errno 2] No such file or directory: '/root/avocado/job-results/job-2023-04-18T22.54-be2ec7c/test-results/1-_bin_true/stdout' (0.00 s)
RESULTS    : PASS 0 | ERROR 1 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /root/avocado/job-results/job-2023-04-18T22.54-be2ec7c/results.html
JOB TIME   : 1.79 s

Test summary:
/bin/true: ERROR

This is because the LXC implementation seems to be only handling things at the job/host side:

    def create_task_output_dir(self, runtime_task):
        output_dir_path = self.task_output_dir(runtime_task)
        runtime_task.task.setup_output_dir(output_dir_path)

The podman spawner has something similar, but then it maps a volume into the container that is then accessible to the task (and thus to the runnable).

pevogam (Contributor, Author) commented Apr 19, 2023

Hi @clebergnu, you have collected a set of truly wonderful and down-to-the-point questions:


>   1. Can we add documentation with references on how to get a system configured for this spawner? This will of course include basic LXC set up, but also the specifics about the Avocado spawner.

Right now there is no documentation, since we want to get this up and running with general Avocado first (we use the spawner perfectly fine, but with our Avocado I2N plugin (VT tests), which has its own scheduler). I have already shared a minimal example of how to set up a container here, which we can reuse and add to the repo as a pull request once we iron out the kinks there.

>   2. Can you share your customization hook (or a simplified version of it) so that we can let users re-use it?

The customization hook I use is based on the minimal script above, which is more or less a stripped-down version of it, with the dependencies particular to our plugin (at least the packages and larger changes) removed. So I hope it may serve as a good starting point.

>   3. Do you think users will always need to use a customization hook, or deploying Avocado should be considered as part of the task spawning (like in the podman spawner)?

This is a very good question and I assume it relates to the Python avocado eggs and similar deployment. I was actually postponing this conversation for after the pull request is merged, since we will definitely have to do some unification work here. In short though: users might benefit from less setup on their own side, and such automation could be kept optional for users with more elaborate containers (as LXC containers can be entire set-up subsystems and not a small ecosystem around a single app the way podman containers are).

>   4. Are you able to use this spawner with the "auto" status server? I've only been able to get jobs running with something like `--status-server-disable-auto --status-server-listen=192.168.122.1:9999 --status-server-uri=192.168.122.1:9999`.

Another great question - since this PR predates the addition of the "auto" status server, which happened later on, I have not. In fact, I had set up a manual configuration with the two parameters above, similarly to the way you did, and simply stayed with it until now. I am not very familiar with the auto status server functionality and whether it can handle LXC containers; perhaps it could be extended.

>   5. Are you running tasks that have and use an `output_dir`? When running a simple `exec-test` I get: […] This is because the LXC implementation seems to be only handling things at the job/host side […] The podman spawner has something similar, but then it maps a volume into the container that is then accessible to the task (and thus to the runnable).

Indeed, I also tried running some minimal /bin/true tests to stay more in line with avocado's scope, ran into the same problem, and was waiting for some confirmation here so that we could see where the problem is. The LXC spawner copied most of these functions from the podman spawner, and oddly I have no such problems in production. IIUC this output directory is simply the mount-accessible test directory? In that case, the reason it works for me is that we have mounted a partition like /mnt/local/results into all LXC containers; this might be worth adding to the minimal container creation script above. How does that sound?
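For reference, a sketch of how the job-side output directory could be made visible inside a container via the LXC bindings - the bind-mount analogue of podman's volume mapping (container name and paths here are illustrative):

    import lxc

    container = lxc.Container("avocado-slot-1")  # hypothetical reusable slot

    # lxc.mount.entry format: <host path> <path relative to rootfs> <fstype> <options> <dump> <pass>
    container.set_config_item(
        "lxc.mount.entry",
        "/root/avocado/job-results root/avocado/job-results none bind,create=dir 0 0",
    )
    container.save_config()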

Thank you very much for your reply @clebergnu, how would you like to proceed regarding the points above?

clebergnu (Contributor) commented:

> Thank you very much for your reply @clebergnu, how would you like to proceed regarding the points above?

I've created #5666 to keep track of those points. Also, I've experimented with this branch of yours, and hacked a bit to get the CI running green. The result is here.

It's not intended to be used as is, but it contains:

  • Code that I can also suggest here as review comments (let me know if you'd prefer that)
  • The order of commits to avoid regressions (and avoid future issues with bisecting). E.g.: .pylintrc change before the actual "violations".

Also, a lot (if not all) of my commits are not intended to be standalone. Feel free to either take them and squash into yours, or simply use the suggestion or raw code.

I'm hoping this helps, as I'm eager to merge this!

pevogam (Contributor, Author) commented Apr 28, 2023

> I've created #5666 to keep track of those points. Also, I've experimented with this branch of yours, and hacked a bit to get the CI running green. […] I'm hoping this helps, as I'm eager to merge this!

Thanks a lot @clebergnu for going through this effort! I will take a look next week. This is a good cause for wonderful weekend vibes already!

The arguments on why we should do this are well-explained and fully
provided in:

pylint-dev/pylint#2354 (comment)

Signed-off-by: Plamen Dimitrov <[email protected]>
Our container creation and overall Avocado setup relies on an external
fully customizable hook (which may even be a bash script or other
form of container recipe) with the major assumptions that all LXC
containers are identical reusable slots to spawn new tests in.

For instance, preparing an LXC container with full Avocado (VT)
dependencies installed could take 10+ minutes which is unfeasible
to do for each test. In the future we plan to look into LXC state
(snapshot) reuse to leverage this but it goes beyond the current
proposal here.

Signed-off-by: Plamen Dimitrov <[email protected]>
Custom schedulers can still pass a custom spawner handle via the
runtime task but the default avocado state machine will instead be
able to operate without it with the spawner managing the available
and occupied container slots (all of which are also configurable).

Thanks to Cleber for the built-in support for lists in avocado's
core settings, adaptation of the unit tests for avocado runs, as
well as extra mock reset safety for regular python runs.

Signed-off-by: Plamen Dimitrov <[email protected]>
Signed-off-by: Cleber Rosa <[email protected]>
Use avocado's decorators to skip versions of python 3.7 and below
in order to support both the newer mock as well as some enhanced
asyncio behavior.

Signed-off-by: Plamen Dimitrov <[email protected]>
pevogam (Contributor, Author) commented May 2, 2023

@clebergnu I have squashed all changes and updated some of the commit messages -> all tests are passing right now and the CI is finally green. Thanks a lot for your effort to integrate the tests and settings better into the avocado project! The only remaining diff between your branch and mine is the order of the pylint exception, as I follow (and the file seems to follow) lexicographical order. Other than that, I think we are ready to tackle further challenges in #5666 and major spawners in #5621 🚀

clebergnu (Contributor) left a review:

CI is happy, and so am I! Thanks @pevogam!

@clebergnu clebergnu merged commit 2f8775a into avocado-framework:master May 2, 2023
@pevogam pevogam deleted the lxc-spawner branch May 2, 2023 04:07