LXC spawner proof of concept #4158
Conversation
Hi @clebergnu, @beraldoleal, @willianrampazzo, this draft PR is not meant to attract any of your resources. I wanted to push early, however, in order to also debug possible race conditions. In particular, in its current form, if I do
this will lead to an interesting race condition where the status server is not yet started and I get some failing tests as a result. I do notice the status server eventually starting, though, but I think that giving control back to the main loop in self._server_task = await asyncio.start_server(self.cb, host=host, port=port) is the reason the other coroutines progress too far before control is returned to the initialization. Do you think I might be doing the
It is hard to tell, and to ensure, most things within a thread running multiple asyncio coroutines. The correct solution is to make sure that the server is up before going further. I think we may need to introduce a ping command/response to the server, given that some advanced scenarios may put the status server in a different place / machine altogether. What do you think?
I guess it all boils down to whether we are using a simple standard control loop for all coroutines or something more elaborate to support some staging. If certain coroutines remain preconditions for others, we will definitely have to ensure those preconditions are met before we give control back to the main loop. A ping in this particular case is a good way to add extra assertions and fail early with a clearer error message.
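For illustration, a minimal sketch of such a readiness check, assuming a plain TCP status server reachable at a known host/port (the function name and polling interval are made up for this example, and a real ping command/response could replace the bare connection attempt):

```python
import asyncio


async def wait_for_status_server(host, port, timeout=10.0, interval=0.2):
    """Poll the status server port until a connection succeeds or the timeout expires."""
    deadline = asyncio.get_running_loop().time() + timeout
    while True:
        try:
            # a successful connection is treated as "server is up"
            _, writer = await asyncio.open_connection(host, port)
            writer.close()
            await writer.wait_closed()
            return True
        except OSError:
            if asyncio.get_running_loop().time() >= deadline:
                return False
            await asyncio.sleep(interval)
```

The other coroutines would only be scheduled once this returns True, which also gives a clearer failure point when the server never comes up.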
This is awesome @pevogam, I will take a closer look as soon as possible.
Hi all, considering all the changes Avocado sees in this direction, I assume by now this code is extremely out of date?
Hi @pevogam, IMO this is very welcome. I didn't review it because of the draft state, but if you still have interest in pushing this, I will review it during the next days.
Absolutely, this is very welcome! Sorry @pevogam, we got distracted and did not give it the proper level of attention. TBH, I need to revisit my LXC-foo and properly review/test this. I have assigned myself to this issue too to do it ASAP.
No worries, I was just wondering what its status is. It is a draft, yes, because it is a very initial proposal from somebody not directly involved in core avocado development (thus possibly full of potential problems and needing some early reviews to even be considered something worth continuing to work on). Also, despite the draft state, I pinged you for feedback before, just to avoid confusion from it being a draft. But again, this is low priority and I am aware core features must be developed first, so I just wanted to see whether you think it is still legitimate for it to remain a PR or it should be discarded.
I pushed on top of 88.1 to catch up with the various versions, and will run some functional testing on my side in the coming week.
This pull request introduces 1 alert when merging 9b624c9 into a82c5eb - view on LGTM.com new alerts:
Code Climate has analyzed commit b0db4bb and detected 0 issues on this pull request. The test coverage on the diff in this pull request is 36.7% (50% is the threshold). This pull request will bring the total coverage in the repository to 69.9% (-0.1% change). View more on Code Climate.
I only have some spellcheck failures left, which can be easily amended. I also made sure to refresh the implementation a bit with recent changes I could spot on the podman spawner. What I am wondering now is what your opinion is about some of our use cases, which also relate a lot to potential Avocado VT use cases. In particular, we currently do not recreate containers for each test but reuse them across tests. This means that the typical workflow of "create a brand new container for the current test" has potential disadvantages we should either dispel or solve with extended functionality for the LXC spawner, and I need your opinion about it, i.e. whether to implement it or whether we have other ideas. Here are some of the points against recreating LXC containers for each test:
Of course, we should still propose the standard new-container-on-new-test approach, but I wonder if some of these arguments imply that a given list of "container slots" that the state machine workers could operate on would also be welcome here. What is your opinion about this? I do have a working prototype of this with the current code here and our own scheduler (our container setup is currently too complex to migrate to requirements), so some testing was already done, but I wonder if this is the best solution.
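To make the "container slots" idea more concrete, here is a minimal sketch of a pool of reusable, pre-provisioned containers that workers could draw from; the class, slot names, and asyncio-based locking are illustrative assumptions rather than code from this PR:

```python
import asyncio


class ContainerSlotPool:
    """Hand out pre-provisioned container names to workers and take them back afterwards."""

    def __init__(self, slots):
        self._free = list(slots)
        self._lock = asyncio.Lock()

    async def acquire(self):
        async with self._lock:
            # None means all slots are busy and the caller should retry later
            return self._free.pop() if self._free else None

    async def release(self, slot):
        async with self._lock:
            self._free.append(slot)


async def demo():
    # e.g. three reusable containers translate to at most three tasks running in parallel
    pool = ContainerSlotPool(["c101", "c102", "c103"])
    slot = await pool.acquire()   # spawn the task inside the container named `slot`
    await pool.release(slot)      # the slot becomes available for the next task

asyncio.run(demo())
```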
Your comment reminded me about this comment from Requirement Resolver v1: https://github.com/avocado-framework/avocado/pull/4423/files#r585816355 So, I made an effort to shape the Requirement Resolver in a way that, if a runner exists for that kind of requirement, the Requirement Resolver will create a task and will try to fulfill it. This means that as long as there is a runner capable of handling container recipes, this should work without major problems.
This comment also reminds me about this issue: #4458 We should have a new cache mechanism in the future, which may help to decrease the setup time of containers. |
Sounds good and portable enough. I assume for instance we could use a bash script and point to it in the way you describe. Thanks for the hints!
I guess whether this can help really depends on how one understands cache here. As I said above, we have a network of VMs inside each container, and some VMs (e.g. Windows) could be 50GB or more, so I am not sure how caching this across containers would help reduce the big IO unless by caching one really means reusing the container itself.
I remember @clebergnu gave the example of a base container and new containers with required packages installed on top of the base one and then some metadata cached referring to this new container. So, basically, it would be the reuse of a container image. I can imagine it working with virtual machines and snapshots, but I think we need some brainstorming first. |
This sounds like multiple write layers on top of the same container, which could provide it with a clean starting point but also store additional setup on top that is separately accessible. This reminds me further of the LXC states/snapshots and features I proposed in #4375, which is a good place for further brainstorming regarding this. For the time being, however, I still propose we allow container reuse in the current PR until we have a better solution fully implemented.
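As food for that brainstorming, a rough sketch of what snapshot-based reuse could look like with the python3-lxc bindings (the API calls are taken from python3-lxc, but the helper itself and the rollback policy are assumptions):

```python
import lxc


def reuse_container(name, snapshot_name=None):
    """Reuse an existing container, optionally rolling it back to a clean snapshot first."""
    container = lxc.Container(name)
    if not container.defined:
        raise RuntimeError(f"container {name} has not been created yet")
    if snapshot_name is not None:
        if container.running:
            container.stop()
        # restore the pre-provisioned state instead of recreating the container from scratch
        if not container.snapshot_restore(snapshot_name):
            raise RuntimeError(f"could not restore snapshot {snapshot_name} of {name}")
    if not container.running:
        container.start()
        container.wait("RUNNING", 60)
    return container
```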
Hi @pevogam, sorry for the delay reviewing this. This looks promising, and I'm looking forward to the non-draft version. Nice work here.
I just left a few comments for you.
import tempfile

try:
    import lxc
Unless I'm missing something, IIUC, lxc is necessary for your code. Without it, we can't run it. So, following the same logic as the other imports, I would recommend removing this try..except here and adding lxc to the plugin's requirements.txt, with instructions to install it in the plugin's README.
Hi Beraldo and thank you for taking the time to validate the implementation! Now about this:
This is a good point, but don't you think LXC should be an optional requirement for people who won't use the LXC spawner? For somebody who might want a basic Avocado install and just uses process spawners, it might be preferable to only have a lightweight Avocado install without building or downloading Podman/LXC dependencies. So this was an approach to keep things optional. What do you think about this idea overall? Of course, it is easier to just import the module and require the pip dependency.
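For reference, the optional-import approach under discussion boils down to something like this sketch (variable naming as I recall it, not necessarily the exact code in the PR):

```python
try:
    import lxc
    LXC_AVAILABLE = True
except ImportError:
    lxc = None  # the spawner methods then report a missing-bindings status instead of crashing
    LXC_AVAILABLE = False
```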
I looked around but didn't see an easy way to specify dependencies for the LXC spawner as it is just a module and not a plugin with its own setup.py, RPM subpackage, and so on. Currently there are only process and podman listed under the spawners folder, so I thought it more reasonable to simply also provide lxc there. Or is podman optional in some sense, with its own plugin dependencies specified somewhere?
Pretty much all other plugins that have Python deps other than setuptools go into optional_plugins. The podman spawner is a different beast because it requires a system binary (to put it simply). The LXC plugin is a bit different because it requires the Python bindings, but not only that. Until we have a better way of describing dependencies such as system binaries, this is OK IMO.
avocado/plugins/spawners/lxc.py
msg = 'LXC python bindings not available on the system'
runtime_task.status = msg
# we shouldn't reach this point
raise RuntimeError("LXC dependency is missing")
Based on my previous comment, this if becomes unnecessary. Users will get a ModuleNotFoundError: No module named 'lxc' at the beginning of this load.
if not LXC_AVAILABLE:
    msg = 'LXC python bindings not available on the system'
    runtime_task.status = msg
    return False
Same here.
avocado/plugins/spawners/lxc.py
runtime_task.status = msg
return False

container_id = self.cid
I'm missing where self.cid was defined.
This is exactly the point: it is not defined anywhere, since I was discussing reusing containers in the main thread of this PR. For more details you can check it there, but essentially the problem is that we want to reuse containers, since some could have 10+ minutes of setup time that cannot be spent on each individual test. What we do at the moment is create the containers as "run slots" (e.g. 3 containers for 3x faster running) and reuse these when a worker picks up a new task to run. This roughly translates to having three processes running at the same time as an upper limit.
I can easily move this to the constructor where it belongs as the container name, but I wanted to first discuss the more global problem with all of you in order to know whether I should just generate a uid here or can actually pass such an argument and control the number of containers from the outside.
Turns out using a constructor is not as easy as I thought, since the spawner is instantiated outside of our runner's scope and an instance is accessed via a dispatcher:
spawner = SpawnerDispatcher(job.config, job)[spawner_name].obj
Then I started thinking about doing something like this:
diff --git a/avocado/plugins/spawners/lxc.py b/avocado/plugins/spawners/lxc.py
index 09f71b35..200d9950 100644
--- a/avocado/plugins/spawners/lxc.py
+++ b/avocado/plugins/spawners/lxc.py
@@ -122,6 +122,9 @@ class LXCSpawner(Spawner, SpawnerMixin):
         distro = self.config.get('spawner.lxc.distro')
         release = self.config.get('spawner.lxc.release')
         arch = self.config.get('spawner.lxc.arch')
+        # LXC differs from Podman in the fact that we can reuse heavier complete subsystem
+        # and not just lightweight application-centric containers if necessary
+        container_id = self.config.get('spawner.lxc.cid')
         if not LXC_AVAILABLE:
             msg = 'LXC python bindings not available on the system'
but this still doesn't make much sense, since the container ID can be different per test and doesn't sound at all like something that belongs in the overall job config. Essentially what I need is to provide a reusable container ID (container slot) via a config or some task attribute. The way this is done with Podman right now is not reusable, since the container ID is returned from a freshly created container each time and is thus retrieved within the scope of spawning a task. @richtja @willianrampazzo do you have a good suggestion on what the best approach is here to provide the spawner with some test-specific configuration?
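One possible direction, sketched under the assumption that the option would be registered through Avocado's settings like the other spawner.lxc.* values (the slots key and the busy-tracking helper are hypothetical): a list-valued option holds the reusable container names, and the spawner picks a free one per task instead of receiving a per-test container ID.

```python
from avocado.core.settings import settings

# hypothetical option: a list of reusable container names ("slots")
settings.register_option(section="spawner.lxc",
                         key="slots",
                         key_type=list,
                         default=[],
                         help_msg="Names of reusable LXC containers to spawn tasks in")


def pick_container_slot(config, busy_slots):
    """Return the first configured slot not currently occupied by another task, if any."""
    for slot in config.get("spawner.lxc.slots"):
        if slot not in busy_slots:
            return slot
    return None
```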
I also adapted all static checks with the exception of the lint ones like:

diff --git a/avocado/plugins/spawners/lxc.py b/avocado/plugins/spawners/lxc.py
index 233e4ba0..b5d59975 100644
--- a/avocado/plugins/spawners/lxc.py
+++ b/avocado/plugins/spawners/lxc.py
@@ -295,11 +295,11 @@ class LXCSpawner(Spawner, SpawnerMixin):
         # right now, limit the check to the LXC availability
         return LXC_AVAILABLE

-    async def is_requirement_in_cache(self, runtime_task):
+    async def is_requirement_in_cache(self, runtime_task):  # pylint: disable=W0221
         return False

-    async def save_requirement_in_cache(self, runtime_task):
+    async def save_requirement_in_cache(self, runtime_task):  # pylint: disable=W0221
         pass

-    async def update_requirement_cache(self, runtime_task, result):
+    async def update_requirement_cache(self, runtime_task, result):  # pylint: disable=W0221
         pass

I wonder here why podman violated the default spawner interface and disregarded the linter errors - is the interface meant to change soon (in which case the above diff will make more sense here too)? Or is this just a temporary change on the side of podman (in which case I had best revert to the static methods)?
Finally, if you are interested in a bash script to create your own LXC containers (like the custom hook above), you can find a more or less minimal sample at https://github.com/intra2net/avocado-i2n/blob/master-ci/selftests/integration/container.sh. You could use it for your own experimentation with this; it is an even smaller script if you drop the avocado local installation part, and it still yields a workable container with functioning networking.
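For completeness, roughly the same container bootstrap can be done with the python3-lxc bindings instead of a shell script; a sketch assuming the generic "download" template and placeholder distro/release/arch values (mirroring the spawner.lxc.* options discussed above):

```python
import lxc


def create_container_slot(name, dist="fedora", release="36", arch="amd64"):
    """Create and boot a container that can later be reused as a test slot."""
    container = lxc.Container(name)
    if container.defined:
        return container  # already provisioned, just reuse it
    created = container.create(template="download",
                               args={"dist": dist, "release": release, "arch": arch})
    if not created:
        raise RuntimeError(f"failed to create container {name}")
    if not container.start():
        raise RuntimeError(f"failed to start container {name}")
    container.wait("RUNNING", 60)
    return container
```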
Hi @pevogam, yes, this is caused by your new unit tests. It checks if avocado selftests doesn't affect
We do not plan to change the interface, so I would use the static methods here.
@richtja Thanks for the hints! The CI tests and static checks seem to be passing now, with a few problems in Python 3.7 and 3.10.0 that seem to come from other tests not related to the changes here, but feel free to verify I didn't miss something.
Hi @clebergnu @richtja, could you confirm whether these CI failures are caused by the unit tests I introduced (so I can investigate) or there is another reason for them? Not sure, but it seems to me that the current CI has a lot of intricacies that an outsider has to know before writing more (self)tests for it, which could indicate some excessive complexity in the way it is currently structured. I also rebased on 101.0 now, which I assume has all its checks passing and should therefore rule out any outdated sources of problems.
Hi @richtja and @clebergnu, I pushed a change that uses
as well as the outright insistence on python3.7 resulting from the above in some cases. So how can I skip python 3.6 in the few remaining cases? Why do the cases related to coverage and rpm builds also run the unit tests if this is already done in other CI workflows above (seems a bit like GH actions overuse to me), and in particular why not use avocado for all runs, including coverage and rpm builds? When do we plan to drop extremely old python versions like 3.6 and 3.7? Or should we just disable the unit tests I added? I thought they would be useful to keep the coverage in check, and now they have turned out to be the one thing keeping the pull request from getting merged because of all the rather unrelated CI rigidness and fragility.
Hi @pevogam, I've spent some time with this version, and I have a few questions / requests:
This is because the LXC implementation seems to be only handling things at the job/host side:

def create_task_output_dir(self, runtime_task):
    output_dir_path = self.task_output_dir(runtime_task)
    runtime_task.task.setup_output_dir(output_dir_path)

The podman spawner has something similar, but then it maps a volume into the container that is then accessible to the task (and thus to the runnable).
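If the LXC spawner were to follow the podman approach of exposing the task output directory inside the container, a bind mount in the container configuration is one option; a rough sketch with the python3-lxc bindings (paths and mount options are assumptions to be verified):

```python
import lxc


def share_task_output_dir(container_name, host_output_dir, guest_output_dir):
    """Bind-mount the host-side task output directory into the container."""
    container = lxc.Container(container_name)
    # lxc.mount.entry uses fstab-like syntax; the in-container path is relative to the rootfs
    entry = "%s %s none bind,create=dir 0 0" % (host_output_dir,
                                                guest_output_dir.lstrip("/"))
    container.set_config_item("lxc.mount.entry", entry)
    container.save_config()  # the mount takes effect the next time the container starts
```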
Hi @clebergnu, you have collected a set of truly wonderful and to-the-point questions:
Right now there is no documentation, since we want to get this up and running with general avocado first (we use the spawner perfectly fine, but with our Avocado I2N plugin (VT tests), which has its own scheduler). I have already shared a minimal option on how to set up a container here, which we can reuse and add to the repo as a pull request once we iron out the kinks there.
The customization hook I use is based on the minimal script above which is more or less a stripped-down version removing the dependencies particular to our plugin (at least the packages and larger changes). So I hope it may serve as a good starting point.
This is a very good question and I assume it relates to the python avocado eggs and similar deployment. I was actually postponing this conversation for after the pull request is merged, since we will definitely have to do some unifying work here. In short though: the users might benefit from less setup on their own side, and such automation could be kept optional for users with more elaborate containers (as LXC containers could be entire set-up subsystems and not a small ecosystem around a single app the way podman containers are).
Another great question - since this PR predates the addition of the "auto" status server, which happened later on, I have not. In fact I had set up a manual configuration with the two parameters above, similarly to the way you did, and simply stayed with it until now. I am not very familiar with this auto status server functionality and whether it can handle LXC containers; perhaps it could be extended.
Indeed, I also tried running some minimal /bin/true tests to stay more in line with avocado's scope and ran into the same problem, and was waiting for some confirmation here so that we could see where the problem is. The LXC spawner copied most of these functions from the podman spawner and, oddly, I have no such problems in production. IIUC this output directory is simply the mount-accessible test directory? In this case the reason it works for me is because we have mounted a partition like
Thank you very much for your reply @clebergnu, how would you like to proceed regarding the points above?
I've created #5666 to keep track of those points. Also, I've experimented with this branch of yours, and hacked a bit to get the CI running green. The result is here. It's not intended to be used as is, but it contains:
Also, a lot (if not all) of my commits are not intended to be standalone. Feel free to either take them and squash them into yours, or simply use the suggestions or raw code. I'm hoping this helps, as I'm eager to merge this!
Thanks a lot @clebergnu for going through this effort! I will take a look next week. This is a good cause for wonderful weekend vibes already!
The arguments on why we should do this are well-explained and fully provided in: pylint-dev/pylint#2354 (comment) Signed-off-by: Plamen Dimitrov <[email protected]>
Our container creation and overall Avocado setup relies on an external fully customizable hook (which may even be a bash script or other form of container recipe) with the major assumptions that all LXC containers are identical reusable slots to spawn new tests in. For instance, preparing an LXC container with full Avocado (VT) dependencies installed could take 10+ minutes which is unfeasible to do for each test. In the future we plan to look into LXC state (snapshot) reuse to leverage this but it goes beyond the current proposal here. Signed-off-by: Plamen Dimitrov <[email protected]>
Custom schedulers can still pass a custom spawner handle via the runtime task but the default avocado state machine will instead be able to operate without it with the spawner managing the available and occupied container slots (all of which are also configurable). Thanks to Cleber for the built-in support for lists in avocado's core settings, adaptation of the unit tests for avocado runs, as well as extra mock reset safety for regular python runs. Signed-off-by: Plamen Dimitrov <[email protected]> Signed-off-by: Cleber Rosa <[email protected]>
Use avocado's decorators to skip versions of python 3.7 and below in order to support both the newer mock as well as some enhanced asyncio behavior. Signed-off-by: Plamen Dimitrov <[email protected]>
@clebergnu I have squashed all changes and updated some of the commit messages -> all tests are passing right now and the CI is finally green. Thanks a lot for your effort to integrate the tests and settings better into the avocado project! The only remaining diff between your and my branch is the order of the pylint exception, as I follow (and the file seems to follow) lexicographical order. Other than that I think we are ready to tackle further challenges in #5666 and major spawners in #5621 🚀
CI is happy, and so am I! Thanks @pevogam!
This is just a very early implementation of an LXC spawner for post 82 LTS,
submitted to gather any opinions and possibly clarify some questions.