nrunner: introduce status server auto mode #4879

clebergnu · 2021-08-19T15:01:21Z

This introduces a mode under which the status server will set up its own "listen" and "uri" values.

The "listen" value is of course related to where the status server will be listening for status updates, from clients (tasks) which are given the "uri" value.

Instead of a TCP based socket, the auto mode defaults to using a UNIX domain socket created within the job's own directory, which better avoids clashes (no two jobs should be using the same job directory anyway).

This introduces a mode under which the status server will set up its own "listen" and "uri" values. The "listen" value is of course related to where the status server will be listening for status updates, from clients (tasks) which are given the "uri" value. Instead of a TCP based socket, the auto mode defaults to using a UNIX domain socket created within the job's own directory, which better avoids clashes (no two jobs should be using the same job directory anyway). One way to see the difference in action is to run: $ nc -l 127.0.0.1 8888 And on another session: $ avocado run --test-runner=nrunner --nrunner-status-server-auto /bin/true And compare it to: $ avocado run --test-runner=nrunner /bin/true But not using a resource available system wide, but one inside the job's directory, the possibility of clashes are virtually non-existing. Signed-off-by: Cleber Rosa <[email protected]>

Users often complain about jobs with no test results. There are of course many possible causes, but a common one is a clash and failure of communication with the status server. Because it's currently dependent on a fixed TCP port, clashes are quite easy, especially when running multiple jobs at once (or nested). Also, a good number of tests have to go through the trouble of setting up custom status server to avoid clashes. This removes those custom and simply relies on the now automatic status servers. This enables the automatic status server, based on a much more private UNIX domain socket, so that clashes become virtually impossible. Signed-off-by: Cleber Rosa <[email protected]>

willianrampazzo · 2021-08-19T18:04:55Z

FYI, I was testing it and got:

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
JOB ID     : 401574a8309a16a27e3efec4806747b881663ad9
JOB LOG    : /home/wrampazz/avocado/job-results/job-2021-08-19T15.02-401574a/job.log

Avocado crashed: AttributeError: 'NoneType' object has no attribute 'close'
Traceback (most recent call last):

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/job.py", line 635, in run_tests
    summary |= suite.run(self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/suite.py", line 333, in run
    return self.runner.run_suite(job, self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/plugins/runner_nrunner.py", line 360, in run_suite
    self.status_server.close()

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/status/server.py", line 53, in close
    self._server_task.close()

AttributeError: 'NoneType' object has no attribute 'close'

Please include the traceback info and command line used on your bug report
Report bugs visiting https://github.com/avocado-framework/avocado/issues/new

These are the status server messages:

$ nc -l 127.0.0.1 8888
{"status": "started", "time": 113424.944129556, "output_dir": "/var/tmp/.avocado-task-u911hv13", "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"status": "running", "time": 113424.955764594, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stdout", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955809164, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stderr", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955856262, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"result": "pass", "returncode": 0, "status": "finished", "time": 113424.955890854, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}

clebergnu · 2021-08-19T18:10:55Z

FYI, I was testing it and got:

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
JOB ID     : 401574a8309a16a27e3efec4806747b881663ad9
JOB LOG    : /home/wrampazz/avocado/job-results/job-2021-08-19T15.02-401574a/job.log

Avocado crashed: AttributeError: 'NoneType' object has no attribute 'close'
Traceback (most recent call last):

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/job.py", line 635, in run_tests
    summary |= suite.run(self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/suite.py", line 333, in run
    return self.runner.run_suite(job, self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/plugins/runner_nrunner.py", line 360, in run_suite
    self.status_server.close()

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/status/server.py", line 53, in close
    self._server_task.close()

AttributeError: 'NoneType' object has no attribute 'close'

Please include the traceback info and command line used on your bug report
Report bugs visiting https://github.com/avocado-framework/avocado/issues/new

You got those when not using the auto mode, right? If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

These are the status server messages:

$ nc -l 127.0.0.1 8888
{"status": "started", "time": 113424.944129556, "output_dir": "/var/tmp/.avocado-task-u911hv13", "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"status": "running", "time": 113424.955764594, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stdout", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955809164, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stderr", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955856262, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"result": "pass", "returncode": 0, "status": "finished", "time": 113424.955890854, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}

willianrampazzo · 2021-08-19T18:35:58Z

You got those when not using the auto mode, right?

Right.

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true

I don't think I got the idea of the option here.

clebergnu · 2021-08-19T18:40:12Z

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true

I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:

$ avocado run --test-runner=nrunner /bin/true`

With that, even if you have something using the status server port, you'd be fine.

clebergnu · 2021-08-19T18:46:33Z

You got those when not using the auto mode, right?

Right.

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true

I don't think I got the idea of the option here.

Another example that shows clash-free behavior is:

$ avocado run --test-runner=nrunner examples/jobs/nrunner.py

Resulting in:

JOB ID     : 6027b4057f0ff555324937691a18ec2f35cbc7bb
JOB LOG    : /home/cleber/avocado/job-results/job-2021-08-19T14.45-6027b40/job.log
 (1/1) examples/jobs/nrunner.py: STARTED
 (1/1) examples/jobs/nrunner.py: PASS (4.01 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/cleber/avocado/job-results/job-2021-08-19T14.45-6027b40/results.html
JOB TIME   : 5.20 s

willianrampazzo · 2021-08-19T18:50:20Z

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:
$ avocado run --test-runner=nrunner /bin/true`
With that, even if you have something using the status server port, you'd be fine.

I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

willianrampazzo · 2021-08-19T18:55:32Z

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:
$ avocado run --test-runner=nrunner /bin/true`
With that, even if you have something using the status server port, you'd be fine.
I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

Okay, forget about it, I ran the command multiple times again, with and without the command-line option and it is not failing anymore. Don't know what happened.

clebergnu · 2021-08-19T18:55:59Z

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:
$ avocado run --test-runner=nrunner /bin/true`
With that, even if you have something using the status server port, you'd be fine.
I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

The purpose of --nrunner-status-server-disable-auto is to allow users to control where they want the status server to listen on. That may be needed on some situations, or we may find out that users don't care about that if Avocado does a good job. I'm adding that option because I don't think I have the complete picture.

Now, it's not --nrunner-status-server-disable-auto that crashes Avocado, it's the current code and behavior that does that. Sure, we can better handle the crash and provide an error message, etc. But the crash only when the old behavior is used, and there's a port clash. Just to make sure I'm not missing that important point: that happens on the master branch right now (without --nrunner-status-server-disable-auto).

willianrampazzo · 2021-08-19T19:33:49Z

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:
$ avocado run --test-runner=nrunner /bin/true`
With that, even if you have something using the status server port, you'd be fine.
I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?
The purpose of --nrunner-status-server-disable-auto is to allow users to control where they want the status server to listen on. That may be needed on some situations, or we may find out that users don't care about that if Avocado does a good job. I'm adding that option because I don't think I have the complete picture.

Now, it's not --nrunner-status-server-disable-auto that crashes Avocado, it's the current code and behavior that does that. Sure, we can better handle the crash and provide an error message, etc. But the crash only when the old behavior is used, and there's a port clash. Just to make sure I'm not missing that important point: that happens on the master branch right now (without --nrunner-status-server-disable-auto).

Thanks for the explanation. I was testing it with nc listening to the same port as Avocado, so, causing a clash. My fault :) sorry for the noise.

richtja

LGTM, thanks

beraldoleal

Hi @clebergnu, besides my squash comments. I do have one question:

How this is going to work with --nrunner-spawner='podman' ? Aren't the UNIX sockets a local thing?

I'm getting this result:

 avocado run --test-runner=nrunner  --nrunner-spawner=podman /bin/true
JOB ID     : fc75cc05d89ffadf3871f61cc18ec20482d419b7
JOB LOG    : /home/local/avocado/job-results/job-2021-08-20T10.04-fc75cc0/job.log
RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/local/avocado/job-results/job-2021-08-20T10.04-fc75cc0/results.html
JOB TIME   : 45.74 s

beraldoleal · 2021-08-20T12:55:45Z

avocado/plugins/runner_nrunner.py

                    'status server.')
        settings.register_option(section=section,
                                 key='status_server_auto',
-                                 default=False,
+                                 default=True,


I don't mind if those changes are squashed with the previous commit.

I kept the two apart, because I felt like the switch to auto mode, would be more controversial than the introduction of the feature itself.

But, seems I was wrong. I'll squash them on v2.

beraldoleal · 2021-08-20T12:56:19Z

avocado/plugins/runner_nrunner.py

-            long_arg='--nrunner-status-server-auto',
-            action='store_true')
+            long_arg='--nrunner-status-server-disable-auto',
+            action='store_false')


clebergnu · 2021-08-20T13:26:18Z

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.

On the short term, we can look into:

Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)
Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

How does that sound?

clebergnu · 2021-08-20T13:46:28Z

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.

On the short term, we can look into:

Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)

^ Not a good idea, tried and failed. According to this, it requires a privileged container, so this would work:

diff --git a/avocado/plugins/spawners/podman.py b/avocado/plugins/spawners/podman.py
index 14b4b34df..9b0745a83 100644
--- a/avocado/plugins/spawners/podman.py
+++ b/avocado/plugins/spawners/podman.py
@@ -87,6 +87,15 @@ class PodmanSpawner(Spawner, SpawnerMixin):
         return out.startswith(b'Up ')
 
     async def spawn_task(self, runtime_task):
+
+        mount_status_server_socket = False
+        mounted_status_server_socket = '/tmp/.status_server.sock'
+        status_server_uri = runtime_task.task.status_services[0].uri
+        if ':' not in status_server_uri:
+            # a unix domain socket is being used
+            mount_status_server_socket = True
+            runtime_task.task.status_services[0].uri = mounted_status_server_socket
+
         task = runtime_task.task
         entry_point_cmd = '/tmp/avocado-runner'
         entry_point_args = task.get_command_args()
@@ -109,6 +118,8 @@ class PodmanSpawner(Spawner, SpawnerMixin):
             proc = await asyncio.create_subprocess_exec(
                 podman_bin, "create",
                 "--net=host",
+                "--privileged",
+                "-v", "%s:%s" % (status_server_uri, '/tmp/.status_server.sock'),
                 entry_point_arg,
                 image,
                 stdout=asyncio.subprocess.PIPE,

I'm not sure how I feel about it though.

Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

How does that sound?

willianrampazzo · 2021-08-20T13:53:01Z

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.
On the short term, we can look into:

Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)

^ Not a good idea, tried and failed. According to this, it requires a privileged container, so this would work:

diff --git a/avocado/plugins/spawners/podman.py b/avocado/plugins/spawners/podman.py
index 14b4b34df..9b0745a83 100644
--- a/avocado/plugins/spawners/podman.py
+++ b/avocado/plugins/spawners/podman.py
@@ -87,6 +87,15 @@ class PodmanSpawner(Spawner, SpawnerMixin):
         return out.startswith(b'Up ')
 
     async def spawn_task(self, runtime_task):
+
+        mount_status_server_socket = False
+        mounted_status_server_socket = '/tmp/.status_server.sock'
+        status_server_uri = runtime_task.task.status_services[0].uri
+        if ':' not in status_server_uri:
+            # a unix domain socket is being used
+            mount_status_server_socket = True
+            runtime_task.task.status_services[0].uri = mounted_status_server_socket
+
         task = runtime_task.task
         entry_point_cmd = '/tmp/avocado-runner'
         entry_point_args = task.get_command_args()
@@ -109,6 +118,8 @@ class PodmanSpawner(Spawner, SpawnerMixin):
             proc = await asyncio.create_subprocess_exec(
                 podman_bin, "create",
                 "--net=host",
+                "--privileged",
+                "-v", "%s:%s" % (status_server_uri, '/tmp/.status_server.sock'),
                 entry_point_arg,
                 image,
                 stdout=asyncio.subprocess.PIPE,

I'm not sure how I feel about it though.

Looks like a quick fix/workaround until we come up with the implementation for the second option, which I think is a valid use case for other spawners too.

Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

How does that sound?

beraldoleal · 2021-08-20T14:00:49Z

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.

On the short term, we can look into:
* Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)

This seems to be a podman only solution, and maybe we could think a little bit further for a more generic solution (I'm thinking about future possible spawners)

* Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

Hum... Not sure if I understood you here.

But one "ideal" scenario, this should still be done via TCP. And to avoid collisions maybe some kind of token/identifier of the client to avoid messing the messages. Again, this could also be solved with a PUB/SUB mechanism and different channels.

How does that sound?

Sure, yes, lets document, move forward and try to investigate how to work with a more robust solution for all spawners.

Now, thinking about the default behavior, since this is not working with podman spawner, maybe we should discard the second commit changes that makes this default. Makes sense?

clebergnu · 2021-08-20T14:01:11Z

I'm not sure how I feel about it though.

Looks like a quick fix/workaround until we come up with the implementation for the second option, which I think is a valid use case for other spawners too.

And TBH we can trade (drop) the --net=host for the privileged + mount when using UNIX domain sockets.

beraldoleal · 2021-08-20T14:02:38Z

Ok, just found the workaround solution exporting to the spawner. So yes, I'm ok with using it, but IMO we still need a more generic solution here.

clebergnu added the nrunner (previously nrun2run) label Aug 19, 2021

clebergnu added this to the #91(Thelma & Louise) milestone Aug 19, 2021

beraldoleal self-requested a review August 19, 2021 15:08

clebergnu force-pushed the nrunner_status_server_auto branch from dd01e5b to c470b8c Compare August 19, 2021 16:20

clebergnu added 2 commits August 19, 2021 13:20

clebergnu force-pushed the nrunner_status_server_auto branch from c470b8c to 6a2c45e Compare August 19, 2021 17:27

willianrampazzo self-requested a review August 19, 2021 18:00

willianrampazzo approved these changes Aug 19, 2021

View reviewed changes

richtja approved these changes Aug 20, 2021

View reviewed changes

beraldoleal requested changes Aug 20, 2021

View reviewed changes

clebergnu mentioned this pull request Aug 20, 2021

nrunner: introduce status server auto mode [v2] #4885

Merged

clebergnu closed this Aug 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nrunner: introduce status server auto mode #4879

nrunner: introduce status server auto mode #4879

clebergnu commented Aug 19, 2021 •

edited

Loading

willianrampazzo commented Aug 19, 2021

clebergnu commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

clebergnu commented Aug 19, 2021

clebergnu commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

clebergnu commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

richtja left a comment

beraldoleal left a comment

beraldoleal Aug 20, 2021

clebergnu Aug 20, 2021

beraldoleal Aug 20, 2021

clebergnu commented Aug 20, 2021

clebergnu commented Aug 20, 2021 •

edited

Loading

willianrampazzo commented Aug 20, 2021

beraldoleal commented Aug 20, 2021 •

edited

Loading

clebergnu commented Aug 20, 2021

beraldoleal commented Aug 20, 2021

nrunner: introduce status server auto mode #4879

nrunner: introduce status server auto mode #4879

Conversation

clebergnu commented Aug 19, 2021 • edited Loading

willianrampazzo commented Aug 19, 2021

clebergnu commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

clebergnu commented Aug 19, 2021

clebergnu commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

clebergnu commented Aug 19, 2021

willianrampazzo commented Aug 19, 2021

richtja left a comment

Choose a reason for hiding this comment

beraldoleal left a comment

Choose a reason for hiding this comment

beraldoleal Aug 20, 2021

Choose a reason for hiding this comment

clebergnu Aug 20, 2021

Choose a reason for hiding this comment

beraldoleal Aug 20, 2021

Choose a reason for hiding this comment

clebergnu commented Aug 20, 2021

clebergnu commented Aug 20, 2021 • edited Loading

willianrampazzo commented Aug 20, 2021

beraldoleal commented Aug 20, 2021 • edited Loading

clebergnu commented Aug 20, 2021

beraldoleal commented Aug 20, 2021

clebergnu commented Aug 19, 2021 •

edited

Loading

clebergnu commented Aug 20, 2021 •

edited

Loading

beraldoleal commented Aug 20, 2021 •

edited

Loading