Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nrunner: introduce status server auto mode #4879

Closed

Conversation

clebergnu
Copy link
Contributor

@clebergnu clebergnu commented Aug 19, 2021

This introduces a mode under which the status server will set up its own "listen" and "uri" values.

The "listen" value is of course related to where the status server will be listening for status updates, from clients (tasks) which are given the "uri" value.

Instead of a TCP based socket, the auto mode defaults to using a UNIX domain socket created within the job's own directory, which better avoids clashes (no two jobs should be using the same job directory anyway).

@clebergnu clebergnu added the nrunner (previously nrun2run) label Aug 19, 2021
@clebergnu clebergnu added this to the #91(Thelma & Louise) milestone Aug 19, 2021
@beraldoleal beraldoleal self-requested a review August 19, 2021 15:08
@clebergnu clebergnu force-pushed the nrunner_status_server_auto branch from dd01e5b to c470b8c Compare August 19, 2021 16:20
This introduces a mode under which the status server will set up its
own "listen" and "uri" values.

The "listen" value is of course related to where the status server
will be listening for status updates, from clients (tasks) which are
given the "uri" value.

Instead of a TCP based socket, the auto mode defaults to using a UNIX
domain socket created within the job's own directory, which better
avoids clashes (no two jobs should be using the same job directory
anyway).

One way to see the difference in action is to run:

 $ nc -l 127.0.0.1 8888

And on another session:

 $ avocado run --test-runner=nrunner --nrunner-status-server-auto /bin/true

And compare it to:

 $ avocado run --test-runner=nrunner /bin/true

But not using a resource available system wide, but one inside the
job's directory, the possibility of clashes are virtually
non-existing.

Signed-off-by: Cleber Rosa <[email protected]>
Users often complain about jobs with no test results.  There are of
course many possible causes, but a common one is a clash and failure
of communication with the status server.

Because it's currently dependent on a fixed TCP port, clashes are
quite easy, especially when running multiple jobs at once (or nested).

Also, a good number of tests have to go through the trouble of setting
up custom status server to avoid clashes.  This removes those custom
and simply relies on the now automatic status servers.

This enables the automatic status server, based on a much more private
UNIX domain socket, so that clashes become virtually impossible.

Signed-off-by: Cleber Rosa <[email protected]>
@clebergnu clebergnu force-pushed the nrunner_status_server_auto branch from c470b8c to 6a2c45e Compare August 19, 2021 17:27
@willianrampazzo willianrampazzo self-requested a review August 19, 2021 18:00
@willianrampazzo
Copy link
Contributor

FYI, I was testing it and got:

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
JOB ID     : 401574a8309a16a27e3efec4806747b881663ad9
JOB LOG    : /home/wrampazz/avocado/job-results/job-2021-08-19T15.02-401574a/job.log

Avocado crashed: AttributeError: 'NoneType' object has no attribute 'close'
Traceback (most recent call last):

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/job.py", line 635, in run_tests
    summary |= suite.run(self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/suite.py", line 333, in run
    return self.runner.run_suite(job, self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/plugins/runner_nrunner.py", line 360, in run_suite
    self.status_server.close()

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/status/server.py", line 53, in close
    self._server_task.close()

AttributeError: 'NoneType' object has no attribute 'close'

Please include the traceback info and command line used on your bug report
Report bugs visiting https://github.com/avocado-framework/avocado/issues/new

These are the status server messages:

$ nc -l 127.0.0.1 8888
{"status": "started", "time": 113424.944129556, "output_dir": "/var/tmp/.avocado-task-u911hv13", "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"status": "running", "time": 113424.955764594, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stdout", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955809164, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stderr", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955856262, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"result": "pass", "returncode": 0, "status": "finished", "time": 113424.955890854, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}

@clebergnu
Copy link
Contributor Author

FYI, I was testing it and got:

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
JOB ID     : 401574a8309a16a27e3efec4806747b881663ad9
JOB LOG    : /home/wrampazz/avocado/job-results/job-2021-08-19T15.02-401574a/job.log

Avocado crashed: AttributeError: 'NoneType' object has no attribute 'close'
Traceback (most recent call last):

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/job.py", line 635, in run_tests
    summary |= suite.run(self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/suite.py", line 333, in run
    return self.runner.run_suite(job, self)

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/plugins/runner_nrunner.py", line 360, in run_suite
    self.status_server.close()

  File "/home/wrampazz/src/avocado/avocado.dev/avocado/core/status/server.py", line 53, in close
    self._server_task.close()

AttributeError: 'NoneType' object has no attribute 'close'

Please include the traceback info and command line used on your bug report
Report bugs visiting https://github.com/avocado-framework/avocado/issues/new

You got those when not using the auto mode, right? If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

These are the status server messages:

$ nc -l 127.0.0.1 8888
{"status": "started", "time": 113424.944129556, "output_dir": "/var/tmp/.avocado-task-u911hv13", "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"status": "running", "time": 113424.955764594, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stdout", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955809164, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"type": "stderr", "log": {"__base64_encoded__": ""}, "status": "running", "time": 113424.955856262, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}
{"result": "pass", "returncode": 0, "status": "finished", "time": 113424.955890854, "id": "1-/bin/true", "job_id": "401574a8309a16a27e3efec4806747b881663ad9"}

@willianrampazzo
Copy link
Contributor

You got those when not using the auto mode, right?

Right.

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true

I don't think I got the idea of the option here.

@clebergnu
Copy link
Contributor Author

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true

I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:

$ avocado run --test-runner=nrunner /bin/true`

With that, even if you have something using the status server port, you'd be fine.

@clebergnu
Copy link
Contributor Author

You got those when not using the auto mode, right?

Right.

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?

$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true

I don't think I got the idea of the option here.

Another example that shows clash-free behavior is:

$ avocado run --test-runner=nrunner examples/jobs/nrunner.py 

Resulting in:

JOB ID     : 6027b4057f0ff555324937691a18ec2f35cbc7bb
JOB LOG    : /home/cleber/avocado/job-results/job-2021-08-19T14.45-6027b40/job.log
 (1/1) examples/jobs/nrunner.py: STARTED
 (1/1) examples/jobs/nrunner.py: PASS (4.01 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/cleber/avocado/job-results/job-2021-08-19T14.45-6027b40/results.html
JOB TIME   : 5.20 s

@willianrampazzo
Copy link
Contributor

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:

$ avocado run --test-runner=nrunner /bin/true`

With that, even if you have something using the status server port, you'd be fine.

I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

@willianrampazzo
Copy link
Contributor

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:

$ avocado run --test-runner=nrunner /bin/true`

With that, even if you have something using the status server port, you'd be fine.

I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

Okay, forget about it, I ran the command multiple times again, with and without the command-line option and it is not failing anymore. Don't know what happened.

@clebergnu
Copy link
Contributor Author

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:

$ avocado run --test-runner=nrunner /bin/true`

With that, even if you have something using the status server port, you'd be fine.

I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

The purpose of --nrunner-status-server-disable-auto is to allow users to control where they want the status server to listen on. That may be needed on some situations, or we may find out that users don't care about that if Avocado does a good job. I'm adding that option because I don't think I have the complete picture.

Now, it's not --nrunner-status-server-disable-auto that crashes Avocado, it's the current code and behavior that does that. Sure, we can better handle the crash and provide an error message, etc. But the crash only when the old behavior is used, and there's a port clash. Just to make sure I'm not missing that important point: that happens on the master branch right now (without --nrunner-status-server-disable-auto).

@willianrampazzo
Copy link
Contributor

You got those when not using the auto mode, right?

Right.

Good! I was starting to think I had things upside down here :)

If so, this is to the best of my knowledge not a regression, but just what tampering with the socket that would be used by the status servers results in.

Okay, so what would be the correct way to run the following command without crashing Avocado?
$ avocado run --test-runner=nrunner --nrunner-status-server-disable-auto /bin/true
I don't think I got the idea of the option here.

The correct way is to just use the auto mode, the default after the 2nd commit, meaning:

$ avocado run --test-runner=nrunner /bin/true`

With that, even if you have something using the status server port, you'd be fine.

I got the idea of the auto mode. What I'm wondering is why do we have the command-line option --nrunner-status-server-disable-auto if it crashes avocado when we use it?

The purpose of --nrunner-status-server-disable-auto is to allow users to control where they want the status server to listen on. That may be needed on some situations, or we may find out that users don't care about that if Avocado does a good job. I'm adding that option because I don't think I have the complete picture.

Now, it's not --nrunner-status-server-disable-auto that crashes Avocado, it's the current code and behavior that does that. Sure, we can better handle the crash and provide an error message, etc. But the crash only when the old behavior is used, and there's a port clash. Just to make sure I'm not missing that important point: that happens on the master branch right now (without --nrunner-status-server-disable-auto).

Thanks for the explanation. I was testing it with nc listening to the same port as Avocado, so, causing a clash. My fault :) sorry for the noise.

Copy link
Contributor

@richtja richtja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Copy link
Member

@beraldoleal beraldoleal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @clebergnu, besides my squash comments. I do have one question:

How this is going to work with --nrunner-spawner='podman' ? Aren't the UNIX sockets a local thing?

I'm getting this result:

 avocado run --test-runner=nrunner  --nrunner-spawner=podman /bin/true
JOB ID     : fc75cc05d89ffadf3871f61cc18ec20482d419b7
JOB LOG    : /home/local/avocado/job-results/job-2021-08-20T10.04-fc75cc0/job.log
RESULTS    : PASS 0 | ERROR 0 | FAIL 0 | SKIP 1 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /home/local/avocado/job-results/job-2021-08-20T10.04-fc75cc0/results.html
JOB TIME   : 45.74 s

'status server.')
settings.register_option(section=section,
key='status_server_auto',
default=False,
default=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind if those changes are squashed with the previous commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the two apart, because I felt like the switch to auto mode, would be more controversial than the introduction of the feature itself.

But, seems I was wrong. I'll squash them on v2.

long_arg='--nrunner-status-server-auto',
action='store_true')
long_arg='--nrunner-status-server-disable-auto',
action='store_false')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

@clebergnu
Copy link
Contributor Author

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.

On the short term, we can look into:

  • Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)
  • Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

How does that sound?

@clebergnu
Copy link
Contributor Author

clebergnu commented Aug 20, 2021

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.

On the short term, we can look into:

  • Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)

^ Not a good idea, tried and failed. According to this, it requires a privileged container, so this would work:

diff --git a/avocado/plugins/spawners/podman.py b/avocado/plugins/spawners/podman.py
index 14b4b34df..9b0745a83 100644
--- a/avocado/plugins/spawners/podman.py
+++ b/avocado/plugins/spawners/podman.py
@@ -87,6 +87,15 @@ class PodmanSpawner(Spawner, SpawnerMixin):
         return out.startswith(b'Up ')
 
     async def spawn_task(self, runtime_task):
+
+        mount_status_server_socket = False
+        mounted_status_server_socket = '/tmp/.status_server.sock'
+        status_server_uri = runtime_task.task.status_services[0].uri
+        if ':' not in status_server_uri:
+            # a unix domain socket is being used
+            mount_status_server_socket = True
+            runtime_task.task.status_services[0].uri = mounted_status_server_socket
+
         task = runtime_task.task
         entry_point_cmd = '/tmp/avocado-runner'
         entry_point_args = task.get_command_args()
@@ -109,6 +118,8 @@ class PodmanSpawner(Spawner, SpawnerMixin):
             proc = await asyncio.create_subprocess_exec(
                 podman_bin, "create",
                 "--net=host",
+                "--privileged",
+                "-v", "%s:%s" % (status_server_uri, '/tmp/.status_server.sock'),
                 entry_point_arg,
                 image,
                 stdout=asyncio.subprocess.PIPE,

I'm not sure how I feel about it though.

  • Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

How does that sound?

@willianrampazzo
Copy link
Contributor

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.
On the short term, we can look into:

  • Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)

^ Not a good idea, tried and failed. According to this, it requires a privileged container, so this would work:

diff --git a/avocado/plugins/spawners/podman.py b/avocado/plugins/spawners/podman.py
index 14b4b34df..9b0745a83 100644
--- a/avocado/plugins/spawners/podman.py
+++ b/avocado/plugins/spawners/podman.py
@@ -87,6 +87,15 @@ class PodmanSpawner(Spawner, SpawnerMixin):
         return out.startswith(b'Up ')
 
     async def spawn_task(self, runtime_task):
+
+        mount_status_server_socket = False
+        mounted_status_server_socket = '/tmp/.status_server.sock'
+        status_server_uri = runtime_task.task.status_services[0].uri
+        if ':' not in status_server_uri:
+            # a unix domain socket is being used
+            mount_status_server_socket = True
+            runtime_task.task.status_services[0].uri = mounted_status_server_socket
+
         task = runtime_task.task
         entry_point_cmd = '/tmp/avocado-runner'
         entry_point_args = task.get_command_args()
@@ -109,6 +118,8 @@ class PodmanSpawner(Spawner, SpawnerMixin):
             proc = await asyncio.create_subprocess_exec(
                 podman_bin, "create",
                 "--net=host",
+                "--privileged",
+                "-v", "%s:%s" % (status_server_uri, '/tmp/.status_server.sock'),
                 entry_point_arg,
                 image,
                 stdout=asyncio.subprocess.PIPE,

I'm not sure how I feel about it though.

Looks like a quick fix/workaround until we come up with the implementation for the second option, which I think is a valid use case for other spawners too.

  • Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

How does that sound?

@beraldoleal
Copy link
Member

beraldoleal commented Aug 20, 2021

@beraldoleal good point about the podman spawner. Right now, it depends on a network based status server, so --nrunner-status-server-disable-auto works. I'll document that on a v2.

On the short term, we can look into:

* Passing through the unix domain socket file to the podman container (I'm not sure it's possible, but it's worth investigating)

This seems to be a podman only solution, and maybe we could think a little bit further for a more generic solution (I'm thinking about future possible spawners)

* Coming up with a mechanism in which spawners can declare what kind feature they need -- so that it can be respected by an "auto status server" mechanism

Hum... Not sure if I understood you here.

But one "ideal" scenario, this should still be done via TCP. And to avoid collisions maybe some kind of token/identifier of the client to avoid messing the messages. Again, this could also be solved with a PUB/SUB mechanism and different channels.

How does that sound?

Sure, yes, lets document, move forward and try to investigate how to work with a more robust solution for all spawners.

Now, thinking about the default behavior, since this is not working with podman spawner, maybe we should discard the second commit changes that makes this default. Makes sense?

@clebergnu
Copy link
Contributor Author

I'm not sure how I feel about it though.

Looks like a quick fix/workaround until we come up with the implementation for the second option, which I think is a valid use case for other spawners too.

And TBH we can trade (drop) the --net=host for the privileged + mount when using UNIX domain sockets.

@beraldoleal
Copy link
Member

Ok, just found the workaround solution exporting to the spawner. So yes, I'm ok with using it, but IMO we still need a more generic solution here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nrunner (previously nrun2run)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants