Delay until Docker Swarm recognizes finished service #18270
-
@nullhack @retornam @CodingJonas @CaptainCuddleCube @nalepae @akki Can you take a look? It may be interesting to you.
-
Sure, I'll put it on my list after reviewing the other PR.
-
I'm currently using the subprocess workaround without issues. I'm not saying it is perfect, but I can make a pull request out of it, since it seems like a sufficient solution to me.
-
@CodingJonas Just curious if you asked upstream why …
-
That's a good point. I'll make sure to open an issue on their side today and get around to polishing my changes into at least a WIP merge request, to see if my current solution is feasible :)
-
@CodingJonas you are hitting a known bug in docker-py …
-
I'm not quite sure if this is an Airflow issue?
-
The issue docker/docker-py#931 is still not fixed, and the DockerSwarmOperator still has the 60 s delay before finishing a completed service. Is there an updated workaround you are using to prevent this delay?
-
Sorry for not following up on my workaround; we moved to Kubernetes, so we never finished polishing it properly. I can still share the main code parts we used to try to fix it. Perhaps it helps you!

import requests
from threading import Thread


def _run_service(self):
    ...
    if self.enable_logging:
        service_log_stream = self.cli.service_logs(
            self.service['ID'], follow=True, stdout=True, stderr=True, is_tty=self.tty
        )
        _start_logging_async(self.log, service_log_stream)
    ...


def _start_logging_async(logger, service_log_stream):
    """
    The logging task is blocking and thus stops the operator from recognizing
    in time when the service finishes. Since the logging thread is daemonized,
    it will automatically be terminated once the main script finishes.
    """
    p = Thread(target=_stream_logs_to_output,
               args=(logger, service_log_stream),
               daemon=True)
    p.start()


def _stream_logs_to_output(logger, logs):
    line = ''
    while True:
        try:
            log = next(logs)
        # TODO: Remove this clause once https://github.com/docker/docker-py/issues/931 is fixed
        except requests.exceptions.ConnectionError:
            logger.info("Connection Error while fetching log")
            # The service log stream stopped sending messages; check whether the
            # service has terminated.
            break
        except StopIteration:
            logger.info("StopIteration while fetching log")
            # The service log stream terminated, so stop fetching further logs.
            break
        else:
            try:
                log = log.decode()
            except UnicodeDecodeError:
                continue
            if log == '\n':
                logger.info(line)
                line = ''
            else:
                line += log
    # Flush any remaining partial line.
    if line:
        logger.info(line)

The only addition we made was wrapping the …
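(Not part of the original comment.) The snippet above leaves out how the main thread notices that the service has finished while the daemonized thread streams logs in the background. A minimal sketch of such a check, assuming a docker-py APIClient and the service ID returned when the service was created; the helper name wait_for_service_to_finish is illustrative, not from the original code:

import time

import docker

# States in which a Swarm task no longer makes progress.
TERMINAL_STATES = {'complete', 'failed', 'shutdown', 'rejected', 'orphaned', 'remove'}


def wait_for_service_to_finish(cli: docker.APIClient, service_id: str, poll_interval: float = 1.0) -> bool:
    """Return True if all tasks of the service completed successfully."""
    while True:
        tasks = cli.tasks(filters={'service': service_id})
        states = {task['Status']['State'] for task in tasks}
        if tasks and states <= TERMINAL_STATES:
            return states == {'complete'}
        time.sleep(poll_interval)

Because the logging thread is a daemon, it never needs to be joined; it dies with the operator once a check like this returns.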
-
FYI, docker/docker-py#931 has been fixed. Also, we can remove this exception handling, as per the TODO comment.
-
docker-py version 5.0.3 was released with the fix for docker/docker-py#931: https://github.com/docker/docker-py/releases/tag/5.0.3
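(Not part of the original comment.) A quick, purely illustrative way to confirm that an environment already contains the fixed client, assuming docker-py is importable as docker and packaging is installed:

import docker
from packaging.version import Version

# docker-py >= 5.0.3 includes the fix for docker/docker-py#931.
assert Version(docker.__version__) >= Version('5.0.3'), 'upgrade docker-py to >= 5.0.3'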
-
I opened #18867 as a follow-up.
-
Apache Airflow version: 2.0.0dev
What happened:
When running a DockerSwarmOperator, the service itself finishes, but Airflow only detects the finished service about 60 seconds after it has completed.
The issue lies in this line in airflow/airflow/providers/docker/operators/docker_swarm.py (line 171 at 832593a):

    next(logs)

When removing this line, it works as expected. next(logs) is a blocking call, and for some reason the docker-py library behind this call does not recognize the finished service. After what I think is exactly 60 seconds, the call crashes, which allows the operator to continue.
How to reproduce it:
Any DAG using the DockerSwarmOperator reproduces it, e.g.:
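(The original example DAG is not preserved here.) A minimal sketch that should reproduce the delay; dag_id, image, command, and schedule are placeholders, not from the original report:

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker_swarm import DockerSwarmOperator

# Placeholder DAG; any short-lived service should show the ~60 s delay.
with DAG(
    dag_id='docker_swarm_delay_repro',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    DockerSwarmOperator(
        task_id='short_lived_service',
        image='alpine:3',
        command='echo hello',
        docker_url='unix://var/run/docker.sock',
        auto_remove=True,
        enable_logging=True,  # the delay shows up when log streaming is enabled
    )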
Anything else we need to know:
A workaround I found was to execute the logging in a separate process, while checking for the current status of the service in the main process. Once the service has finished, the logging process can simply be terminated.
The workaround would look something like this:
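(The snippet itself was not preserved in this thread; the following is a rough reconstruction of the described idea, not the reporter's actual code.) It reuses the _stream_logs_to_output helper shown earlier in this thread, and _has_service_terminated stands in for whatever status check the operator performs; a tasks()-based poll like the one sketched earlier would also do:

import time
from multiprocessing import Process

import docker


def _log_worker(logger, docker_url, service_id, tty):
    # The child process opens its own API client so no connection objects are
    # shared with the main process.
    cli = docker.APIClient(base_url=docker_url)
    logs = cli.service_logs(service_id, follow=True, stdout=True, stderr=True, is_tty=tty)
    _stream_logs_to_output(logger, logs)


def _run_service(self):
    ...
    log_process = None
    if self.enable_logging:
        log_process = Process(
            target=_log_worker,
            args=(self.log, self.docker_url, self.service['ID'], self.tty),
        )
        log_process.start()

    # Poll the service state in the main process instead of blocking on next(logs).
    while not self._has_service_terminated():
        time.sleep(1)

    if log_process is not None:
        # Once the service has finished, the blocking log reader can simply be terminated.
        log_process.terminate()
        log_process.join()
    ...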
I removed the _stream_logs_to_output function from the class to better separate the resources used.