This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Flush stdout/err in Dockerfile-workers before replacing the current process #14195

Merged · 13 commits · Oct 18, 2022

Conversation

@realtyem (Contributor) commented Oct 16, 2022

Pull Request Checklist

  • Pull request is based on the develop branch

  • Pull request includes a changelog file. The entry should:

    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off

  • Code style is correct
    (run the linters)

  • Fixes Dockerfile-workers not showing logging in early output #14194
Additional log output is available there.

Some logged output from the entrypoint of Dockerfile-workers is not displayed when running a container as a background process (detached mode, with -d). -d and detached mode are explained here, but the quick summary is that it is used for any process or service that should run in the background and not accept input from a command line, like a standard HTTP server. Many VPS and NAS systems (such as unRAID and fly.io) that run Docker images have this option built into the scripting that runs Docker images as 'apps'; it is not something easily changeable.

Explained somewhat here and even more so here, the culprit is line 'buffering' of output in Python. It appears that the buffer for stdout isn't being flushed, so the output just gets 'lost'. Figuring out where it disappeared to is beyond the scope of this pull request. The simplest way to flush this buffer is to pass the flush keyword argument to print(). More about print() and its flush keyword. The keyword was added in Python 3.3 and therefore isn't in danger of being non-existent for any version of Python used or tested today.
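(For illustration only, a minimal sketch of the change being described; this is not the actual entrypoint code.)

# Without flush=True, this line can sit in Python's stdout buffer and be
# lost if the process is replaced before the buffer is written out.
print("Generating base homeserver config", flush=True)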

As observed inside start.py, the issue is mitigated by sending logged output to stderr instead of stdout. This works, but it doesn't feel like the right way to handle this; just flush the buffers. If someone would rather I switch to using stderr, leave me a comment to that effect and I'll adjust.

Other options explored:

  • PYTHONUNBUFFERED=1 in the environment before starting the Python process.
  • Starting the Python scripts with -u, such as python -u configure_workers_and_start.py

Neither seemed viable, and both felt intrusive as fixes for one tiny oversight in the way buffered output is handled.

There are two commits in this pull request that can be reverted. One is harmless and probably worth keeping. The other is bad and shouldn't be used long term on GitHub (tons of log spam):

  • A line is output when Generating base homeserver config, but it had no counterpart for when the config already exists. I added one before collecting logs while diagnosing the issue. Look for Homeserver config already exists at:.
  • Added the environment variable COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 to the Complement CI workflow so that when tests get run, you won't have to wait for a flake or an error to actually see the log output.

GitHub CI Complement snippet output from before:

Generating RSA private key, 2048 bit long modulus (2 primes)
...........................+++++
...........................................+++++
e is 65537 (0x010001)
Signature ok
subject=CN = hs1
Getting CA Private Key
                DNS:hs1
2022-10-16 00:57:23,912 INFO Included extra file "/etc/supervisor/conf.d/postgres.conf" during parsing
2022-10-16 00:57:23,912 INFO Included extra file "/etc/supervisor/conf.d/synapse.conf" during parsing
2022-10-16 00:57:23,912 INFO Set uid to user 0 succeeded
2022-10-16 00:57:23,914 INFO supervisord started with pid 1

After:

Generating RSA private key, 2048 bit long modulus (2 primes)
....+++++
....................................................................+++++
e is 65537 (0x010001)
Signature ok
subject=CN = hs1
Getting CA Private Key
                DNS:hs1
Homeserver config already exists at: /data/homeserver.yaml
Starting supervisord
2022-10-16 01:19:35,454 INFO Included extra file "/etc/supervisor/conf.d/postgres.conf" during parsing
2022-10-16 01:19:35,454 INFO Included extra file "/etc/supervisor/conf.d/synapse.conf" during parsing
2022-10-16 01:19:35,454 INFO Set uid to user 0 succeeded
2022-10-16 01:19:35,456 INFO supervisord started with pid 1

And yes, it's literally only two lines different. Apparently the GitHub Complement suites generate their own configuration elsewhere, or hide the log output of start.py if it's run. This doesn't happen with the script run from the command line; I tested that by creating an obviously wrong worker type and feeding it through (see the issue at the top). The configuration was not created (which is the correct response), but because that part of configure_workers_and_start.py never gets touched (at least in these logs), it didn't produce the message that it was an incorrect worker type. I'm assuming that is something configurable with Go or Complement, and is a rabbit hole for another day.

Output in docker logs synapse works as expected.

Signed-off-by: Jason Little [email protected]

@realtyem (Contributor, Author) commented Oct 16, 2022

Spent a few minutes tracking down the failure. Looks like the Alias flake:

2022-10-16T03:02:51.3775894Z ##[group]❌ TestRoomAlias (36.99s)
2022-10-16T03:02:51.3776170Z 2022/10/16 02:46:16 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
2022-10-16T03:02:51.3776434Z 2022/10/16 02:46:34 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
2022-10-16T03:02:51.3776600Z     apidoc_room_alias_test.go:43: Deploy times: 18.08280097s blueprints, 15.881970214s containers
2022-10-16T03:02:51.3776773Z     client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (666.647458ms)
2022-10-16T03:02:51.3776939Z     client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (735.356276ms)
2022-10-16T03:02:51.3777132Z     client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/directory/room/#creates_alias:hs1 => 200 OK (80.10207ms)
2022-10-16T03:02:51.3777373Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/directory/room/#creates_alias:hs1 => 200 OK (11.070455ms)
2022-10-16T03:02:51.3777568Z     client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/directory/room/#room_members_list:hs1 => 200 OK (22.387208ms)
2022-10-16T03:02:51.3777758Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/directory/room/#room_members_list:hs1 => 200 OK (7.739968ms)
2022-10-16T03:02:51.3777970Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!VPygoawTzSovztdqiv:hs1/aliases => 403 Forbidden (20.338016ms)
2022-10-16T03:02:51.3778134Z     client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (788.782257ms)
2022-10-16T03:02:51.3778333Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!yOMojKJgXDwFkaEYcv:hs1/aliases => 200 OK (11.944651ms)
2022-10-16T03:02:51.3778517Z     client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/directory/room/#lists_aliases:hs1 => 200 OK (13.035746ms)
2022-10-16T03:02:51.3778716Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!yOMojKJgXDwFkaEYcv:hs1/aliases => 200 OK (5.369578ms)
2022-10-16T03:02:51.3778939Z ##[endgroup]
2022-10-16T03:02:51.3779257Z ##[group]❌ TestRoomAlias/Parallel (0s)
2022-10-16T03:02:51.3779463Z ##[endgroup]
2022-10-16T03:02:51.3779840Z ##[group]❌ TestRoomAlias/Parallel/GET_/rooms/:room_id/aliases_lists_aliases (820ms)
2022-10-16T03:02:51.3780337Z     apidoc_room_alias_test.go:87: MatchResponse key 'aliases' got '[]' want '[#lists_aliases:hs1]' - http://localhost:49277/_matrix/client/v3/rooms/%21yOMojKJgXDwFkaEYcv:hs1/aliases => {"aliases":[]}
2022-10-16T03:02:51.3780445Z 2022/10/16 02:46:52 ============================================
2022-10-16T03:02:51.3780454Z 

Text searching the raw logs for the ❌ found the error. The logs are huge, which is why I recommend reverting that commit.

@realtyem realtyem marked this pull request as ready for review October 16, 2022 03:42
@realtyem realtyem requested a review from a team as a code owner October 16, 2022 03:42
@DMRobertson (Contributor) left a comment

Not sure about some of this.

I think a cleaner fix might be to call sys.stdout.flush() and sys.stderr.flush() before we call os.execle/os.execv/os.execve?

Co-authored-by: David Robertson <[email protected]>
@realtyem (Contributor, Author):

Not sure about some of this.

I think a cleaner fix might be to call sys.stdout.flush() and sys.stderr.flush() before we call os.execle/os.execv/os.execve?

I admit to coming across those in my research about this. I did not think they would work without a companion sys.stdout.write() and friend. Turns out they do.

A quick set of tests yielded odd results. I won't rehash the bulk of the combinations I tried, but the most notable was that leaving start.py with flush=True and removing it in configure_workers_and_start.py, while adding a sys.stdout.flush() right above the os.execle(), yielded log output from configure_workers_and_start.py but nothing from start.py. Very strange, that.

I'm pretty sure I stand by my change here. Having flush=True in start.py means not having to add a sys.stdout.flush() call in three places. Isn't one keyword argument change better than three other lines added, just to fix an anomaly in buffer handling when that option is available?

…gs even on successful tests."

This reverts commit 03ba864.
…homeserver base config to show when it already exists."

This reverts commit 44cfab5.
@realtyem (Contributor, Author):

If you like, I can include a comment above the change about why the flush was added.

@DMRobertson (Contributor):

Having flush=True in start.py means not having to add a sys.stdout.flush() call in three places. Isn't one keyword argument change better than three other lines added[...]?

Yes and no. If someone in the future makes a print() call without flushing, or starts doing so indirectly via Python's stdlib logging, we're going to get the same problem all over again.

just to fix an anomaly in buffer handling when that option is available?

I don't think this is an anomaly in buffer handling; this is something that you simply are expected to do when replacing the current process with an exec* call. The Python docs even highlight this:

The current process is replaced immediately. Open file objects and descriptors are not flushed, so if there may be data buffered on these open files, you should flush them using sys.stdout.flush() or os.fsync() before calling an exec* function.

So I think explicitly flushing before exec-ing is the more robust (and arguably idiomatic) change here.
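(For illustration, the pattern described above looks roughly like this; the path and arguments are placeholders, not the exact Synapse code.)

import os
import sys

# Push any buffered log lines out before the exec* call replaces this process.
sys.stdout.flush()
sys.stderr.flush()
os.execv("/usr/local/bin/supervisord",
         ["supervisord", "-c", "/etc/supervisor/supervisord.conf"])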

@realtyem (Contributor, Author):

Having flush=True in start.py means not having to add a sys.stdout.flush() call in three places. Isn't one keyword argument change better than three other lines added[...]?

Yes and no. If someone in the future makes a print() call without flushing, or starts doing so indirectly via Python's stdlib logging, we're going to get the same problem all over again.

Well, I suppose that's a good point. I would argue that if they've heard of those things, they would look and see what others have done before in a file they're changing. But still, a valid point.

just to fix an anomaly in buffer handling when that option is available?

I don't think this is an anomaly in buffer handling; this is something that you simply are expected to do when replacing the current process with an exec* call. The Python docs even highlight this:

The current process is replaced immediately. Open file objects and descriptors are not flushed, so if there may be data buffered on these open files, you should flush them using sys.stdout.flush() or os.fsync() before calling an exec* function.

So I think explicitly flushing before exec-ing is the more robust (and arguably idiomatic) change here.

And this is me learning something new today. I had not come across that detail in previous research. It would have saved me a few hours, I'm sure. Very well, I'll add sys.stdout.flush() to start.py and configure_workers_and_start.py and leave them both writing to stdout but without a flush=True. Sound good?

@realtyem (Contributor, Author):

Or perhaps just make both of them print(str, file=sys.stderr). Somehow that feels wrong to me, but results 🤷‍♂️

@DMRobertson (Contributor):

I'll add sys.stdout.flush() to start.py and configure_workers_and_start.py and leave them both writing to stdout but without a flush=True. Sound good?

I think having error() write to sys.stderr would be a nice change too while we're here---though you'd need to sys.stderr.flush() too before an os.exec* call. Also beware that there are three different places where we call os.exec* in start.py.

Otherwise sounds good!

@realtyem (Contributor, Author) commented Oct 17, 2022

I'll add sys.stdout.flush() to start.py and configure_workers_and_start.py and leave them both writing to stdout but without a flush=True. Sound good?

I think having error() write to sys.stderr would be a nice change too while we're here---though you'd need to sys.stderr.flush() too before an os.exec* call. Also beware that there are three different places where we call os.exec* in start.py.

Otherwise sounds good!

I went for (essentially):

def log(txt: str) -> None:
    print(txt)


def error(txt: str) -> NoReturn:
    print(txt, file=sys.stderr)
    sys.exit(2)

and then added:

sys.stdout.flush()
sys.stderr.flush()

above each os.exec variant. Testing this really quickly then showed that start.py output no logs, as it gets called with an argument that returns without using any os.exec variant.

  • Inside configure_workers_and_start.py, generate_base_homeserver_config() calls out to start.py with a subprocess.check_output() using ["/usr/local/bin/python", "/start.py", "migrate_config"]
  • In start.py, it bounces from main into this conditional which then calls [generate_config_from_template()](https://github.com/matrix-org/synapse/blob/develop/docker/start.py#L39)
  • Inside generate_config_from_template(), a set of arguments is constructed, then passed to subprocess.check_output() at which point it returns all the way back to configure_workers_and_start.py and continues onwards to finish.
  • Poof! Buffers from stdout gone.

Would you like another set of flush()s above the subprocess.check_output() in start.py (and any more of those I find while I'm still in there)?

@DMRobertson (Contributor):

Inside configure_workers_and_start.py, generate_base_homeserver_config() calls out to start.py with a subprocess.check_output() using ["/usr/local/bin/python", "/start.py", "migrate_config"]

check_output:

  • spawns an entirely new process with an entirely new stdout/stderr stream
  • these streams are completely disconnected from the controlling process's stdout/stderr
  • for check_output specifically, the controlling process captures the subprocess's stdout; its stderr is discarded

Try changing check_output to check_call. See https://docs.python.org/3/library/subprocess.html for more details.
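(A rough sketch of the difference, for illustration only, reusing the command from above:)

import subprocess

# check_output captures the child's stdout and returns it to the caller, so
# the child's log lines never reach the container's stdout:
subprocess.check_output(["/usr/local/bin/python", "/start.py", "migrate_config"])

# check_call leaves the child's stdout/stderr attached to the parent process,
# so its output shows up in `docker logs` as expected:
subprocess.check_call(["/usr/local/bin/python", "/start.py", "migrate_config"])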

@realtyem (Contributor, Author):

Inside configure_workers_and_start.py, generate_base_homeserver_config() calls out to start.py with a subprocess.check_output() using ["/usr/local/bin/python", "/start.py", "migrate_config"]

check_output:

  • spawns an entirely new process with an entirely new stdout/stderr stream
  • these streams are completely disconnected from the controlling process's stdout/stderr
  • for check_output specifically, the controlling process captures the subprocess's stdout; its stderr is discarded

Try changing check_output to check_call. See https://docs.python.org/3/library/subprocess.html for more details.

check_call worked. I'll fix this up later tonight and send it in then.

@realtyem (Contributor, Author):

Any objections to a helper function? Here's what I'm looking at duplicating across both files:

# Utility functions
def log(txt: str) -> None:
    """
    Log something to the stdout.

    Args:
        txt: The text to log.
    """
    print(txt)


def error(txt: str) -> NoReturn:
    """
    Log something to the stderr and exit with an error code.

    Args:
        txt: The text to log in error.
    """
    print(txt, file=sys.stderr)
    sys.exit(2)


def flush_buffers() -> None:
    """
    Flush stdout and stderr buffers
    """
    sys.stdout.flush()
    sys.stderr.flush()

Then I can just call flush_buffers() at each exit point.

@realtyem (Contributor, Author):

Any objections to using subprocess.run() instead of subprocess.check_call()? I was reading about them in the link you gave, https://docs.python.org/3/library/subprocess.html, and it says:

Code needing to capture stdout or stderr should use run() instead:

So I read a bit on run(), and it's more...robust? Also the

Older high-level API

title above all the subprocess.check_* calls seems worrying. In testing this out, run() seems to work fine. Is there a higher-level process somewhere (like maybe in Complement) that depends on using the subprocess.check_* calls?

@DMRobertson (Contributor):

These are simple bootstrapping scripts: we don't need thorough docstrings here (particularly when the functions are short and the docstrings are longer than the functions they describe). Otherwise no objections to helpers.

check_output etc are implemented on top of run(), I think; there's no robustness difference AFAIK. They're oldish, but not going to be removed any time soon (no deprecation notice; the docs even note that lots of existing code calls these functions). If you particularly want to use run I won't stop you; crucially it's present in all Python versions we support (3.7+ at the time of writing). But I don't think it's a must-have.

@realtyem (Contributor, Author):

They were already in configure_workers_and_start.py; I just thought I would add them for consistency. I'll use run() but take out the docstrings.
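(For reference, a sketch of what that call might look like with run(); not necessarily the exact final code.)

import subprocess

# run() with check=True raises CalledProcessError on a non-zero exit status,
# just like check_call(), and leaves stdout/stderr attached to the parent.
subprocess.run(["/usr/local/bin/python", "/start.py", "migrate_config"], check=True)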

@realtyem realtyem requested a review from DMRobertson October 18, 2022 11:28
@DMRobertson DMRobertson changed the title Fix early docker logging Flush stdout/err in Dockerfile-workers before replacing the current process Oct 18, 2022
@DMRobertson (Contributor) left a comment

Thanks, LGTM.

@DMRobertson DMRobertson enabled auto-merge (squash) October 18, 2022 11:43
@DMRobertson DMRobertson merged commit 6c5082f into matrix-org:develop Oct 18, 2022
@realtyem realtyem deleted the fix-docker-logging branch October 19, 2022 22:04