This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Flush stdout/err in Dockerfile-workers before replacing the current process #14195

Merged · 13 commits · Oct 18, 2022

Conversation

@realtyem (Contributor) commented Oct 16, 2022

Pull Request Checklist

  • Pull request is based on the develop branch

  • Pull request includes a changelog file. The entry should:

    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off

  • Code style is correct
    (run the linters)

  • Fixes Dockerfile-workers not showing logging in early output #14194
Additional log output is available there.

Some logged output from the entrypoint of Dockerfile-workers is not displayed when running a container as a background process (detached mode, with -d). -d and detached mode are explained here, but the quick summary is that it is used for any process or service that should run in the background and not accept input from a command line, like a standard HTTP server. Many VPS and NAS systems (such as unRAID and fly.io) that run Docker images have this option built into the scripting that runs Docker images as 'apps'; it is not something easily changeable.

Explained somewhat here and even more so here, the culprit is line 'buffering' of output in Python. It appears that the buffer for stdout isn't being flushed, so the output just gets 'lost'. Figuring out where it disappeared to is beyond the scope of this pull request. The simplest way to flush this buffer is to pass the flush keyword argument to print(). More about print() and its flush keyword. The keyword was added in Python 3.3 and therefore isn't in danger of being non-existent for any version of Python used or tested today.
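(For illustration only, a minimal sketch of the change being described; this is not the actual entrypoint code.)

# Without flush=True, this line can sit in Python's stdout buffer and be
# lost if the process is replaced before the buffer is written out.
print("Generating base homeserver config", flush=True)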

As observed inside start.py, the issue is mitigated by sending logged output to stderr instead of stdout. This works, but it doesn't feel like the right way to handle this; just flush the buffers. If someone would rather I switch to using stderr, leave me a comment to that effect and I'll adjust.

Other options explored:

  • PYTHONUNBUFFERED=1 in the environment before starting the Python process.
  • Starting the Python scripts with -u, such as python -u configure_workers_and_start.py

Neither seemed viable, and both felt intrusive as fixes for one tiny oversight in the way buffered output is handled.

There are two commits in this pull request that can be reverted. One is harmless and probably worth keeping. The other is bad and shouldn't be used long term on GitHub (tons of log spam):

  • A line is output when Generating base homeserver config, but it had no counterpart for when the config already exists. I added one before collecting logs while diagnosing the issue. Look for Homeserver config already exists at:.
  • Added the environment variable COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 to the Complement CI workflow so that when tests get run, you won't have to wait for a flake or an error to actually see the log output.

GitHub CI Complement snippet output from before:

Generating RSA private key, 2048 bit long modulus (2 primes)
...........................+++++
...........................................+++++
e is 65537 (0x010001)
Signature ok
subject=CN = hs1
Getting CA Private Key
                DNS:hs1
2022-10-16 00:57:23,912 INFO Included extra file "/etc/supervisor/conf.d/postgres.conf" during parsing
2022-10-16 00:57:23,912 INFO Included extra file "/etc/supervisor/conf.d/synapse.conf" during parsing
2022-10-16 00:57:23,912 INFO Set uid to user 0 succeeded
2022-10-16 00:57:23,914 INFO supervisord started with pid 1

After:

Generating RSA private key, 2048 bit long modulus (2 primes)
....+++++
....................................................................+++++
e is 65537 (0x010001)
Signature ok
subject=CN = hs1
Getting CA Private Key
                DNS:hs1
Homeserver config already exists at: /data/homeserver.yaml
Starting supervisord
2022-10-16 01:19:35,454 INFO Included extra file "/etc/supervisor/conf.d/postgres.conf" during parsing
2022-10-16 01:19:35,454 INFO Included extra file "/etc/supervisor/conf.d/synapse.conf" during parsing
2022-10-16 01:19:35,454 INFO Set uid to user 0 succeeded
2022-10-16 01:19:35,456 INFO supervisord started with pid 1

And yes, it's literally only two lines different. Apparently the GitHub Complement suites generate their own configuration elsewhere, or hide the log output of start.py if it's run. This doesn't happen with the script run from the command line; I tested that by creating an obviously wrong worker type and feeding it through (see the issue at the top). The configuration was not created (which is the correct response), but because that part of configure_workers_and_start.py never gets touched (at least in these logs), it didn't produce the message that it was an incorrect worker type. I'm assuming that is something configurable with Go or Complement, and is a rabbit hole for another day.

Output in docker logs synapse works as expected.

Signed-off-by: Jason Little [email protected]

@realtyem (Contributor, Author) commented Oct 16, 2022

Spent a few minutes tracking down the failure. Looks like the Alias flake:

2022-10-16T03:02:51.3775894Z ##[group]❌ TestRoomAlias (36.99s)
2022-10-16T03:02:51.3776170Z 2022/10/16 02:46:16 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
2022-10-16T03:02:51.3776434Z 2022/10/16 02:46:34 Sharing [SERVER_NAME=hs1 SYNAPSE_COMPLEMENT_USE_WORKERS=true SYNAPSE_COMPLEMENT_DATABASE=postgres] host environment variables with container
2022-10-16T03:02:51.3776600Z     apidoc_room_alias_test.go:43: Deploy times: 18.08280097s blueprints, 15.881970214s containers
2022-10-16T03:02:51.3776773Z     client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (666.647458ms)
2022-10-16T03:02:51.3776939Z     client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (735.356276ms)
2022-10-16T03:02:51.3777132Z     client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/directory/room/#creates_alias:hs1 => 200 OK (80.10207ms)
2022-10-16T03:02:51.3777373Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/directory/room/#creates_alias:hs1 => 200 OK (11.070455ms)
2022-10-16T03:02:51.3777568Z     client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/directory/room/#room_members_list:hs1 => 200 OK (22.387208ms)
2022-10-16T03:02:51.3777758Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/directory/room/#room_members_list:hs1 => 200 OK (7.739968ms)
2022-10-16T03:02:51.3777970Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!VPygoawTzSovztdqiv:hs1/aliases => 403 Forbidden (20.338016ms)
2022-10-16T03:02:51.3778134Z     client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (788.782257ms)
2022-10-16T03:02:51.3778333Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!yOMojKJgXDwFkaEYcv:hs1/aliases => 200 OK (11.944651ms)
2022-10-16T03:02:51.3778517Z     client.go:604: [CSAPI] PUT hs1/_matrix/client/v3/directory/room/#lists_aliases:hs1 => 200 OK (13.035746ms)
2022-10-16T03:02:51.3778716Z     client.go:604: [CSAPI] GET hs1/_matrix/client/v3/rooms/!yOMojKJgXDwFkaEYcv:hs1/aliases => 200 OK (5.369578ms)
2022-10-16T03:02:51.3778939Z ##[endgroup]
2022-10-16T03:02:51.3779257Z ##[group]❌ TestRoomAlias/Parallel (0s)
2022-10-16T03:02:51.3779463Z ##[endgroup]
2022-10-16T03:02:51.3779840Z ##[group]❌ TestRoomAlias/Parallel/GET_/rooms/:room_id/aliases_lists_aliases (820ms)
2022-10-16T03:02:51.3780337Z     apidoc_room_alias_test.go:87: MatchResponse key 'aliases' got '[]' want '[#lists_aliases:hs1]' - http://localhost:49277/_matrix/client/v3/rooms/%21yOMojKJgXDwFkaEYcv:hs1/aliases => {"aliases":[]}
2022-10-16T03:02:51.3780445Z 2022/10/16 02:46:52 ============================================
2022-10-16T03:02:51.3780454Z 

Text searching the raw logs for the ❌ found the error. The logs are huge, which is why I recommend reverting that commit.

@realtyem realtyem marked this pull request as ready for review October 16, 2022 03:42
@realtyem realtyem requested a review from a team as a code owner October 16, 2022 03:42
@DMRobertson (Contributor) left a comment

Not sure about some of this.

I think a cleaner fix might be to call sys.stdout.flush() and sys.stderr.flush() before we call os.execle/os.execv/os.execve?

Co-authored-by: David Robertson <[email protected]>
@realtyem (Contributor, Author):

Not sure about some of this.

I think a cleaner fix might be to call sys.stdout.flush() and sys.stderr.flush() before we call os.execle/os.execv/os.execve?

I admit to coming across those in my research about this. I did not think they would work without a companion sys.stdout.write() and friend. Turns out they do.

A quick set of tests yielded odd results. I won't rehash the bulk of the combinations I tried, but the most notable was that leaving start.py with flush=True and removing it in configure_workers_and_start.py, while adding a sys.stdout.flush() right above the os.execle(), yielded log output from configure_workers_and_start.py but nothing from start.py. Very strange, that.

I'm pretty sure I stand by my change here. Having flush=True in start.py means not having to add a sys.stdout.flush() call in three places. Isn't one keyword argument change better than three other lines added, just to fix an anomaly in buffer handling when that option is available?

…gs even on successful tests."

This reverts commit 03ba864.
…homeserver base config to show when it already exists."

This reverts commit 44cfab5.
@realtyem (Contributor, Author):

If you like, I can include a comment above the change about why the flush was added.

@DMRobertson (Contributor):

Having flush=True in start.py means not having to add a sys.stdout.flush() call in three places. Isn't one keyword argument change better than three other lines added[...]?

Yes and no. If someone in the future makes a print() call without flushing, or starts doing so indirectly via Python's stdlib logging, we're going to get the same problem all over again.

just to fix an anomaly in buffer handling when that option is available?

I don't think this is an anomaly in buffer handling; this is something that you simply are expected to do when replacing the current process with an exec* call. The Python docs even highlight this:

The current process is replaced immediately. Open file objects and descriptors are not flushed, so if there may be data buffered on these open files, you should flush them using sys.stdout.flush() or os.fsync() before calling an exec* function.

So I think explicitly flushing before exec-ing is the more robust (and arguably idiomatic) change here.
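(For illustration, the pattern described above looks roughly like this; the path and arguments are placeholders, not the exact Synapse code.)

import os
import sys

# Push any buffered log lines out before the exec* call replaces this process.
sys.stdout.flush()
sys.stderr.flush()
os.execv("/usr/local/bin/supervisord",
         ["supervisord", "-c", "/etc/supervisor/supervisord.conf"])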

@realtyem (Contributor, Author):

Having flush=True in start.py means not having to add a sys.stdout.flush() call in three places. Isn't one keyword argument change better than three other lines added[...]?

Yes and no. If someone in the future makes a print() call without flushing, or starts doing so indirectly via Python's stdlib logging, we're going to get the same problem all over again.

Well, I suppose that's a good point. I would argue that if they've heard of those things, they would look and see what others have done before in a file they're changing. But still, a valid point.

just to fix an anomaly in buffer handling when that option is available?

I don't think this is an anomaly in buffer handling; this is something that you simply are expected to do when replacing the current process with an exec* call. The Python docs even highlight this:

The current process is replaced immediately. Open file objects and descriptors are not flushed, so if there may be data buffered on these open files, you should flush them using sys.stdout.flush() or os.fsync() before calling an exec* function.

So I think explicitly flushing before exec-ing is the more robust (and arguably idiomatic) change here.

And this is me learning something new today. I had not come across that detail in previous research. It would have saved me a few hours, I'm sure. Very well, I'll add sys.stdout.flush() to start.py and configure_workers_and_start.py and leave them both writing to stdout but without a flush=True. Sound good?

@realtyem (Contributor, Author):

Or perhaps just make both of them print(str, file=sys.stderr). Somehow that feels wrong to me, but results 🤷‍♂️

@DMRobertson (Contributor):

I'll add sys.stdout.flush() to start.py and configure_workers_and_start.py and leave them both writing to stdout but without a flush=True. Sound good?

I think having error() write to sys.stderr would be a nice change too while we're here---though you'd need to sys.stderr.flush() too before an os.exec* call. Also beware that there are three different places where we call os.exec* in start.py.

Otherwise sounds good!

@realtyem (Contributor, Author) commented Oct 17, 2022

I'll add sys.stdout.flush() to start.py and configure_workers_and_start.py and leave them both writing to stdout but without a flush=True. Sound good?

I think having error() write to sys.stderr would be a nice change too while we're here---though you'd need to sys.stderr.flush() too before an os.exec* call. Also beware that there are three different places where we call os.exec* in start.py.

Otherwise sounds good!

I went for (essentially):

def log(txt: str) -> None:
    print(txt)


def error(txt: str) -> NoReturn:
    print(txt, file=sys.stderr)
    sys.exit(2)

and then added:

sys.stdout.flush()
sys.stderr.flush()

above each os.exec variant. Testing this really quickly then showed that start.py output no logs, as it gets called with an argument that returns without using any os.exec variant.

  • Inside configure_workers_and_start.py, generate_base_homeserver_config() calls out to start.py with a subprocess.check_output() using ["/usr/local/bin/python", "/start.py", "migrate_config"]
  • In start.py, it bounces from main into this conditional which then calls [generate_config_from_template()](https://github.com/matrix-org/synapse/blob/develop/docker/start.py#L39)
  • Inside generate_config_from_template(), a set of arguments is constructed, then passed to subprocess.check_output() at which point it returns all the way back to configure_workers_and_start.py and continues onwards to finish.
  • Poof! Buffers from stdout gone.

Would you like another set of flush()s above the subprocess.check_output() in start.py (and any more of those I find while I'm still in there)?

@DMRobertson (Contributor):

Inside configure_workers_and_start.py, generate_base_homeserver_config() calls out to start.py with a subprocess.check_output() using ["/usr/local/bin/python", "/start.py", "migrate_config"]

check_output:

  • spawns an entirely new process with an entirely new stdout/stderr stream
  • these streams are completely disconnected from the controlling process's stdout/stderr
  • for check_output specifically, the controlling process captures the subprocess's stdout; its stderr is discarded

Try changing check_output to check_call. See https://docs.python.org/3/library/subprocess.html for more details.
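(A rough sketch of the difference, for illustration only, reusing the command from above:)

import subprocess

# check_output captures the child's stdout and returns it to the caller, so
# the child's log lines never reach the container's stdout:
subprocess.check_output(["/usr/local/bin/python", "/start.py", "migrate_config"])

# check_call leaves the child's stdout/stderr attached to the parent process,
# so its output shows up in `docker logs` as expected:
subprocess.check_call(["/usr/local/bin/python", "/start.py", "migrate_config"])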

@realtyem (Contributor, Author):

Inside configure_workers_and_start.py, generate_base_homeserver_config() calls out to start.py with a subprocess.check_output() using ["/usr/local/bin/python", "/start.py", "migrate_config"]

check_output:

  • spawns an entirely new process with an entirely new stdout/stderr stream
  • these streams are completely disconnected from the controlling process's stdout/stderr
  • for check_output specifically, the controlling process captures the subprocess's stdout; its stderr is discarded

Try changing check_output to check_call. See https://docs.python.org/3/library/subprocess.html for more details.

check_call worked. I'll fix this up later tonight and send it in then.

@realtyem (Contributor, Author):

Any objections to a helper function? Here's what I'm looking at duplicating across both files:

# Utility functions
def log(txt: str) -> None:
    """
    Log something to the stdout.

    Args:
        txt: The text to log.
    """
    print(txt)


def error(txt: str) -> NoReturn:
    """
    Log something to the stderr and exit with an error code.

    Args:
        txt: The text to log in error.
    """
    print(txt, file=sys.stderr)
    sys.exit(2)


def flush_buffers() -> None:
    """
    Flush stdout and stderr buffers
    """
    sys.stdout.flush()
    sys.stderr.flush()

Then I can just call flush_buffers() at each exit point.

@realtyem (Contributor, Author):

Any objections to using subprocess.run() instead of subprocess.check_call()? I was reading about them in the link you gave, https://docs.python.org/3/library/subprocess.html, and it says:

Code needing to capture stdout or stderr should use run() instead:

So I read a bit on run(), and it's more...robust? Also the

Older high-level API

title above all the subprocess.check_* calls seems worrying. In testing this out, run() seems to work fine. Is there a higher-level process somewhere (like maybe in Complement) that depends on using the subprocess.check_* calls?

@DMRobertson (Contributor):

These are simple bootstrapping scripts: we don't need thorough docstrings here (particularly when the functions are short and the docstrings are longer than the functions they describe). Otherwise no objections to helpers.

check_output etc are implemented on top of run(), I think; there's no robustness difference AFAIK. They're oldish, but not going to be removed any time soon (no deprecation notice; the docs even note that lots of existing code calls these functions). If you particularly want to use run I won't stop you; crucially it's present in all Python versions we support (3.7+ at the time of writing). But I don't think it's a must-have.

@realtyem (Contributor, Author):

They were already in configure_workers_and_start.py; I just thought I would add them for consistency. I'll use run() but take out the docstrings.
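(For reference, a sketch of what that call might look like with run(); not necessarily the exact final code.)

import subprocess

# run() with check=True raises CalledProcessError on a non-zero exit status,
# just like check_call(), and leaves stdout/stderr attached to the parent.
subprocess.run(["/usr/local/bin/python", "/start.py", "migrate_config"], check=True)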

@realtyem realtyem requested a review from DMRobertson October 18, 2022 11:28
@DMRobertson DMRobertson changed the title Fix early docker logging Flush stdout/err in Dockerfile-workers before replacing the current process Oct 18, 2022
@DMRobertson (Contributor) left a comment

Thanks, LGTM.

@DMRobertson DMRobertson enabled auto-merge (squash) October 18, 2022 11:43
@DMRobertson DMRobertson merged commit 6c5082f into matrix-org:develop Oct 18, 2022
@realtyem realtyem deleted the fix-docker-logging branch October 19, 2022 22:04