✨ NEW: nb_merge_streams configuration #364

Merged
merged 1 commit into from
Oct 3, 2021
3 changes: 3 additions & 0 deletions docs/use/start.md
@@ -112,4 +112,7 @@ Then for parsing and output rendering:
* - `nb_output_stderr`
  - `show`
  - One of 'show', 'remove', 'warn', 'error' or 'severe', [see here](use/format/stderr) for details.
* - `nb_merge_streams`
  - `False`
  - If `True`, ensure all stdout / stderr output streams are merged into single outputs. This ensures deterministic outputs.
Contributor

This explanation is perhaps worth clarifying further. Which aspect of the outputs becomes deterministic?

Member Author

@chrisjsewell Oct 3, 2021

Feel free to make a PR with the change, and I'll merge

Contributor

I agree, though perhaps leaving 5 minutes to review a new config option is too short. Maybe something to consider for the future.

Also, I can't make this PR precisely because I don't understand what result merging the streams achieves. Given that I'm somewhat familiar with nbformat and the stack involved, this is probably true for many others.

Member Author

The result is "documented" by the test and its output (see PR files changed).
It is a common pattern, already used in places like: https://github.com/computationalmodelling/nbval/blob/master/nbval/plugin.py, and was already planned in jupyter-book/jupyter-book#973

Contributor

Thanks, I understand the purpose of the feature now. I'll trust your judgement call that looking at the source and the test is good enough as far as documentation is concerned.

Member Author

Yeh, I'm certainly open to improved documentation, and we will probably iterate on that more in jupyter-book/jupyter-book#1448 (comment)

The results, e.g. here https://sqla-tutorials-nb.readthedocs.io/en/latest/metadata.html#emitting-ddl-to-the-database, speak for themselves though: before all the output lines were in separate div boxes 😄

Member

@mmcky Oct 4, 2021

> The result is "documented" by the test and its output (see PR files changed).
> It is a common pattern, already used in places like: https://github.com/computationalmodelling/nbval/blob/master/nbval/plugin.py, and was already planned in executablebooks/jupyter-book#973

Thanks @chrisjsewell -- this is a great improvement 👍

Thanks @akhmerov for your comments -- I am also not sure what "This ensures deterministic outputs." means here.

Just my 2 cents -- I find documentation in tests a sub-optimal form of documentation, as it requires reading tests, which typically means you need to understand the test infrastructure, which is non-trivial in many cases and (for me) is cognitive overhead :-).

Member Author

It's deterministic because, instead of having a random (non-deterministic) number of stdout/stderr outputs, you have at most one of each, as explained in jupyter-book/jupyter-book#973.
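
As a rough illustration of that point (a sketch on plain dicts, not myst-nb code; `merge_by_name` is a hypothetical helper): two executions of the same cell may flush stdout in a different number of chunks, yet the merged text per stream comes out identical.

```python
def merge_by_name(outputs):
    """Concatenate the text of stream outputs that share a name (illustrative only)."""
    merged = {}
    for out in outputs:
        if out["output_type"] == "stream":
            merged[out["name"]] = merged.get(out["name"], "") + out["text"]
    return merged


# Two hypothetical executions of the same cell: the kernel happened to flush
# stdout in a different number of chunks each time.
run_a = [
    {"output_type": "stream", "name": "stdout", "text": "a\nb\n"},
    {"output_type": "stream", "name": "stdout", "text": "c\n"},
]
run_b = [
    {"output_type": "stream", "name": "stdout", "text": "a\n"},
    {"output_type": "stream", "name": "stdout", "text": "b\n"},
    {"output_type": "stream", "name": "stdout", "text": "c\n"},
]

print(len(run_a), len(run_b))  # the number of raw outputs differs between runs
print(merge_by_name(run_a) == merge_by_name(run_b))  # the merged content does not
```

Without merging, each raw output would be rendered as its own block, so the number of blocks could change from build to build.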

Member

@choldgraf Oct 4, 2021

I agree that this could be documented more clearly for end-users - if we think that this is a useful feature, and if it is turned off by default, then we should have a section that is easily discoverable for a user that would benefit from this feature.

For example, we could create a parent category of ## `stdout`/`stderr` around here, then create a nested section like:

### Group `stdout` and `stderr` outputs into a single stream

Notebooks may print multiple things to `stdout` and `stderr`. For example, if a cell prints status updates throughout its execution, each of these is often printed to `stdout`. By default, these outputs may be split across multiple items, and will be rendered as separate "chunks" in your built documentation.

If you'd like each of the outputs in `stderr` and `stdout` to be merged into a single stream for each, use the following configuration:

```
nb_merge_streams = True
```

This will ensure that all `stderr` and `stdout` outputs are merged into a single group.
This also makes the cell outputs more deterministic because slight differences in timing may result in different orders of `stderr` and `stdout` in the cell output, while this will sort them properly.

Member

@choldgraf Oct 4, 2021

actually I'll just open a PR to propose these changes since this one is already merged.

1 change: 1 addition & 0 deletions myst_nb/__init__.py
@@ -133,6 +133,7 @@ def visit_element_html(self, node):
    app.add_config_value("nb_render_plugin", "default", "env")
    app.add_config_value("nb_render_text_lexer", "myst-ansi", "env")
    app.add_config_value("nb_output_stderr", "show", "env")
    app.add_config_value("nb_merge_streams", False, "env")

    # Register our post-transform which will convert output bundles to nodes
    app.add_post_transform(PasteNodesToDocutils)
55 changes: 55 additions & 0 deletions myst_nb/render_outputs.py
@@ -1,5 +1,6 @@
"""A Sphinx post-transform, to convert notebook outputs to AST nodes."""
import os
import re
from abc import ABC, abstractmethod
from typing import List, Optional
from unittest import mock
@@ -91,6 +92,58 @@ def load_renderer(name: str) -> "CellOutputRendererBase":
    raise MystNbEntryPointError(f"No Entry Point found for myst_nb.mime_render:{name}")


RGX_CARRIAGERETURN = re.compile(r".*\r(?=[^\n])")
RGX_BACKSPACE = re.compile(r"[^\n]\b")
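
A small check of the two patterns as they appear in this diff, offered as a sketch and not as a statement about upstream behaviour. In a Python raw-string pattern, `\b` outside a character class is a word-boundary assertion, not the backspace character. `BACKSPACE_LITERAL` below is a hypothetical alternative that is not part of this PR, and the sample strings are made up.

```python
import re

RGX_CARRIAGERETURN = re.compile(r".*\r(?=[^\n])")  # as written in the diff
RGX_BACKSPACE = re.compile(r"[^\n]\b")             # as written in the diff
BACKSPACE_LITERAL = re.compile(r"[^\n]\x08")       # hypothetical: a real backspace char

# Carriage return: text before a "\r" that is not followed by "\n" is dropped.
progress = "10%\r100%\n"
print(repr(RGX_CARRIAGERETURN.sub("", progress)))

# Backspace: "abc", one backspace, then "d"; a terminal would display "abd".
text = "abc\x08d"
as_written = RGX_BACKSPACE.sub("", text)   # removes characters at word boundaries
literal = BACKSPACE_LITERAL.sub("", text)  # removes "c" plus the backspace
print(repr(as_written), repr(literal))

# The loop guard used in `coalesce_streams` compares the text with a copy of
# itself, so it is false on the first check and the loop body is skipped.
old = text
loop_runs = 0
while len(text) < len(old):
    loop_runs += 1
print(loop_runs)
```

Read this way, the backspace substitution would not be reached through that loop as shown; whether that matters for published notebooks is a judgement call left to the reviewers.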


def coalesce_streams(outputs: List[NotebookNode]) -> List[NotebookNode]:
    """Merge all stream outputs with shared names into single streams.

    This ensures deterministic outputs.

    Adapted from:
    https://github.com/computationalmodelling/nbval/blob/master/nbval/plugin.py.
    """
    if not outputs:
        return []

    new_outputs = []
    streams = {}
    for output in outputs:
        if output["output_type"] == "stream":
            if output["name"] in streams:
                streams[output["name"]]["text"] += output["text"]
            else:
                new_outputs.append(output)
                streams[output["name"]] = output
        else:
            new_outputs.append(output)

    # process \r and \b characters
    for output in streams.values():
        old = output["text"]
        while len(output["text"]) < len(old):
            old = output["text"]
            # Cancel out anything-but-newline followed by backspace
            output["text"] = RGX_BACKSPACE.sub("", output["text"])
        # Replace all carriage returns not followed by newline
        output["text"] = RGX_CARRIAGERETURN.sub("", output["text"])

    # We also want to ensure stdout and stderr are always in the same consecutive order,
    # because they are asynchronous, so order isn't guaranteed.
    for i, output in enumerate(new_outputs):
        if output["output_type"] == "stream" and output["name"] == "stderr":
            if (
                len(new_outputs) >= i + 2
                and new_outputs[i + 1]["output_type"] == "stream"
                and new_outputs[i + 1]["name"] == "stdout"
            ):
                stdout = new_outputs.pop(i + 1)
                new_outputs.insert(i, stdout)
Member

Just to check that I understand correctly:

`new_outputs` may contain a bunch of outputs, some stdout and some stderr, all interwoven. This section sorts them so that the stdout outputs always come after the stderr outputs?

If that is correct, maybe I'm missing something, but could we simplify with something like:

`sorted(new_outputs, lambda output: output.get("name"))`

?
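
A quick check of the one-liner above, as a sketch on plain dicts with made-up outputs: `sorted` takes its key function as a keyword argument only, outputs without a `name` (such as `execute_result`) produce `None` keys that cannot be compared with strings, and alphabetical order would place `stderr` before `stdout`.

```python
outputs = [
    {"output_type": "stream", "name": "stderr", "text": "warn\n"},
    {"output_type": "stream", "name": "stdout", "text": "hello\n"},
    {"output_type": "execute_result", "data": {"text/plain": "1"}},
]

# 1. Passing the key function positionally is rejected by `sorted`.
try:
    sorted(outputs, lambda output: output.get("name"))
    positional_ok = True
except TypeError:
    positional_ok = False

# 2. With `key=`, the output lacking a "name" yields None, and None cannot be
#    ordered against str in Python 3.
try:
    sorted(outputs, key=lambda output: output.get("name"))
    none_ok = True
except TypeError:
    none_ok = False

# 3. Restricted to stream outputs, alphabetical order puts stderr first.
streams = [o for o in outputs if o["output_type"] == "stream"]
names = [o["name"] for o in sorted(streams, key=lambda o: o["name"])]

print(positional_ok, none_ok, names)  # False False ['stderr', 'stdout']
```

A sort over the whole list would also move non-stream outputs relative to the streams, which the loop in the diff does not do.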

Member Author

stdout comes before stderr. Again, I would emphasise this is copied directly from https://github.com/computationalmodelling/nbval/blob/master/nbval/plugin.py. If you want to change this, I would suggest making a PR there first.

Contributor

@akhmerov Oct 4, 2021

@choldgraf I think it's not what you describe (although I might have misread it).

  • non-stream outputs aren't reshuffled. So [stdout, image, stdout] stays untouched.
  • multiple outputs of the same stream type are merged and not just reordered.

> Again, I would emphasise this is copied directly from https://github.com/computationalmodelling/nbval/blob/master/nbval/plugin.py. If you want to change this, I would suggest making a PR there first

Looking at the description of nbval, I think its purpose is different from publishing notebooks, and therefore the logic it applies might not be optimal for myst-nb or vice versa.

Speaking of publication purposes, reordering streams may be undesirable: imagine demonstrating code that produces a warning halfway through.

On the other hand, coalescing same stream types may be a good idea without even giving the users an option to configure this. Or is there a use case for having two independent text outputs in a row after a single input?

Member Author

> imagine demonstrating code that produces a warning halfway through.

Except that, if you write to stderr, there is no guarantee that it will end up halfway through stdout.

anyhow, I have no capacity left to devote to this, so I'll leave you guys to propose new PRs

Member

Good point @akhmerov - I feel like we should see how folks use this feature and see if there are any natural points to improve the ux over time 👍


    return new_outputs
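
To see the merge and reorder steps end to end, here is a minimal sketch that mirrors the logic of `coalesce_streams` on plain dicts instead of `NotebookNode` objects. The name `coalesce_streams_sketch` is hypothetical, the `\r` / `\b` handling is left out, and the stream dicts are copied where the function in the diff mutates the first output of each stream in place. The first input corresponds to the outputs in `tests/notebooks/merge_streams.ipynb`, with each `text` list joined into one string.

```python
from typing import Any, Dict, List

Output = Dict[str, Any]


def coalesce_streams_sketch(outputs: List[Output]) -> List[Output]:
    """Merge same-named stream outputs, then put stdout before an adjacent stderr."""
    new_outputs: List[Output] = []
    streams: Dict[str, Output] = {}
    for output in outputs:
        if output["output_type"] == "stream":
            if output["name"] in streams:
                # Append to the first output seen for this stream name.
                streams[output["name"]]["text"] += output["text"]
            else:
                merged = dict(output)  # copy, so the input list is left untouched
                new_outputs.append(merged)
                streams[output["name"]] = merged
        else:
            new_outputs.append(output)

    # Swap a stderr output that is immediately followed by stdout.
    for i, output in enumerate(new_outputs):
        if (
            output["output_type"] == "stream"
            and output["name"] == "stderr"
            and i + 1 < len(new_outputs)
            and new_outputs[i + 1]["output_type"] == "stream"
            and new_outputs[i + 1]["name"] == "stdout"
        ):
            new_outputs.insert(i, new_outputs.pop(i + 1))
    return new_outputs


notebook_outputs = [
    {"output_type": "stream", "name": "stdout", "text": "stdout1\nstdout2\n"},
    {"output_type": "stream", "name": "stderr", "text": "stderr1\nstderr2\n"},
    {"output_type": "stream", "name": "stdout", "text": "stdout3\n"},
    {"output_type": "stream", "name": "stderr", "text": "stderr3\n"},
    {"output_type": "execute_result", "data": {"text/plain": "1"}},
]
merged_outputs = coalesce_streams_sketch(notebook_outputs)
print([(o["output_type"], o.get("name")) for o in merged_outputs])
# [('stream', 'stdout'), ('stream', 'stderr'), ('execute_result', None)]

# A stream split around a non-stream output (the image payload is a placeholder).
interleaved = [
    {"output_type": "stream", "name": "stdout", "text": "a\n"},
    {"output_type": "display_data", "data": {"image/png": "..."}},
    {"output_type": "stream", "name": "stdout", "text": "b\n"},
]
result = coalesce_streams_sketch(interleaved)
print([(o["output_type"], o.get("text")) for o in result])
# [('stream', 'a\nb\n'), ('display_data', None)]
```

In this sketch the non-stream output keeps its place relative to the first stream output, but text printed after it is folded into the earlier stream block, which is relevant to the ordering concerns raised in the review thread.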


class CellOutputsToNodes(SphinxPostTransform):
    """Use the builder context to transform a CellOutputNode into Sphinx nodes."""

@@ -108,6 +161,8 @@ def run(self):
                renderer_cls = load_renderer(node.renderer)
                renderers[node.renderer] = renderer_cls
            renderer = renderer_cls(self.document, node, abs_dir)
            if self.config.nb_merge_streams:
                node._outputs = coalesce_streams(node.outputs)
            output_nodes = renderer.cell_output_to_nodes(self.env.nb_render_priority)
            node.replace_self(output_nodes)

82 changes: 82 additions & 0 deletions tests/notebooks/merge_streams.ipynb
@@ -0,0 +1,82 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"source": [
"import sys\n",
"print('stdout1', file=sys.stdout)\n",
"print('stdout2', file=sys.stdout)\n",
"print('stderr1', file=sys.stderr)\n",
"print('stderr2', file=sys.stderr)\n",
"print('stdout3', file=sys.stdout)\n",
"print('stderr3', file=sys.stderr)\n",
"1"
],
"outputs": [
Member Author

Here the stdout/stderr outputs are all mixed up

{
"output_type": "stream",
"name": "stdout",
"text": [
"stdout1\n",
"stdout2\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"stderr1\n",
"stderr2\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"stdout3\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"stderr3\n"
]
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1"
]
},
"metadata": {},
"execution_count": 1
}
],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
11 changes: 11 additions & 0 deletions tests/test_render_outputs.py
@@ -72,6 +72,17 @@ def test_stderr_remove(sphinx_run, file_regression):
    file_regression.check(doctree.pformat(), extension=".xml", encoding="utf8")


@pytest.mark.sphinx_params(
    "merge_streams.ipynb",
    conf={"jupyter_execute_notebooks": "off", "nb_merge_streams": True},
)
def test_merge_streams(sphinx_run, file_regression):
    sphinx_run.build()
    assert sphinx_run.warnings() == ""
    doctree = sphinx_run.get_resolved_doctree("merge_streams")
    file_regression.check(doctree.pformat(), extension=".xml", encoding="utf8")


@pytest.mark.sphinx_params(
    "metadata_image.ipynb",
    conf={"jupyter_execute_notebooks": "off", "nb_render_key": "myst"},
23 changes: 23 additions & 0 deletions tests/test_render_outputs/test_merge_streams.xml
@@ -0,0 +1,23 @@
<document source="merge_streams">
    <CellNode cell_type="code" classes="cell">
        <CellInputNode classes="cell_input">
            <literal_block language="ipython3" linenos="False" xml:space="preserve">
                import sys
                print('stdout1', file=sys.stdout)
                print('stdout2', file=sys.stdout)
                print('stderr1', file=sys.stderr)
                print('stderr2', file=sys.stderr)
                print('stdout3', file=sys.stdout)
                print('stderr3', file=sys.stderr)
                1
        <CellOutputNode classes="cell_output">
            <literal_block classes="output stream" language="myst-ansi" linenos="False" xml:space="preserve">
Member Author

here they are all in single blocks

Member

@mmcky Oct 4, 2021

ah - thanks @chrisjsewell, that's helpful. So the order will always be the same (so long as the code-cell contents don't change).

Member Author

Yes, the order within stdout and within stderr is guaranteed, so each can be merged into a single group. The order of stdout relative to stderr is not guaranteed (they are independent streams), so stderr is always placed after stdout.

                stdout1
                stdout2
                stdout3
            <literal_block classes="output stderr" language="myst-ansi" linenos="False" xml:space="preserve">
                stderr1
                stderr2
                stderr3
            <literal_block classes="output text_plain" language="myst-ansi" linenos="False" xml:space="preserve">
                1
4 changes: 2 additions & 2 deletions tox.ini
@@ -11,9 +11,9 @@
# then deleting compiled files has been found to fix it: `find . -name \*.pyc -delete`

[tox]
-envlist = py37-sphinx3
+envlist = py37-sphinx4

-[testenv:py{36,37,38,39}-sphinx{3,4}]
+[testenv:py{37,38,39}-sphinx{3,4}]
extras = testing
deps =
    sphinx3: sphinx>=3,<4