Improve the full diff by having more consistent indentation in the PrettyPrinter #11571

Merged


BenjaminSchubert
Contributor

Overview

Note: This is an alternative implementation to #11537, and it vendors the pprint module in.

The default pretty printer is not great when objects are nested: the diffs it produces can get hard to read.

Instead, provide a pretty printer that behaves more like indented JSON output, which allows for smaller, more meaningful differences, at the expense of a slightly longer diff.

This also has the nice side effect of making diffs stable across Python versions, which they were not previously because, for example, dataclass support was only added to pprint in Python 3.10.
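To make the idea concrete with the standard library only, the sketch below compares a diff of `pprint.pformat` output with a diff of one-item-per-line output. It uses `json.dumps(..., indent=4)` purely as a stand-in for the proposed layout; that is an assumption for illustration, since the printer in this PR is a modified copy of pprint and does not emit JSON. The sample data and the helper `ndiff_lines` are hypothetical.

```python
import difflib
import json
import pprint
from typing import List

# Hypothetical sample data, loosely modelled on the full example in this PR.
left = {"name": {"first": "Zeyneb", "last": "Elfring"}, "tags": ["a", "b"]}
right = {"name": {"first": "zeyneb", "last": "Elfring"}, "tags": ["a", "b", "c"]}


def ndiff_lines(a: str, b: str) -> List[str]:
    """Diff two multi-line strings line by line, as the script above does."""
    return [line.rstrip() for line in difflib.ndiff(a.splitlines(), b.splitlines())]


# Stdlib pprint aligns nested items under their opening bracket.
compact = ndiff_lines(pprint.pformat(left, width=40), pprint.pformat(right, width=40))

# Stand-in for the proposed layout: one item per line, fixed indentation.
indented = ndiff_lines(json.dumps(left, indent=4), json.dumps(right, indent=4))

print("\n".join(compact))
print("\n".join(indented))
```

Note that JSON has no trailing commas, so appending `"c"` also touches the preceding `"b"` line. The output shown in the examples of this PR ends every item with a comma, which avoids that extra changed line.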

Fix #1531

Alternatives/Potential improvements

This has the disadvantage that diffs are now longer, even for small changes (like [1, 2] == [1, 3]). We could potentially still keep the previous implementation for the case where the diff is the same length AND has a single line. This would take care of trivial cases. It would however make some diffs harder to read again, like [1, 2, 3] == [2, 3, 4], which would now show 3 differences.

This is however not generalisable to deeply nested payloads.
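The trade-off described above can be reproduced with `difflib` alone. This is a minimal sketch, assuming a hypothetical helper `one_item_per_line` that mimics the proposed layout for flat lists only; it is not the code from this PR.

```python
import difflib
from typing import List


def one_item_per_line(items: list) -> List[str]:
    """Hypothetical stand-in for the proposed layout (flat lists only)."""
    return ["[", *(f"    {item!r}," for item in items), "]"]


def diff(left: List[str], right: List[str]) -> List[str]:
    return [line.rstrip() for line in difflib.ndiff(left, right)]


# Previous behaviour for short values: everything on a single line.
single_line = diff([repr([1, 2, 3])], [repr([2, 3, 4])])

# Proposed behaviour: one item per line.
per_item = diff(one_item_per_line([1, 2, 3]), one_item_per_line([2, 3, 4]))

print("\n".join(single_line))
print("\n".join(per_item))
```

With one item per line, the diff reports a single removed line (`1,`) and a single added line (`4,`), whereas the single-line form leaves the reader to compare the two lines character by character.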

Notes for maintainers

This is the requested alternative to #11537, which vendors the pprint module in, and then modifies the class in place.

The first commit vendors the module in, and makes it pass linting. Note that only the required parts of the module are imported.

The second commit makes the modification and adds the same tests as #11537.

It is possible that we could still simplify the logic (e.g. always computing the indentation based on the level, instead of passing both). I believe this might be easier as a subsequent PR if requested, but happy to try and simplify this one if wanted.
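As an illustration of that possible simplification, the sketch below derives the indentation from the nesting level alone. The function `format_nested` and the constant `INDENT_PER_LEVEL` are hypothetical and are not part of the vendored module; only dicts and lists are handled.

```python
from typing import Any, List

INDENT_PER_LEVEL = 4  # hypothetical constant, matching the examples in this PR


def format_nested(obj: Any, level: int = 0) -> List[str]:
    """Return one line per item; indentation is derived from `level` only."""
    pad = " " * (INDENT_PER_LEVEL * (level + 1))
    closing_pad = " " * (INDENT_PER_LEVEL * level)
    if isinstance(obj, dict):
        opener, closer = "{", "}"
        items = [(repr(key) + ": ", value) for key, value in obj.items()]
    elif isinstance(obj, list):
        opener, closer = "[", "]"
        items = [("", value) for value in obj]
    else:
        return [repr(obj)]
    if not items:
        return [opener + closer]

    lines = [opener]
    for prefix, value in items:
        child = format_nested(value, level + 1)
        # The first child line continues the current line (e.g. "'a': [").
        lines.append(pad + prefix + child[0])
        # Inner child lines are already indented for their own level.
        lines.extend(child[1:-1])
        if len(child) > 1:
            lines.append(child[-1] + ",")
        else:
            lines[-1] += ","
    lines.append(closing_pad + closer)
    return lines


print("\n".join(format_nested({"a": [1, 2], "b": 3})))
```

Because every line computes its padding from `level`, no separate `indent` argument has to be threaded through the recursive calls.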

Examples

Basic

Generated using the following script:
from collections import Counter, defaultdict, deque
from dataclasses import dataclass
from functools import partial
import pprint
from types import SimpleNamespace
from typing import Optional, Dict, Any, IO, List
import difflib

from _pytest._io.pprint import PrettyPrinter


###
# Original pytest diff, copied in
###
class AlwaysDispatchingPrettyPrinter(pprint.PrettyPrinter):
  """PrettyPrinter that always dispatches (regardless of width)."""

  def _format(
      self,
      object: object,
      stream: IO[str],
      indent: int,
      allowance: int,
      context: Dict[int, Any],
      level: int,
  ) -> None:
      # Type ignored because _dispatch is private.
      p = self._dispatch.get(type(object).__repr__, None)  # type: ignore[attr-defined]

      objid = id(object)
      if objid in context or p is None:
          # Type ignored because _format is private.
          super()._format(  # type: ignore[misc]
              object,
              stream,
              indent,
              allowance,
              context,
              level,
          )
          return

      context[objid] = 1
      p(self, object, stream, indent, allowance, context, level + 1)
      del context[objid]


def _pformat_dispatch_original(
  object: object,
  indent: int = 1,
  width: int = 80,
  depth: Optional[int] = None,
  *,
  compact: bool = False,
) -> str:
  return AlwaysDispatchingPrettyPrinter(
      indent=indent, width=width, depth=depth, compact=compact
  ).pformat(object)


def _surrounding_parens_on_own_lines(lines: List[str]) -> None:
  """Move opening/closing parenthesis/bracket to own lines."""
  opening = lines[0][:1]
  if opening in ["(", "[", "{"]:
      lines[0] = " " + lines[0][1:]
      lines[:] = [opening] + lines
  closing = lines[-1][-1:]
  if closing in [")", "]", "}"]:
      lines[-1] = lines[-1][:-1] + ","
      lines[:] = lines + [closing]


def original_diff(left, right):
  left_formatting = pprint.pformat(left).splitlines()
  right_formatting = pprint.pformat(right).splitlines()

  # Re-format for different output lengths.
  lines_left = len(left_formatting)
  lines_right = len(right_formatting)
  if lines_left != lines_right:
      left_formatting = _pformat_dispatch_original(left).splitlines()
      right_formatting = _pformat_dispatch_original(right).splitlines()

  if lines_left > 1 or lines_right > 1:
      _surrounding_parens_on_own_lines(left_formatting)
      _surrounding_parens_on_own_lines(right_formatting)

  return left_formatting, right_formatting


###
# Script to generate the diffs
###

TABLE = """
<table>
<tr>
<th>Test</th>
<th>Main</th>
<th>Proposal</th>
</tr>
{rows}
</table>
"""

ROW = """
<tr>
<td colspan=2>

\`\`\`python
{python}
\`\`\`
</td>
</tr>
<tr>
<td>

\`\`\`diff
{diff_original}
\`\`\`
</td>

<td>

\`\`\`diff
{diff_new}
\`\`\`
</td>
</tr>
"""


def get_row(left, right):
  original = "\n".join(
      line.rstrip() for line in difflib.ndiff(*original_diff(left, right))
  )
  new = "\n".join(
      line.rstrip()
      for line in difflib.ndiff(
          PrettyPrinter().pformat(left).splitlines(), PrettyPrinter().pformat(right).splitlines()
      )
  )

  fmt = partial(pprint.pformat, indent=2, width=60)
  return f"{fmt(left)} \\ \n == {fmt(right)}", original, new


@dataclass
class DataclassWithTwoItems:
  foo: str
  bar: str


rows = [
  get_row(left, right)
  for left, right in [
      [{"one": 1, "two": 2}, {"three": 1, "two": 3}],
      [[1, 2], [1, 3]],
      [(1,), (2,)],
      [(1, 2), (1, 3)],
      [{1, 2}, {1, 3}],
      [SimpleNamespace(one=1, two=2), SimpleNamespace(one=2, three=2)],
      [
          defaultdict(str, {"one": "1", "two": "2"}),
          defaultdict(str, {"one": "1", "two": "3"}),
      ],
      [Counter("121"), Counter("122")],
      [deque([1, 2]), deque([1, 3])],
      [deque([1, 2], maxlen=3), deque([1, 3], maxlen=4)],
      [
          {
              "counter": Counter("122"),
              "dataclass": DataclassWithTwoItems(foo="foo", bar="bar"),
              "defaultdict": defaultdict(str, {"one": "1", "two": "2"}),
              "deque": deque([1, 2], maxlen=3),
              "dict": {"one": 1, "two": 2},
              "list": [1, 2],
              "set": {1, 2},
              "simplenamespace": SimpleNamespace(one=1, two=2),
              "tuple": (1, 2),
          },
          {
              "counter": Counter("121"),
              "dataclass": DataclassWithTwoItems(foo="foo", bar="baz"),
              "defaultdict": defaultdict(str, {"three": "1", "two": "3"}),
              "deque": deque([1, 3], maxlen=3),
              "dict": {"one": 1, "two": 3},
              "list": [1, 2, 3],
              "set": {1, 3},
              "simplenamespace": SimpleNamespace(one=1, two=2, three=3),
              "tuple": (1,),
          },
      ],
  ]
]

print(
  TABLE.format(
      rows="\n".join(
          [
              ROW.format(python=row[0], diff_original=row[1], diff_new=row[2])
              for row in rows
          ]
      )
  )
)
We get the following differences on small entries:
Each entry below shows the test expression first, then the diff on main, then the diff with this proposal.
{'one': 1, 'two': 2} \ 
 == {'three': 1, 'two': 3}
- {'one': 1, 'two': 2}
?   ^^              ^
+ {'three': 1, 'two': 3}
?   ^^^^              ^
  {
-     'one': 1,
?      ^^
+     'three': 1,
?      ^^^^
-     'two': 2,
?            ^
+     'two': 3,
?            ^
  }
[1, 2] \ 
 == [1, 3]
- [1, 2]
?     ^
+ [1, 3]
?     ^
  [
      1,
-     2,
?     ^
+     3,
?     ^
  ]
(1,) \ 
 == (2,)
- (1,)
?  ^
+ (2,)
?  ^
  (
-     1,
?     ^
+     2,
?     ^
  )
(1, 2) \ 
 == (1, 3)
- (1, 2)
?     ^
+ (1, 3)
?     ^
  (
      1,
-     2,
?     ^
+     3,
?     ^
  )
{1, 2} \ 
 == {1, 3}
- {1, 2}
?     ^
+ {1, 3}
?     ^
  {
      1,
-     2,
?     ^
+     3,
?     ^
  }
namespace(one=1, two=2) \ 
 == namespace(one=2, three=2)
- namespace(one=1, two=2)
?               ^   ^^
+ namespace(one=2, three=2)
?               ^   ^^^^
  namespace(
-     one=1,
?         ^
+     one=2,
?         ^
-     two=2,
+     three=2,
  )
defaultdict(<class 'str'>, {'one': '1', 'two': '2'}) \ 
 == defaultdict(<class 'str'>, {'one': '1', 'two': '3'})
- defaultdict(<class 'str'>, {'one': '1', 'two': '2'})
?                                                 ^
+ defaultdict(<class 'str'>, {'one': '1', 'two': '3'})
?                                                 ^
  defaultdict(<class 'str'>, {
      'one': '1',
-     'two': '2',
?             ^
+     'two': '3',
?             ^
  })
Counter({'1': 2, '2': 1}) \ 
 == Counter({'2': 2, '1': 1})
- Counter({'1': 2, '2': 1})
+ Counter({'2': 2, '1': 1})
  Counter({
-     '1': 2,
?      ^
+     '2': 2,
?      ^
-     '2': 1,
?      ^
+     '1': 1,
?      ^
  })
deque([1, 2]) \ 
 == deque([1, 3])
- deque([1, 2])
?           ^
+ deque([1, 3])
?           ^
  deque([
      1,
-     2,
?     ^
+     3,
?     ^
  ])
deque([1, 2], maxlen=3) \ 
 == deque([1, 3], maxlen=4)
- deque([1, 2], maxlen=3)
?           ^          ^
+ deque([1, 3], maxlen=4)
?           ^          ^
- deque(maxlen=3, [
?              ^
+ deque(maxlen=4, [
?              ^
      1,
-     2,
?     ^
+     3,
?     ^
  ])
{ 'counter': Counter({'2': 2, '1': 1}),
  'dataclass': DataclassWithTwoItems(foo='foo', bar='bar'),
  'defaultdict': defaultdict(<class 'str'>,
                             { 'one': '1',
                               'two': '2'}),
  'deque': deque([1, 2], maxlen=3),
  'dict': {'one': 1, 'two': 2},
  'list': [1, 2],
  'set': {1, 2},
  'simplenamespace': namespace(one=1, two=2),
  'tuple': (1, 2)} \ 
 == { 'counter': Counter({'1': 2, '2': 1}),
  'dataclass': DataclassWithTwoItems(foo='foo', bar='baz'),
  'defaultdict': defaultdict(<class 'str'>,
                             { 'three': '1',
                               'two': '3'}),
  'deque': deque([1, 3], maxlen=3),
  'dict': {'one': 1, 'two': 3},
  'list': [1, 2, 3],
  'set': {1, 3},
  'simplenamespace': namespace(one=1, two=2, three=3),
  'tuple': (1,)}
  {
-  'counter': Counter({'2': 2, '1': 1}),
?                          --------
+  'counter': Counter({'1': 2, '2': 1}),
?                       ++++++++
-  'dataclass': DataclassWithTwoItems(foo='foo', bar='bar'),
?                                                       ^
+  'dataclass': DataclassWithTwoItems(foo='foo', bar='baz'),
?                                                       ^
-  'defaultdict': defaultdict(<class 'str'>, {'one': '1', 'two': '2'}),
?                                              ^^                 ^
+  'defaultdict': defaultdict(<class 'str'>, {'three': '1', 'two': '3'}),
?                                              ^^^^                 ^
-  'deque': deque([1, 2], maxlen=3),
?                     ^
+  'deque': deque([1, 3], maxlen=3),
?                     ^
-  'dict': {'one': 1, 'two': 2},
?                            ^
+  'dict': {'one': 1, 'two': 3},
?                            ^
-  'list': [1, 2],
+  'list': [1, 2, 3],
?               +++
-  'set': {1, 2},
?             ^
+  'set': {1, 3},
?             ^
-  'simplenamespace': namespace(one=1, two=2),
+  'simplenamespace': namespace(one=1, two=2, three=3),
?                                           +++++++++
-  'tuple': (1, 2),
?              --
+  'tuple': (1,),
  }
  {
      'counter': Counter({
-         '2': 2,
?          ^
+         '1': 2,
?          ^
-         '1': 1,
?          ^
+         '2': 1,
?          ^
      }),
      'dataclass': DataclassWithTwoItems(
          foo='foo',
-         bar='bar',
?                ^
+         bar='baz',
?                ^
      ),
      'defaultdict': defaultdict(<class 'str'>, {
-         'one': '1',
?          ^^
+         'three': '1',
?          ^^^^
-         'two': '2',
?                 ^
+         'two': '3',
?                 ^
      }),
      'deque': deque(maxlen=3, [
          1,
-         2,
?         ^
+         3,
?         ^
      ]),
      'dict': {
          'one': 1,
-         'two': 2,
?                ^
+         'two': 3,
?                ^
      },
      'list': [
          1,
          2,
+         3,
      ],
      'set': {
          1,
-         2,
?         ^
+         3,
?         ^
      },
      'simplenamespace': namespace(
          one=1,
          two=2,
+         three=3,
      ),
      'tuple': (
          1,
-         2,
      ),
  }

Full example

Taking the example from https://github.com/lukaszb/pytest-dictsdiff, as in the issue:

Previously:

- {
-  'cell': '(056)-022-8631',
-  'dob': {'age': 34,
+ OrderedDict([('cell',
+               '(056)-022-8631'),
+              ('dob',
+               {'age': 44,
-          'date': '1953-11-04T01:21:04Z'},
?                     ^              ^
+                'date': '1983-11-04T01:21:14Z'}),
? ++++++                    ^              ^    +
-  'email': '[email protected]',
-  'gender': 'female',
-  'id': {'name': 'BSN',
+              ('email',
+               '[email protected]'),
+              ('gender',
+               'female'),
+              ('id',
+               {'name': 'BSN',
-         'value': '36180866'},
+                'value': '36180866'}),
? +++++++                            +
-  'location': {'city': 'Tholen',
+              ('location',
+               {'city': 'tholen',
-               'coordinates': {'latitude': '46.8823',
+                'coordinates': {'latitude': '46.8823',
? +
-                               'longitude': '175.8856'},
+                                'longitude': '175.8856'},
? +
-               'postcode': 64509,
?                               ^
+                'postcode': 64504,
? +                              ^
-               'state': 'groningen',
+                'state': 'groningen',
? +
-               'street': '2074 adriaen van ostadelaan',
+                'street': '2074 adriaen van ostadelaan',
? +
-               'timezone': {'description': 'Adelaide, Darwin',
+                'timezone': {'description': 'Adelaide, Darwin',
? +
-                            'offset': '+9:30'}},
+                             'offset': '+9:30'}}),
? +                                              +
+              ('login',
-  'login': {'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
?  ^^^^^^^^
+               {'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
?  ^^^^^^^^^^^^
-            'password': 'ontario',
+                'password': 'ontario',
? ++++
-            'salt': 'QVBKgEjy',
+                'salt': 'QVBKgEjy',
? ++++
-            'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
+                'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
? ++++
-            'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
+                'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
? ++++
-            'username': 'smallgorilla897',
+                'username': 'smallgorilla897',
? ++++
-            'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa'},
+                'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa'}),
? ++++                                                          +
-  'name': {'first': 'Zeyneb',
+              ('name',
+               {'first': 'zeyneb',
-           'last': 'Elfring',
?                    ^
+                'last': 'elfring',
? +++++                   ^
-           'title': 'mrs'},
+                'title': 'mrs'}),
? +++++                         +
-  'nat': 'NL',
-  'phone': '(209)-143-9697',
+              ('nat',
+               'NL'),
+              ('phone',
+               '(209)-143-9697'),
+              ('picture',
-  'picture': {'large': 'https://randomuser.me/api/portraits/women/37.jpg',
?  ^^^^^^^^^^
+               {'large': 'https://randomuser.me/api/portraits/women/37.jpg',
?  ^^^^^^^^^^^^
-              'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
+                'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
? ++
-              'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg'},
+                'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg'}),
? ++                                                                                   +
-  'registered': {'age': 3,
+              ('registered',
+               {'age': 3,
-                 'date': '2014-12-07T06:54:14Z'},
? -
+                'date': '2014-12-07T06:54:14Z'})],
?                                               ++
- }

Now:

- {
+ OrderedDict({
      'cell': '(056)-022-8631',
      'dob': {
-         'age': 34,
?                ^
+         'age': 44,
?                ^
-         'date': '1953-11-04T01:21:04Z',
?                    ^              ^
+         'date': '1983-11-04T01:21:14Z',
?                    ^              ^
      },
      'email': '[email protected]',
      'gender': 'female',
      'id': {
          'name': 'BSN',
          'value': '36180866',
      },
      'location': {
-         'city': 'Tholen',
?                  ^
+         'city': 'tholen',
?                  ^
          'coordinates': {
              'latitude': '46.8823',
              'longitude': '175.8856',
          },
-         'postcode': 64509,
?                         ^
+         'postcode': 64504,
?                         ^
          'state': 'groningen',
          'street': '2074 adriaen van ostadelaan',
          'timezone': {
              'description': 'Adelaide, Darwin',
              'offset': '+9:30',
          },
      },
      'login': {
          'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
          'password': 'ontario',
          'salt': 'QVBKgEjy',
          'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
          'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
          'username': 'smallgorilla897',
          'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa',
      },
      'name': {
-         'first': 'Zeyneb',
?                   ^
+         'first': 'zeyneb',
?                   ^
-         'last': 'Elfring',
?                  ^
+         'last': 'elfring',
?                  ^
          'title': 'mrs',
      },
      'nat': 'NL',
      'phone': '(209)-143-9697',
      'picture': {
          'large': 'https://randomuser.me/api/portraits/women/37.jpg',
          'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
          'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg',
      },
      'registered': {
          'age': 3,
          'date': '2014-12-07T06:54:14Z',
      },
- }
+ })

@bluetech
Member

Thanks @BenjaminSchubert, sorry for the delay in reviewing.

First, I think we're in agreement that the new formatting is nicer than the existing one, so we can consider this aspect as accepted.

Procedurally I'd prefer it if the vendoring change (your first commit) is done in its own PR which we can merge first, then the formatting change in a separate commit. It'd be easier to handle this way.

Regarding the first commit message

> This takes in the version of the pprint module that is used from python3.12, essentially backporting it for python3.8 and 3.9.
>
> This will allow us to make changes in how it represents objects without relying on private methods.
>
> Only the required API surface was copied, the rest is left out.
>
> Note that we still continue to use the upstream version for every other part of the system. It might be worth using it everywhere, but is probably too many changes at once.

I think we should replace all usages of pprint with our vendored copy from the start, for these reasons:

  • Consistency
  • Presumably ours is better :)
  • If we're going to cut the parts we don't use, let's make sure the other pprint-using code doesn't need them either
  • I think it should just be a matter of replacing import pprint with our import? Probably some tests will fail due to the different formatting, but hopefully not too much work to fix either, and will provide us with more evaluation data.
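The import swap suggested in the last bullet could look roughly like the sketch below. The module path `_pytest._io.pprint` is taken from the script in the PR description; the fallback to the stdlib is an assumption added only so the snippet also runs where that private module is not available.

```python
# Assumption: the vendored copy lives at _pytest._io.pprint, as in the script
# in the PR description. The fallback exists only to keep this sketch runnable
# without that private module; real pytest code would not need it.
try:
    from _pytest._io.pprint import PrettyPrinter
except ImportError:
    from pprint import PrettyPrinter

formatted = PrettyPrinter().pformat({"one": 1, "two": [1, 2]})
print(formatted)
```

Call sites that use `pprint.pformat(obj)` directly would then switch to `PrettyPrinter().pformat(obj)`, and any tests that assert on the exact formatting would need updating.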

Regarding the technical vendoring aspect, I wonder if we should turn off black and linting for this file so as to make future syncs from upstream easier, or if we should just leave upstream behind and not look back... I'm not sure myself - WDYT?

@BenjaminSchubert
Contributor Author

BenjaminSchubert commented Nov 13, 2023

@bluetech thanks a lot for the update, no worries at all :)

> Procedurally I'd prefer it if the vendoring change (your first commit) is done in its own PR which we can merge first, then the formatting change in a separate commit. It'd be easier to handle this way.

Sure, I'll do this as soon as we have clarified the other points :)

> I think we should replace all usages of pprint with our vendored copy from the start, for these reasons:

One caveat: I very rarely use the -v mode. Most of my pytest invocations are set up to use -vv at all times, which is the mode I personally prefer, so I might be biased :)

> Consistency

Agreed that consistency is important. Technically, the verbose output (differing items, common items, etc.) is already inconsistent: the differing items use SafeRepr, whereas the other sections use pprint.pformat. I do think optimizing for the context potentially makes sense, which, as I understand it, was the intent here before?

> Presumably ours is better :)

I do agree it gives much, much better diffs (I would not be proposing this change otherwise :P). However, the outputs are also longer, which reduces the amount of information that fits on one screen, so in cases where the output doesn't need to be compared, it might become less readable.

As such, for the verbose mode I am not 100% sure this is the right call.

To showcase my concerns, here's a (probably rare, exaggerated) failure:

Keeping the old pprint for the `-v` mode
E         Common items:
E         {'email': '[email protected]',
E          'gender': 'female',
E          'id': {'name': 'BSN', 'value': '36180866'},
E          'login': {'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
E                    'password': 'ontario',
E                    'salt': 'QVBKgEjy',
E                    'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
E                    'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
E                    'username': 'smallgorilla897',
E                    'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa'},
E          'nat': 'NL',
E          'phone': '(209)-143-9697',
E          'picture': {'large': 'https://randomuser.me/api/portraits/women/37.jpg',
E                      'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
E                      'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg'},
E          'registered': {'age': 3, 'date': '2014-12-07T06:54:14Z'}}
E         Differing items:
E         {'location': {'city': 'tholen', 'coordinates': {'latitude': '46.8823', 'longitude': '175.8856'}, 'postcode': 64504, 'state': 'groningen', ...}} != {'location': {'city': 'Tholen', 'coordinates': {'latitude': '46.8823', 'longitude': '175.8856'}, 'postcode': 64509, 'state': 'groningen', ...}}
E         {'name': {'first': 'zeyneb', 'last': 'elfring', 'title': 'mrs'}} != {'name': {'first': 'Zeyneb', 'last': 'Elfring', 'title': 'mrs'}}
E         {'dob': {'age': 44, 'date': '1983-11-04T01:21:14Z'}} != {'dob': {'age': 34, 'date': '1953-11-04T01:21:04Z'}}
E         Left contains 1 more item:
E         {'cell': '(056)-022-8631'}
E         Right contains 1 more item:
E         {'cellphone': '(056)-022-8631'}
E         Full diff:
E         - {
E         + OrderedDict({
E         -     'cellphone': '(056)-022-8631',
E         ?          -----
E         +     'cell': '(056)-022-8631',
E               'dob': {
E         -         'age': 34,
E         ?                ^
E         +         'age': 44,
E         ?                ^
E         -         'date': '1953-11-04T01:21:04Z',
E         ?                    ^              ^
E         +         'date': '1983-11-04T01:21:14Z',
E         ?                    ^              ^
E               },
E               'email': '[email protected]',
E               'gender': 'female',
E               'id': {
E                   'name': 'BSN',
E                   'value': '36180866',
E               },
E               'location': {
E         -         'city': 'Tholen',
E         ?                  ^
E         +         'city': 'tholen',
E         ?                  ^
E                   'coordinates': {
E                       'latitude': '46.8823',
E                       'longitude': '175.8856',
E                   },
E         -         'postcode': 64509,
E         ?                         ^
E         +         'postcode': 64504,
E         ?                         ^
E                   'state': 'groningen',
E                   'street': '2074 adriaen van ostadelaan',
E                   'timezone': {
E                       'description': 'Adelaide, Darwin',
E                       'offset': '+9:30',
E                   },
E               },
E               'login': {
E                   'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
E                   'password': 'ontario',
E                   'salt': 'QVBKgEjy',
E                   'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
E                   'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
E                   'username': 'smallgorilla897',
E                   'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa',
E               },
E               'name': {
E         -         'first': 'Zeyneb',
E         ?                   ^
E         +         'first': 'zeyneb',
E         ?                   ^
E         -         'last': 'Elfring',
E         ?                  ^
E         +         'last': 'elfring',
E         ?                  ^
E                   'title': 'mrs',
E               },
E               'nat': 'NL',
E               'phone': '(209)-143-9697',
E               'picture': {
E                   'large': 'https://randomuser.me/api/portraits/women/37.jpg',
E                   'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
E                   'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg',
E               },
E               'registered': {
E                   'age': 3,
E                   'date': '2014-12-07T06:54:14Z',
E               },
E         - }
E         + })
Using the new pprint for the `-v` mode
E         Common items:
E         {
E             'email': '[email protected]',
E             'gender': 'female',
E             'id': {
E                 'name': 'BSN',
E                 'value': '36180866',
E             },
E             'login': {
E                 'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
E                 'password': 'ontario',
E                 'salt': 'QVBKgEjy',
E                 'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
E                 'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
E                 'username': 'smallgorilla897',
E                 'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa',
E             },
E             'nat': 'NL',
E             'phone': '(209)-143-9697',
E             'picture': {
E                 'large': 'https://randomuser.me/api/portraits/women/37.jpg',
E                 'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
E                 'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg',
E             },
E             'registered': {
E                 'age': 3,
E                 'date': '2014-12-07T06:54:14Z',
E             },
E         }
E         Differing items:
E         {'name': {'first': 'zeyneb', 'last': 'elfring', 'title': 'mrs'}} != {'name': {'first': 'Zeyneb', 'last': 'Elfring', 'title': 'mrs'}}
E         {'location': {'city': 'tholen', 'coordinates': {'latitude': '46.8823', 'longitude': '175.8856'}, 'postcode': 64504, 'state': 'groningen', ...}} != {'location': {'city': 'Tholen', 'coordinates': {'latitude': '46.8823', 'longitude': '175.8856'}, 'postcode': 64509, 'state': 'groningen', ...}}
E         {'dob': {'age': 44, 'date': '1983-11-04T01:21:14Z'}} != {'dob': {'age': 34, 'date': '1953-11-04T01:21:04Z'}}
E         Left contains 1 more item:
E         {
E             'cell': '(056)-022-8631',
E         }
E         Right contains 1 more item:
E         {
E             'cellphone': '(056)-022-8631',
E         }
E         Full diff:
E         - {
E         + OrderedDict({
E         -     'cellphone': '(056)-022-8631',
E         ?          -----
E         +     'cell': '(056)-022-8631',
E               'dob': {
E         -         'age': 34,
E         ?                ^
E         +         'age': 44,
E         ?                ^
E         -         'date': '1953-11-04T01:21:04Z',
E         ?                    ^              ^
E         +         'date': '1983-11-04T01:21:14Z',
E         ?                    ^              ^
E               },
E               'email': '[email protected]',
E               'gender': 'female',
E               'id': {
E                   'name': 'BSN',
E                   'value': '36180866',
E               },
E               'location': {
E         -         'city': 'Tholen',
E         ?                  ^
E         +         'city': 'tholen',
E         ?                  ^
E                   'coordinates': {
E                       'latitude': '46.8823',
E                       'longitude': '175.8856',
E                   },
E         -         'postcode': 64509,
E         ?                         ^
E         +         'postcode': 64504,
E         ?                         ^
E                   'state': 'groningen',
E                   'street': '2074 adriaen van ostadelaan',
E                   'timezone': {
E                       'description': 'Adelaide, Darwin',
E                       'offset': '+9:30',
E                   },
E               },
E               'login': {
E                   'md5': 'bafe8cf9d37806a7b13edc218d5ff762',
E                   'password': 'ontario',
E                   'salt': 'QVBKgEjy',
E                   'sha1': 'cacef09ff61072d1c55732963766fa84e919aa7a',
E                   'sha256': 'cc86af47aedbdbb1de73ff10484996fe9785c47c0fc191b7c67eaf71e0782300',
E                   'username': 'smallgorilla897',
E                   'uuid': '37e30c59-bc79-4172-aac6-e2c640e165fa',
E               },
E               'name': {
E         -         'first': 'Zeyneb',
E         ?                   ^
E         +         'first': 'zeyneb',
E         ?                   ^
E         -         'last': 'Elfring',
E         ?                  ^
E         +         'last': 'elfring',
E         ?                  ^
E                   'title': 'mrs',
E               },
E               'nat': 'NL',
E               'phone': '(209)-143-9697',
E               'picture': {
E                   'large': 'https://randomuser.me/api/portraits/women/37.jpg',
E                   'medium': 'https://randomuser.me/api/portraits/med/women/37.jpg',
E                   'thumbnail': 'https://randomuser.me/api/portraits/thumb/women/37.jpg',
E               },
E               'registered': {
E                   'age': 3,
E                   'date': '2014-12-07T06:54:14Z',
E               },
E         - }
E         + })

Ultimately, I am happy to go either way if you think using it for everything is nicer, I just wanted to show an example of the change before doing it. Let me know which direction you prefer.

> Regarding the technical vendoring aspect, I wonder if we should turn off black and linting for this file so as to make future syncs from upstream easier, or if we should just leave upstream behind and not look back... I'm not sure myself - WDYT?

I never know what's best. The advantage of linting and formatting it like the rest of the codebase is that it feels like part of the codebase and is easier for contributors, but syncing is harder. I don't think the pprint code in upstream Python will change often, but we don't really know. I slightly prefer linting/formatting it the same way. Unless there are big refactors upstream, it should still not be too hard to sync.

@bluetech
Member

Ultimately, I am happy to go either way if you think using it for everything is nicer, I just wanted to show an example of the change before doing it. Let me know which direction you prefer.

OK, let's keep using the stdlib pprint for other stuff, and we can maybe migrate them later but as a separate concern.

I never know what's best. The advantage of linting it like the rest is that it feels like part of the codebase and is easier for contributors, but syncing is harder. I don't think the pprint codebase in upstream Python will change often, but we don't really know. I slightly prefer linting/formatting it the same way. Unless there are big refactors upstream, it should still not be too hard to sync.

I agree. Let's assimilate fully with pytest, and not worry about upstream syncs too much. This will allow us to freely improve and adjust the code to pytest's needs, and have proper linting and formatting. I'd also like to type annotate the code later (unless you feel like doing it already).


So here's how I think it would be best to do the vendoring part:

  1. First commit: Copy stdlib pprint.py verbatim, ignoring formatting and linting using # fmt: off, # flake8: noqa, etc. at the top of the file.
  2. Second commit: Delete parts we don't need.
  3. Third commit: Apply formatting and linting.
  4. Fourth commit: Integrate AlwaysDispatchingPrettyPrinter into _pytest.io.PrettyPrinter and switch to it.

That should bring us to the current state in main, but with a cleaned up vendored pprint and clear provenance. After that you can do the new improvements.

BTW I know you just wanted to make some improvements and I side-tracked you with this vendoring stuff, I hope you're not too annoyed with that :)

@BenjaminSchubert
Contributor Author

Ok, #11626 contains the vendoring, and this one has now been rebased on top of it. Once the other is in, I'll rebase again. In the meantime I'll put it to draft.

BTW I know you just wanted to make some improvements and I side-tracked you with this vendoring stuff, I hope you're not too annoyed with that :)

No worries at all, those are all reasonable asks. I am glad you are open to taking in so much new code to rewrite those diffs :D

@BenjaminSchubert BenjaminSchubert force-pushed the bschubert/nicer-comparisons-vendor branch from a6b35d8 to 86545ed Compare November 20, 2023 16:43
@BenjaminSchubert
Contributor Author

This is now ready for review :)

The normal default pretty printer is not great when objects are nested,
and it can get hard to read the diff.

Instead, provide a pretty printer that behaves more like indented JSON,
which allows for smaller, more meaningful differences, at
the expense of a slightly longer diff.

This does not touch the other places where the pretty printer is used;
it only updates the full diff one.
@BenjaminSchubert BenjaminSchubert force-pushed the bschubert/nicer-comparisons-vendor branch from 86545ed to 445687c Compare November 20, 2023 19:03
Member

@nicoddemus left a comment

Great work! I did not review the code itself, only the changelog and the test outcomes. 👍

changelog/1531.improvement.rst (outdated)
@@ -677,8 +695,13 @@ def test_dict_different_items(self) -> None:
"Right contains 2 more items:",
"{'b': 1, 'c': 2}",
"Full diff:",
"- {'b': 1, 'c': 2}",
Member

Indeed, for very small diffs the change makes them harder to read, but those are not problematic anyway. The problematic diffs are the long ones, which are greatly improved, so I think the trade-off is valid. 👍

Contributor Author

We could also first do a diff with the normal pprint and, if both entries are single-line, use it; otherwise use the long one, if that's preferred.
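A minimal sketch of that fallback, not the code from the PR: `lines_for_diff` and `indented_format` are hypothetical names, and `json.dumps(..., indent=4)` is only a stand-in for the new one-item-per-line printer (it handles JSON-compatible values only).

```python
import json
import pprint


def indented_format(obj):
    # Stand-in (assumption) for the one-item-per-line pretty printer.
    return json.dumps(obj, indent=4)


def lines_for_diff(left, right, multiline_format=indented_format):
    """Return the lines to diff for each side.

    The compact stdlib form is used only when BOTH sides fit on one line;
    otherwise both sides use the multi-line form so the diff stays aligned.
    """
    left_compact = pprint.pformat(left)
    right_compact = pprint.pformat(right)
    if "\n" not in left_compact and "\n" not in right_compact:
        return [left_compact], [right_compact]
    return (
        multiline_format(left).splitlines(),
        multiline_format(right).splitlines(),
    )
```

Note that the compact branch depends on the standard library's `pprint` output, which is exactly why the result would no longer be stable across Python versions.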

Member

Wouldn't that complicate the code quite a bit? If not I would go for it.

Contributor Author

What I would have in mind is something like 0850f34

I haven't pushed it yet to this branch as I am not convinced about it for the following reasons:

  • We lose the ability of having a stable diff across python versions (since the pprint code in the standard library changes)
  • Older versions of python thus don't get some of the improvements (e.g. python3.8 and dataclasses)
  • The diff looks nicer for simple cases (a same length value changed), but looks worse for cases where something is missing, see here

As such, I personally lean towards keeping the diff consistent, even if it always spans multiple lines. I'm happy to push this commit here if you prefer. In the end, the really hard-to-read diffs are the longer ones :)

Member

Generally I'm a fan of using the multi-line format even for short cases. This is what I do in my own code for git diffs as well. IMO, let's keep it, and if it ends up being weird or people complain, we can think of using a more compact format when it would fit in a single line.

Member

@bluetech left a comment

I left a few comments, please take a look.

Some things for possible follow up:

  • There are some small mistakes in the initial typing I didn't notice before, but can be fixed separately.

  • I wonder if sort_dicts=True is still the right choice these days, when dicts are ordered. This is also a separate discussion.

  • The context can be simplified to a set if I'm not mistaken; it seems to currently be an int -> 1 dict for legacy reasons (probably from before set existed...).

  • The readable stuff is now unused, we can perhaps drop it to simplify the code.

  • Since compact is now ignored, I think we ought to drop this parameter.
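The context simplification mentioned in the list above can be sketched as follows. This is an illustration only, not the vendored code: `safe_repr` is a hypothetical helper that handles lists alone, and it tracks the ids of the containers currently being formatted in a set rather than in an `id -> 1` dict.

```python
def safe_repr(obj, _context=None):
    """Format nested lists, guarding against self-referencing containers."""
    # The context only needs membership tests, so a set is sufficient.
    context = set() if _context is None else _context
    if not isinstance(obj, list):
        return repr(obj)

    obj_id = id(obj)
    if obj_id in context:
        # We are already inside this very object: stop the recursion.
        return f"<Recursion on list with id={obj_id}>"

    context.add(obj_id)
    try:
        inner = ", ".join(safe_repr(item, context) for item in obj)
    finally:
        # Remove the id again so shared (non-recursive) objects still print.
        context.discard(obj_id)
    return f"[{inner}]"
```

Removing the id on the way out matters: the same list appearing twice side by side is not a cycle and should be printed both times.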

Some ruminations on pprint after reading its code:

  • The pprint code has each format function care about the global indentation level. Intuitively if I were to design a pretty-printer I would have each formatter not care about the existing indentation level, i.e. format as if it's the top-level and only care about max width, and the machinery will insert the nesting indentation itself.

    I wonder if there's a reason why it wasn't done, maybe performance?

  • I wonder why Python went with the _dispatch dict instead of an extensible __pprint__ protocol, which would allow each type to handle its own pretty-printing instead of having it all in the pprint module. Maybe the protocol would be too complex.
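One way to picture the first point, as a rough sketch under stated assumptions rather than a proposal for the vendored module: each formatter formats as if it were at the top level and only receives a width budget, while the surrounding machinery adds the nesting indentation afterwards. `format_top_level` and `INDENT` are hypothetical names, and only lists are handled.

```python
import textwrap

INDENT = 4  # assumed indentation per nesting level


def format_top_level(obj, width):
    """Format obj without knowing how deeply it is nested."""
    flat = repr(obj)
    if not isinstance(obj, list) or not obj or len(flat) <= width:
        return flat

    # Children get a smaller budget instead of a larger indentation:
    # INDENT columns for the nesting, one column for the trailing comma.
    child_width = width - INDENT - 1
    parts = [format_top_level(item, child_width) + "," for item in obj]

    # The machinery, not the child formatter, inserts the indentation.
    body = textwrap.indent("\n".join(parts), " " * INDENT)
    return f"[\n{body}\n]"


print(format_top_level([[1, 2], [3, 4]], width=12))
```

With this shape, a formatter never sees an indentation level; it only sees how many columns it is allowed to use.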

context,
level,
)
self._pprint_dict(object, stream, indent, allowance, context, level)
Member

We should not change the traditional OrderedDict formatting, it's what people know/expect, I think.

With pprint_dict it's:

OrderedDict({
    'hello': 100,
})

but should be

OrderedDict([
    ('hello', 100),
])

last = False
while not last:
ent = next_ent
while True:
Member

This can be a for loop now

""",
id="defaultdict-two-items",
),
pytest.param(
Counter(),
"Counter()",
"Counter({})",
Member

Would add a special case to keep the previous formatting here.

"? ^",
"+ 2,",
"+ 3,",
" )",
]
Member

I think it would be nice to see another test here where the lists have some commonality, e.g. [1, 2, 3] vs. [1, 20, 3]

Contributor Author

Good point, added.
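To see why a case with some commonality is a useful test, here is a small sketch using `difflib.ndiff` on a one-item-per-line rendering. `one_item_per_line` and `full_diff` are hypothetical helpers that only mimic the new layout for flat lists; they are not pytest's implementation.

```python
import difflib


def one_item_per_line(items):
    # Mimics the indented layout: brackets on their own lines, one item
    # per line, with a trailing comma after every item.
    return ["[", *(f"    {item!r}," for item in items), "]"]


def full_diff(left, right):
    diff = difflib.ndiff(one_item_per_line(left), one_item_per_line(right))
    # ndiff appends a newline to its "?" hint lines; strip it for display.
    return [line.rstrip("\n") for line in diff]


for line in full_diff([1, 2, 3], [1, 20, 3]):
    print(line)
```

Only the line holding the changed item is marked with `-` and `+`; the lines for `1` and `3` are emitted as unchanged context.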

Contributor Author

@BenjaminSchubert left a comment

@bluetech thanks for the review! I addressed the comments with the last commit.

Some things for possible follow up:

You read my mind. I've got https://github.com/BenjaminSchubert/pytest/tree/bschubert/pprint-cleanup lined up with a lot of the requested clean ups already :)

The pprint code has each format function care about the global indentation level. Intuitively if I were to design a pretty-printer I would have each formatter not care about the existing indentation level, i.e. format as if it's the top-level and only care about max width, and the machinery will insert the nesting indentation itself.

I don't think that's easy to do, as the max width depends on the indentation (the further you go, the less space you have).

However, I think we can simplify this, as we don't need to have both levels and indentations anymore with consistent indentation per level. This is a follow-up cleanup I intend to do.

@bluetech
Member

Regarding the OrderedDict...

The new pformat wasn't looking good to me, with each key-value pair being spread over 4 lines instead of 1. So I was going to suggest adding complexity to fix it, but then I thought: since dicts are now ordered, the OrderedDict([('key', 'value')]) format is really no better than the OrderedDict({"key": "value"}) format anymore, so maybe we can suggest that Python change it, and then we don't have to complicate our code :) But it turns out someone already did (python/cpython#101446), and it was actually accepted and implemented in Python 3.12 🎉. So let's undo this last change :)
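The repr change can be checked directly on whichever interpreter is at hand. The snippet below is a small sketch; `repr_style` is a hypothetical helper that only inspects the built-in repr of `OrderedDict` and does not call pprint.

```python
import sys
from collections import OrderedDict


def repr_style(ordered):
    """Classify the interpreter's OrderedDict repr."""
    if repr(ordered).startswith("OrderedDict({"):
        return "dict-literal"  # OrderedDict({'hello': 100})
    return "list-of-pairs"  # OrderedDict([('hello', 100)])


style = repr_style(OrderedDict(hello=100))
print(sys.version_info[:2], style)
```

On Python 3.12 and later this should report the dict-literal style, and the list-of-pairs style on earlier versions.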

I think it would also be nice to update upstream pprint to use the new OrderedDict repr, in case you're also interested in contributing to cpython.

BTW, another interesting issue I found: python/cpython#51683

I don't think that's easy to do, as the max width depends on the indentation (the further you go, the less space you have).

I was thinking that instead of increasing the indentation, you would decrease the max size.

@BenjaminSchubert
Contributor Author

But it turns out someone already did (python/cpython#101446), and it was actually accepted and implemented in Python 3.12 🎉. So let's undo this last change :)

Oh fun, this means they updated the OrderedDict repr but not the pprint module. Change undone.

I was thinking that instead of increasing the indentation, you would decrease the max size.

Interesting idea, I'll see if I can find a nice way of doing this

I also missed one part previously:

I wonder if sort_dicts=True is still the right choice these days, when dicts are ordered. This is also a separate discussion.

I think it's still better to sort them by key, as in most cases I would expect the content to be what's important, not the order (in which case OrderedDict could be used).
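A short illustration of that argument using the standard library's `pprint.pformat` and its `sort_dicts` parameter (available since Python 3.8). The two dicts are made-up sample data; they compare equal but were built in different insertion orders.

```python
import pprint

left = {"b": 1, "a": 2}
right = {"a": 2, "b": 1}

# The dicts are equal, so an equality assertion on them would pass.
assert left == right

# With sorting, both sides render identically, so a diff shows nothing.
sorted_left = pprint.pformat(left, sort_dicts=True)
sorted_right = pprint.pformat(right, sort_dicts=True)

# Without sorting, insertion order leaks into the output and would show
# up as a difference even though the contents are the same.
unsorted_left = pprint.pformat(left, sort_dicts=False)
unsorted_right = pprint.pformat(right, sort_dicts=False)

print(sorted_left == sorted_right)
print(unsorted_left == unsorted_right)
```

Sorting keeps the rendered text focused on content, which matches how dict equality itself ignores insertion order.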

@bluetech bluetech merged commit 2d1710e into pytest-dev:main Nov 27, 2023
22 checks passed
@bluetech
Member

Thanks @BenjaminSchubert!

Development

Successfully merging this pull request may close these issues.

Better diff for asserts on dicts
3 participants