-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shorter repr for attributes #1322
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs more comprehensive unit tests, to verify all the clever logic for truncating and reformatting multi-line attribute strings. Right now we don't really exercise that logic at all in the test suite. I would probably put these in test_formatting.py
.
xarray/core/formatting.py
Outdated
empty_line = output.find(u'\n\n') | ||
if empty_line != -1: | ||
output = output[:empty_line] + u'...' | ||
return maybe_truncate(output.replace(u'\n', u' '), 79) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use OPTIONS['display_width']
instead of 79 here.
xarray/core/formatting.py
Outdated
key = u' %s:' % unicode_type(key) | ||
if col_width is not None: | ||
key = pretty_print(key, col_width) | ||
output = u'%s %s' % (key, unicode_type(value).replace(u'\t', u' ')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should only do all this text formatting logic if the attribute is already a string? I can imagine this doing strange things if the attribute is a number or array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree - it's just adjusting the spacing and replacing tabs and newlines before printing them. While I don't think numbers or array representations would have spaces (and arrays with newlines are rare), surely we want to process them for display in the same way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If somebody is putting non-strings in attributes (which they almost certainly are, as ill-advised as that may be), there is some value in having them appear in a noticeably distinct way. I would rather have non-strings look a little funky and span multiple lines than be confused with the repr
of the objects.
e.g., consider how the repr of an xarray.Dataset
looks:
In [16]: repr(xarray.Dataset({'foo': ('x', [1, 2, 3])}))[:70] + '...'
Out[16]: '<xarray.Dataset>\nDimensions: (x: 3)\nDimensions without coordinates: x...'
Though, it might possibly make sense to "indent" their reprs on any newlines to align them with the indentation of first line.
@fmaussion @pwolfram any opinions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that indentation would help clarify. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation would certainly be an improvement, but I still strongly prefer to limit each attribute to a single line.
My motivation for this pull request is seeing a common pattern in notebooks dealing with NetCDF data of simply discarding the attrs
metadata, because (a) the formatting is a mess and (b) it's a lot of output. Taking the example in #1319, this is a temporary file with "only" 29 attributes, of which eight are long enough to truncate and three include newlines. This is a short example. With larger workflows, this pattern destroys provenance and reproducibility - along with directly useful information!
If users need the unmodified attributes or are unsure about types, they can access them directly - so I would really prefer to keep the repr to a single line for each, no matter what.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the examples you've seen, how often do these datasets with long lists of attributes include non-string attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather, non-string attributes that have a repr that spans multiple lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not uncommon to find numbers or timestamps in the attributes, but it would surprise me to find something with a multiline repr. I'd still want to truncate it to one line in that case, but that's more of a personal preference I guess.
NetCDF files often have tens of attributes, including multi-paragraph summaries or the full modification history of the file. It's great to have this available in the .attrs, but we can truncate it substantially in the repr! Hopefully this will stop people writing `data.attrs = {}` and discarding metadata in interactive workflows for the sake of cleaner output.
If the logic is "clever", time to ditch it. I've rebased to a clearer version which simply handles indentation and replaces tabs and newlines with their backslash representations. What properties of this would you want to test? |
Ping @shoyer |
"Clever" was possibly too strong of a word. I think both your original and new proposal are fine -- use your own best taste on which version is most readable. It's more that we need unit tests for any functionality. Otherwise, things tend to brake inadvertently when someone else does code cleanup months or years later.
This could be super simple here, e.g., just a few assert statements verifying the text substitutions and truncation. Ideally we exercise every line of logic in your code. |
@shoyer - my preference is to process and truncate all strings, including object reprs. If you want to overrule that feel free; otherwise I think it's ready to merge. |
@Zac-HD My guess is that non-string multi-line attributes are quite rare, so in truth it doesn't really matter what we pick (even though my preference leaned the other way from yours). If someone cares enough to complain we can revisit this, but until then I'm OK with your choice. |
Thanks for the contribution! |
NetCDF files often have tens of attributes, including multi-paragraph summaries or the full modification history of the file. It's great to have this available in the .attrs, but we can truncate it substantially
in the repr! Hopefully this will stop people writing
data.attrs = {}
and discarding metadata in interactive workflows for the sake of cleaner output.git diff upstream/master | flake8 --diff