Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shorter repr for attributes #1322

Merged
merged 2 commits into from
Apr 3, 2017
Merged

Shorter repr for attributes #1322

merged 2 commits into from
Apr 3, 2017

Conversation

Zac-HD
Copy link
Contributor

@Zac-HD Zac-HD commented Mar 24, 2017

NetCDF files often have tens of attributes, including multi-paragraph summaries or the full modification history of the file. It's great to have this available in the .attrs, but we can truncate it substantially
in the repr! Hopefully this will stop people writing data.attrs = {} and discarding metadata in interactive workflows for the sake of cleaner output.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more comprehensive unit tests, to verify all the clever logic for truncating and reformatting multi-line attribute strings. Right now we don't really exercise that logic at all in the test suite. I would probably put these in test_formatting.py.

empty_line = output.find(u'\n\n')
if empty_line != -1:
output = output[:empty_line] + u'...'
return maybe_truncate(output.replace(u'\n', u' '), 79)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use OPTIONS['display_width'] instead of 79 here.

key = u' %s:' % unicode_type(key)
if col_width is not None:
key = pretty_print(key, col_width)
output = u'%s %s' % (key, unicode_type(value).replace(u'\t', u' '))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should only do all this text formatting logic if the attribute is already a string? I can imagine this doing strange things if the attribute is a number or array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree - it's just adjusting the spacing and replacing tabs and newlines before printing them. While I don't think numbers or array representations would have spaces (and arrays with newlines are rare), surely we want to process them for display in the same way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If somebody is putting non-strings in attributes (which they almost certainly are, as ill-advised as that may be), there is some value in having them appear in a noticeably distinct way. I would rather have non-strings look a little funky and span multiple lines than be confused with the repr of the objects.

e.g., consider how the repr of an xarray.Dataset looks:

In [16]: repr(xarray.Dataset({'foo': ('x', [1, 2, 3])}))[:70] + '...'
Out[16]: '<xarray.Dataset>\nDimensions:  (x: 3)\nDimensions without coordinates: x...'

Though, it might possibly make sense to "indent" their reprs on any newlines to align them with the indentation of first line.

@fmaussion @pwolfram any opinions here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that indentation would help clarify. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation would certainly be an improvement, but I still strongly prefer to limit each attribute to a single line.

My motivation for this pull request is seeing a common pattern in notebooks dealing with NetCDF data of simply discarding the attrs metadata, because (a) the formatting is a mess and (b) it's a lot of output. Taking the example in #1319, this is a temporary file with "only" 29 attributes, of which eight are long enough to truncate and three include newlines. This is a short example. With larger workflows, this pattern destroys provenance and reproducibility - along with directly useful information!

If users need the unmodified attributes or are unsure about types, they can access them directly - so I would really prefer to keep the repr to a single line for each, no matter what.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the examples you've seen, how often do these datasets with long lists of attributes include non-string attributes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather, non-string attributes that have a repr that spans multiple lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not uncommon to find numbers or timestamps in the attributes, but it would surprise me to find something with a multiline repr. I'd still want to truncate it to one line in that case, but that's more of a personal preference I guess.

NetCDF files often have tens of attributes, including multi-paragraph
summaries or the full modification history of the file.  It's great to
have this available in the .attrs, but we can truncate it substantially
in the repr!  Hopefully this will stop people writing `data.attrs = {}`
and discarding metadata in interactive workflows for the sake of cleaner
output.
@Zac-HD
Copy link
Contributor Author

Zac-HD commented Mar 25, 2017

If the logic is "clever", time to ditch it. I've rebased to a clearer version which simply handles indentation and replaces tabs and newlines with their backslash representations.

What properties of this would you want to test?

@Zac-HD
Copy link
Contributor Author

Zac-HD commented Mar 27, 2017

Ping @shoyer

@shoyer
Copy link
Member

shoyer commented Mar 27, 2017

If the logic is "clever", time to ditch it. I've rebased to a clearer version which simply handles indentation and replaces tabs and newlines with their backslash representations.

"Clever" was possibly too strong of a word. I think both your original and new proposal are fine -- use your own best taste on which version is most readable.

It's more that we need unit tests for any functionality. Otherwise, things tend to brake inadvertently when someone else does code cleanup months or years later.

What properties of this would you want to test?

This could be super simple here, e.g., just a few assert statements verifying the text substitutions and truncation. Ideally we exercise every line of logic in your code.

@Zac-HD
Copy link
Contributor Author

Zac-HD commented Mar 30, 2017

@shoyer - my preference is to process and truncate all strings, including object reprs. If you want to overrule that feel free; otherwise I think it's ready to merge.

@shoyer shoyer merged commit d08efaf into pydata:master Apr 3, 2017
@shoyer
Copy link
Member

shoyer commented Apr 3, 2017

@Zac-HD My guess is that non-string multi-line attributes are quite rare, so in truth it doesn't really matter what we pick (even though my preference leaned the other way from yours). If someone cares enough to complain we can revisit this, but until then I'm OK with your choice.

@shoyer
Copy link
Member

shoyer commented Apr 3, 2017

Thanks for the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Truncate long lines in repr of Dataset.attrs
3 participants