ARROW-17449: [Python] Better repr for Buffer, MemoryPool, NativeFile and Codec #13921

Merged
merged 10 commits into from
Aug 29, 2022

Conversation

milesgranger
Contributor

@milesgranger milesgranger commented Aug 19, 2022

Example:

In [1]: import io
In [2]: import pyarrow as pa

In [3]: pa.PythonFile(io.BytesIO())
Out[3]: <pyarrow.PythonFile closed=False own_file=False is_seekable=False is_writable=True is_readable=False>

In [4]: pa.Codec('gzip')
Out[4]: <pyarrow.Codec name=gzip compression_level=9>

In [5]: pool = pa.default_memory_pool()
In [6]: pool
Out[6]: <pyarrow.MemoryPool backend_name=jemalloc bytes_allocated=0 max_memory=0>

In [7]: pa.allocate_buffer(1024, memory_pool=pool)
Out[7]: <pyarrow.Buffer address=0x7fd660a08000 size=1024 is_cpu=True is_mutable=True>

@lidavidm
Member

Just passing by, but I think it'd be good to have the address in Buffer's repr (perhaps in hex as well)

@milesgranger
Contributor Author

Something like this: pyarrow.lib.Buffer(0x7fc0b46a5e30, size=1024, is_cpu=True, is_mutable=True)? Or
like how arrays are presented, with the default repr followed by the pretty repr?

<pyarrow.lib.Buffer at 0x7fea5e51e7b0>
pyarrow.lib.Buffer(size=1024, is_cpu=True, is_mutable=True)

@lidavidm
Member

I would probably vote for the former (I'd guess arrays only do that because their repr() can get long)

python/pyarrow/io.pxi (outdated, resolved)
@milesgranger
Contributor Author

@pitrou what do you think?
I believe the error in docs testing is unrelated. (seems to run fine locally anyway)

Member

@pitrou pitrou left a comment

Thanks for doing this. I agree the CI failure looked unrelated (I've restarted the job).

@@ -2013,7 +2013,7 @@ cdef class RecordBatch(_PandasConvertible):
 >>> batch = pa.RecordBatch.from_arrays([n_legs, animals],
 ... names=["n_legs", "animals"])
 >>> batch.serialize()
-<pyarrow.lib.Buffer object at ...>
+pyarrow.lib.Buffer(address=..., size=..., is_cpu=True, is_mutable=True)
Member

Perhaps:

Suggested change
-pyarrow.lib.Buffer(address=..., size=..., is_cpu=True, is_mutable=True)
+pyarrow.lib.Buffer(address=0x..., size=..., is_cpu=True, is_mutable=True)
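
For context, here is a minimal sketch of how an ellipsis-based doctest can match a repr whose address and size differ between runs. `FakeBuffer` and `make_buffer` are hypothetical stand-ins, not pyarrow objects, and the `ELLIPSIS` flag is passed explicitly on the assumption that the project's doctest setup enables it.

```python
import doctest


class FakeBuffer:
    """Hypothetical stand-in for a buffer-like object (not pyarrow.Buffer)."""

    def __init__(self, size):
        self.size = size

    def __repr__(self):
        # id() stands in for a real memory address; it changes between runs.
        return (f"<FakeBuffer address={hex(id(self))} size={self.size} "
                f"is_cpu=True is_mutable=True>")


def make_buffer(size):
    """Return a FakeBuffer of the given size.

    >>> make_buffer(1024)
    <FakeBuffer address=0x... size=... is_cpu=True is_mutable=True>
    """
    return FakeBuffer(size)


def run_doctest():
    """Run the docstring example above with ELLIPSIS matching enabled."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        make_buffer.__doc__, {"make_buffer": make_buffer},
        "make_buffer", None, 0)
    runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS, verbose=False)
    # run() returns a TestResults tuple with .failed and .attempted counts.
    return runner.run(test)
```

Keeping the literal `0x` prefix in the expected output means the doctest still checks that the address is rendered in hex, while the `...` absorbs the run-specific digits.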

return (f"{name}("
        f"backend_name={self.backend_name}, "
        f"bytes_allocated={self.bytes_allocated()}, "
        f"max_memory={self.max_memory()})")
Member

Can we test this repr somewhere? (either a doctest or a pytest unit test)
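
One possible shape for such a test, as a rough sketch only: `FakePool` is a hypothetical stand-in for a memory pool (it is not pyarrow's `MemoryPool`), and the repr format follows the example in the PR description. The test matches the structure with a regular expression so that values which vary by platform or backend do not make it brittle.

```python
import re


class FakePool:
    """Hypothetical stand-in for a memory pool; not pyarrow.MemoryPool."""

    backend_name = "system"

    def __init__(self):
        self._allocated = 0
        self._peak = 0

    def allocate(self, nbytes):
        self._allocated += nbytes
        self._peak = max(self._peak, self._allocated)

    def bytes_allocated(self):
        return self._allocated

    def max_memory(self):
        return self._peak

    def __repr__(self):
        return (f"<FakePool backend_name={self.backend_name} "
                f"bytes_allocated={self.bytes_allocated()} "
                f"max_memory={self.max_memory()}>")


def test_pool_repr():
    pool = FakePool()
    pool.allocate(64)
    # Match the overall structure; pin exact numbers only where they are known.
    pattern = r"<FakePool backend_name=\w+ bytes_allocated=\d+ max_memory=\d+>"
    assert re.fullmatch(pattern, repr(pool))
    assert "bytes_allocated=64" in repr(pool)
```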

name = f"{self.__class__.__module__}.{self.__class__.__name__}"
return (f"{name}("
        f"name={self.name}, "
        f"compression_level={self.compression_level})")
Member

Add a test for this?

f"address={hex(self.address)}, "
f"size={self.size}, "
f"is_cpu={self.is_cpu}, "
f"is_mutable={self.is_mutable})")
Member

So, just for the record, this makes it look like the Buffer constructor is callable with these arguments (which it is not).
We could instead go for: <pyarrow.Buffer address=0x...>

@jorisvandenbossche What do you think?

Member

+1 on keeping the < .. > (instead of ()) to not confuse it with an eval-able repr
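
To illustrate the distinction, here is a small sketch using only the standard library. `Handle` and `is_evalable` are hypothetical names introduced for this example. A constructor-style repr such as `Fraction(1, 3)` can be passed to `eval()` to rebuild the object, whereas an angle-bracket repr is not valid Python and so cannot be mistaken for a constructor call.

```python
from fractions import Fraction


class Handle:
    """Hypothetical object that cannot be rebuilt from its repr."""

    def __init__(self, size):
        self.size = size

    def __repr__(self):
        # Angle brackets signal "description only", not constructor syntax.
        return f"<Handle address={hex(id(self))} size={self.size}>"


def is_evalable(obj, namespace):
    """Return True if eval(repr(obj)) rebuilds an object equal to obj."""
    try:
        return eval(repr(obj), namespace) == obj
    except Exception:
        # A SyntaxError here means the repr is not a Python expression.
        return False
```

With these definitions, `is_evalable(Fraction(1, 3), {"Fraction": Fraction})` is true, while `is_evalable(Handle(4), {"Handle": Handle})` is false because the angle-bracket text fails to parse.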

-return frombytes(self.unwrap().compression_level())
+if self.name == 'snappy':
+    return None
+return self.unwrap().compression_level()
Member

Should we add a test for this? (I assume this was raising for Snappy?)

Contributor Author

@milesgranger milesgranger Aug 23, 2022

Good point, it was failing as-is: compression_level() returns an int, and frombytes would fail trying to decode an int. I also modified the Snappy variant, since Snappy has no compression level and would give invalid integers.
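
The failure mode described here can be reproduced in plain Python. This is only a sketch: `frombytes` below is a simplified stand-in that assumes the real helper decodes bytes to `str`, and `FakeCodec` is a hypothetical class, not pyarrow's `Codec`.

```python
def frombytes(data):
    """Simplified stand-in for a bytes-to-str helper (an assumption)."""
    return data.decode("utf8")


class FakeCodec:
    """Hypothetical codec wrapper; not pyarrow.Codec."""

    def __init__(self, name, level=None):
        self.name = name
        self._level = level

    @property
    def compression_level_buggy(self):
        # Decoding an int fails: ints have no .decode() method.
        return frombytes(self._level)

    @property
    def compression_level(self):
        # A codec with no notion of a level reports None.
        if self.name == "snappy":
            return None
        return self._level
```

Accessing `FakeCodec("gzip", 9).compression_level_buggy` raises `AttributeError`, while the corrected property returns the integer level, or `None` for the Snappy case.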

@@ -121,6 +121,14 @@ cdef class NativeFile(_Weakrefable):
     def __exit__(self, exc_type, exc_value, tb):
         self.close()

+    def __repr__(self):
+        name = f"{self.__class__.__module__}.{self.__class__.__name__}"
Member

@jorisvandenbossche jorisvandenbossche Aug 23, 2022

In general those objects are exposed in the top-level pyarrow namespace, so I would maybe hardcode that here instead of using __module__ which gives pyarrow.lib (the lib submodule is also considered private)

(and same for Buffer and others)
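
A small sketch of the difference, using a hypothetical package name `mypkg` and a hypothetical `Buffer` class whose `__module__` is set by hand to mimic a class defined in a private `lib` submodule:

```python
class Buffer:
    """Hypothetical class that is re-exported from a package's top level."""

    def __repr__(self):
        # Hardcode the public namespace instead of relying on __module__.
        return f"<mypkg.{self.__class__.__name__} size=0>"


# Simulate the class being defined in a private submodule (assumption).
Buffer.__module__ = "mypkg.lib"


def qualified_name(obj):
    """Build the name the way a __module__-based repr would."""
    cls = obj.__class__
    return f"{cls.__module__}.{cls.__name__}"
```

Here `qualified_name(Buffer())` gives `mypkg.lib.Buffer`, exposing the private submodule, whereas the hardcoded repr shows `<mypkg.Buffer size=0>`, which matches how users import the class.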

f"own_file={self.own_file}, "
f"is_seekable={self.is_seekable}, "
f"is_writable={self.is_writable}, "
f"is_readable={self.is_readable})")
Member

Would it be useful to add whether it is closed or not?

Contributor Author

Good idea. 889881e

@pitrou
Member

pitrou commented Aug 24, 2022

@milesgranger Please don't hesitate to ping when you're finished.

@milesgranger
Contributor Author

Apologies, missed the lint failing. Then this should do it. 🤞

Member

@pitrou pitrou left a comment

LGTM, thanks @milesgranger

@pitrou
Member

pitrou commented Aug 24, 2022

@jorisvandenbossche Do you want to take another look?

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good!

@jorisvandenbossche jorisvandenbossche merged commit 6f302a3 into apache:master Aug 29, 2022
@milesgranger milesgranger deleted the ARROW-17449_better-reprs branch August 29, 2022 10:29
@ursabot

ursabot commented Aug 29, 2022

Benchmark runs are scheduled for baseline = bd76850 and contender = 6f302a3. 6f302a3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️1.1% ⬆️0.27%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 6f302a30 ec2-t3-xlarge-us-east-2
[Failed] 6f302a30 test-mac-arm
[Failed] 6f302a30 ursa-i9-9960x
[Finished] 6f302a30 ursa-thinkcentre-m75q
[Finished] bd768506 ec2-t3-xlarge-us-east-2
[Failed] bd768506 test-mac-arm
[Failed] bd768506 ursa-i9-9960x
[Finished] bd768506 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Aug 29, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

anjakefala pushed a commit to anjakefala/arrow that referenced this pull request Aug 31, 2022
…and Codec (apache#13921)

Example:
```python
In [1]: import io
In [2]: import pyarrow as pa

In [3]: pa.PythonFile(io.BytesIO())
Out[3]: <pyarrow.PythonFile closed=False own_file=False is_seekable=False is_writable=True is_readable=False>

In [4]: pa.Codec('gzip')
Out[4]: <pyarrow.Codec name=gzip compression_level=9>

In [5]: pool = pa.default_memory_pool()
In [6]: pool
Out[6]: <pyarrow.MemoryPool backend_name=jemalloc bytes_allocated=0 max_memory=0>

In [7]: pa.allocate_buffer(1024, memory_pool=pool)
Out[7]: <pyarrow.Buffer address=0x7fd660a08000 size=1024 is_cpu=True is_mutable=True>
```

Authored-by: Miles Granger <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
5 participants