-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pyarrow: Check compatibility of pyarrow.array with string type #2933
Conversation
Ensure that pyarrow.array objects with string type can be read by pygmt.text.
Convert a pyarrow.StringArray via a Python list to a ctypes array in the strings_to_ctypes_array function. Updated docstrings and type hints in `clib.Session.put_strings` method and `clib.conversion.strings_to_ctypes_array` function. Added two parametrized unit tests to ensure that pyarrow.StringArray can be passed into the clib methods.
pygmt/clib/conversion.py
Outdated
@@ -263,14 +267,15 @@ def sequence_to_ctypes_array( | |||
return (ctype * size)(*sequence) | |||
|
|||
|
|||
def strings_to_ctypes_array(strings: Sequence[str]) -> ctp.Array: | |||
def strings_to_ctypes_array(strings: Sequence[str] | pa.StringArray) -> ctp.Array: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to figure out how to fix the mypy
type errors for an optional import 🤔
pygmt/clib/conversion.py:298: error: Item "Sequence[str]" of "Sequence[str] | Any" has no attribute "to_pylist" [union-attr]
pygmt/clib/session.py:1004: error: Item "Sequence[str]" of "Sequence[str] | Any" has no attribute "dtype" [union-attr]
Fixes `AttributeError: 'NoneType' object has no attribute 'string'`
Also improve docstring of strings_to_ctypes_array function to mention np.ndarray as supported input.
So that the pyarrow.StringArray type hint will show up on Readthedocs and the PyGMT docs page.
CodSpeed Performance ReportMerging #2933 will not alter performanceComparing Summary
Benchmarks breakdown
|
Faster for longer string arrays. Co-authored-by: Dongdong Tian <[email protected]>
Still need to work on Duration and GeoArrow geometry types.
Accidentally removed from `test_virtualfile_from_vectors_one_string_or_object_column` during merge conflict handling.
With PR #3619 and #3608 merged, pyarrow arrays with string type now can be correctly converted to numpy string array. So, we don't need to add support of pyarrow arrays in the low-level functions like |
pygmt/tests/test_clib_put_strings.py
Outdated
@@ -8,12 +8,31 @@ | |||
from pygmt import clib | |||
from pygmt.exceptions import GMTCLibError | |||
from pygmt.helpers import GMTTempFile | |||
from pygmt.helpers.testing import skip_if_no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need to revert changes in this PR, since the low-level function Session.put_strings
only accepts numpy.str_
and everything else should be converted to numpy.str_
by the _to_numpy
function before calling put_strings
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, just to document things, the way we are intending pyarrow.string inputs to go through high-level modules like fig.text
is:
pygmt.Figure.text
->clib.Session.virtualfile_in
->clib.Session.virtualfile_from_vectors
- In
clib.Session.virtualfile_from_vectors
:- First call
clib.conversion.vectors_to_arrays
->clib.conversion._to_numpy
->np.ascontiguous_array
to convertpyarrow.string
tonp.str_
- With a
list[np.str_]
, this is then passed toclib.Session.put_strings
- First call
For the record, the original intention in this PR, mostly in commit d379e46, was to let clib.Session.put_strings
handle both np.str_
and pyarrow.string
.
d92ca80
to
7dc353b
Compare
@@ -137,3 +143,37 @@ def test_open_virtual_file(): | |||
bounds = "\t".join([f"<{col.min():.0f}/{col.max():.0f}>" for col in data.T]) | |||
expected = f"<matrix memory>: N = {shape[0]}\t{bounds}\n" | |||
assert output == expected | |||
|
|||
|
|||
@pytest.mark.benchmark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was moved to test_clib_virtualfile_from_vectors
in #3512.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved in 6ad6eb9
pytest.param( | ||
getattr(pa, "array", None), | ||
{}, # pa.string() | ||
marks=skip_if_no(package="pyarrow"), | ||
id="pyarrow", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible solution to avoid the complicated param is:
try:
import pyarrow as pa
pa_array = pa.array
except ImportError:
pa_array = None
@pytest.mark.parametrize(
("array_func", "dtype"),
[
pytest.param(np.array, {"dtype": np.str_}, id="str"),
pytest.param(np.array, {"dtype": np.object_}, id="object"),
pytest.param(pa_array, {}, marks=skip_if_no(package="pyarrow"), id="pyarrow"),
]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-Authored-By: Dongdong Tian <[email protected]>
85d2bb5
to
8172102
Compare
It's likely that pytest-codspeed v3.0 breaks our Benchmark workflow. For comparison, see the CI job on the main branch yesterday (https://github.com/GenericMappingTools/pygmt/actions/runs/11830721578/job/32964716012) and this PR (https://github.com/GenericMappingTools/pygmt/actions/runs/11849193208/job/33022039738?pr=2933). Edit: Will explore the issue in a separate PR. |
Description of proposed changes
Check that
pyarrow.array
objects of string type can be passed to PyGMT modules/functions.This PR also adds type hints to the docstring of the
pygmt.Figure.text
'stext
parameter.Preview at https://pygmt-dev--2933.org.readthedocs.build/en/2933/api/generated/pygmt.Figure.text.html
TODO:
pyarrow.array
by adding parametrized tests topygmt.text
test_clib_virtualfile_from_vectors.py
pa.StringArray
can be passed intopygmt.Figure.text
Addresses #2800 (comment), part of #2800.
Reminders
make format
andmake check
to make sure the code follows the style guide.doc/api/index.rst
.Slash Commands
You can write slash commands (
/command
) in the first line of a comment to performspecific operations. Supported slash commands are:
/format
: automatically format and lint the code/test-gmt-dev
: run full tests on the latest GMT development version