**Breaking**: data_kind: Now 'matrix' represents a 2-D numpy array and unrecognized data types fall back to 'vectors' #3351

Merged
merged 21 commits on Oct 16, 2024
Changes from 13 commits
Commits
0c82b3c
data_kind: Refactor the if-else statements into if-return statements
seisman Oct 3, 2024
808755d
data_kind: Now 'matrix' represents a 2-D numpy array and unrecognized …
seisman Oct 3, 2024
0eb4f8f
Make 'data' a required parameter
seisman Oct 3, 2024
9891b2c
Fix x2sys_cross as pd.DataFrame is 'vectors' kind now
seisman Oct 3, 2024
a9d094c
Fix legend as now 'vectors' doesn't mean data is None
seisman Oct 3, 2024
3d8be4d
Add docstrings for stringio
seisman Oct 3, 2024
7c104a9
Merge branch 'data_kind/return' into refactor/data_kind
seisman Oct 4, 2024
5790923
Merge branch 'main' into refactor/data_kind
seisman Oct 7, 2024
6954c5d
Fix docstrings
seisman Oct 7, 2024
9300ca3
Merge branch 'main' into refactor/data_kind
seisman Oct 7, 2024
2701a4a
Merge branch 'main' into refactor/data_kind
seisman Oct 8, 2024
7fcf57f
Merge branch 'main' into refactor/data_kind
seisman Oct 10, 2024
991f688
Merge branch 'main' into refactor/data_kind
seisman Oct 11, 2024
51569c8
Update pygmt/clib/session.py
seisman Oct 13, 2024
4a4f192
2-D array-like that implements '__array_interface__' is matrix
seisman Oct 14, 2024
2a6e788
Merge branch 'main' into refactor/data_kind
seisman Oct 14, 2024
2b054b6
Merge branch 'main' into refactor/data_kind
seisman Oct 15, 2024
cfa32ed
Add a test for passing string dtype matrix
seisman Oct 15, 2024
ef6e6aa
Fix a bug when passing a 2-D matrix to virtualfile_from_vectors
seisman Oct 15, 2024
423e5dc
Should transpose the 2-D matrix before passing to virtualfile_from_ve…
seisman Oct 15, 2024
81e57f7
Merge branch 'main' into refactor/data_kind
seisman Oct 16, 2024
36 changes: 18 additions & 18 deletions pygmt/clib/session.py
@@ -1781,10 +1781,7 @@
             "grid": self.virtualfile_from_grid,
             "image": tempfile_from_image,
             "stringio": self.virtualfile_from_stringio,
-            # Note: virtualfile_from_matrix is not used because a matrix can be
-            # converted to vectors instead, and using vectors allows for better
-            # handling of string type inputs (e.g. for datetime data types)
-            "matrix": self.virtualfile_from_vectors,
+            "matrix": self.virtualfile_from_matrix,
             "vectors": self.virtualfile_from_vectors,
         }[kind]

@@ -1801,29 +1798,32 @@
                 warnings.warn(message=msg, category=RuntimeWarning, stacklevel=2)
             _data = (data,) if not isinstance(data, pathlib.PurePath) else (str(data),)
         elif kind == "vectors":
-            _data = [x, y]
-            if z is not None:
-                _data.append(z)
-            if extra_arrays:
-                _data.extend(extra_arrays)
-        elif kind == "matrix":  # turn 2-D arrays into list of vectors
-            if hasattr(data, "items") and not hasattr(data, "to_frame"):
+            if data is None:
+                # data is None, so data must be given via x/y/z.
+                _data = [x, y]
+                if z is not None:
+                    _data.append(z)
+                if extra_arrays:
+                    _data.extend(extra_arrays)
+            elif hasattr(data, "items") and not hasattr(data, "to_frame"):
                 # pandas.DataFrame or xarray.Dataset types.
                 # pandas.Series will be handled below like a 1-D numpy.ndarray.
                 _data = [array for _, array in data.items()]
-            elif hasattr(data, "ndim") and data.ndim == 2 and data.dtype.kind in "iuf":
-                # Just use virtualfile_from_matrix for 2-D numpy.ndarray
-                # which are signed integer (i), unsigned integer (u) or
-                # floating point (f) types
-                _virtualfile_from = self.virtualfile_from_matrix
-                _data = (data,)
             else:
                 # Python list, tuple, numpy.ndarray, and pandas.Series types
                 _data = np.atleast_2d(np.asanyarray(data).T)
+        elif kind == "matrix":
+            # GMT can only accept a 2-D matrix of signed integer (i), unsigned integer
+            # (u) or floating point (f) type. For other data types, we need to use
+            # virtualfile_from_vectors instead, which turns the matrix into a list of
+            # vectors and allows for better handling of string type inputs (e.g. for
+            # datetime data types).
+            _data = (data,)
+            if data.dtype.kind not in "iuf":
+                _virtualfile_from = self.virtualfile_from_vectors

Codecov (codecov/patch) warning on pygmt/clib/session.py#L1823: added line #L1823 was not covered by tests.
Comment on lines +1828 to +1829
Member

Missing test coverage for these lines?

Member Author

I can add a test for it. Before adding more tests, I'm wondering if we should split the big "test_clib_virtualfiles.py" file (with more than 500 lines) into separate smaller test files, i.e., one test file for each Session method.

  • test_clib_virtualfile_in
  • test_clib_virtualfile_from_vectors
  • test_clib_virtualfile_from_matrix
  • test_clib_open_virtualfile

Member

We've split it before in #2784, so yes, ok to split it again 😆

Member Author

> We've split it before in #2784, so yes, ok to split it again 😆

Done in #3512.

Member Author

I've added a test in cfa32ed to cover this line.


         # Finally create the virtualfile from the data, to be passed into GMT
         file_context = _virtualfile_from(*_data)

         return file_context

     def virtualfile_from_data(
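The routing rule described in the new "matrix" branch above can be sketched in isolation. This is only an illustration under stated assumptions, not pygmt code: `choose_virtualfile_kind` and `matrix_to_vectors` are hypothetical helper names, and the transpose step reflects the later commit title in the list above (423e5dc), which says a 2-D matrix should be transposed before being handed to the vectors path.

```python
import numpy as np


def choose_virtualfile_kind(data):
    """Hypothetical helper: name the converter a 2-D array would be routed to."""
    # Signed integer (i), unsigned integer (u) and floating point (f) dtypes
    # can be passed to GMT directly as a matrix.
    if data.dtype.kind in "iuf":
        return "virtualfile_from_matrix"
    # Anything else (strings, datetimes, ...) is handled column by column.
    return "virtualfile_from_vectors"


def matrix_to_vectors(data):
    """Hypothetical helper: split a 2-D array into a list of 1-D column vectors."""
    # Transposing makes iteration yield columns rather than rows.
    return list(np.asanyarray(data).T)


numeric = np.arange(6, dtype=np.float64).reshape((3, 2))
dates = np.array([["2024-10-03", "2024-10-16"]], dtype="datetime64[D]")

print(choose_virtualfile_kind(numeric))
print(choose_virtualfile_kind(dates))
print([column.tolist() for column in matrix_to_vectors(numeric)])
```

A datetime array has dtype kind "M" and a string array has kind "U", so neither matches "iuf" and both would take the vectors route in this sketch.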
34 changes: 19 additions & 15 deletions pygmt/helpers/utils.py
@@ -14,6 +14,7 @@
 from collections.abc import Iterable, Mapping, Sequence
 from typing import Any, Literal

+import numpy as np
 import xarray as xr
 from pygmt.encodings import charset
 from pygmt.exceptions import GMTInvalidInput
@@ -207,8 +208,11 @@ def data_kind(
     - ``"grid"``: a :class:`xarray.DataArray` object that is not 3-D
     - ``"image"``: a 3-D :class:`xarray.DataArray` object
     - ``"stringio"``: a :class:`io.StringIO` object
-    - ``"matrix"``: anything else that is not ``None``
-    - ``"vectors"``: ``data`` is ``None`` and ``required=True``
+    - ``"matrix"``: a 2-D :class:`numpy.ndarray` object
+    - ``"vectors"``: ``data`` is ``None`` and ``required=True``, or any unrecognized
+      data. Common data types include a :class:`pandas.DataFrame` object, a dictionary
+      with array-like values, a 1-D/3-D :class:`numpy.ndarray` object, or array-like
+      objects.

     Parameters
     ----------
@@ -268,27 +272,27 @@ def data_kind(

     The "matrix" kind:

-    >>> data_kind(data=np.arange(10))  # 1-D numpy.ndarray
-    'matrix'
     >>> data_kind(data=np.arange(10).reshape((5, 2)))  # 2-D numpy.ndarray
     'matrix'
+
+    The "vectors" kind:
+
+    >>> data_kind(data=np.arange(10))  # 1-D numpy.ndarray
+    'vectors'
     >>> data_kind(data=np.arange(60).reshape((3, 4, 5)))  # 3-D numpy.ndarray
-    'matrix'
+    'vectors'
     >>> data_kind(xr.DataArray(np.arange(12), name="x").to_dataset())  # xarray.Dataset
-    'matrix'
+    'vectors'
     >>> data_kind(data=[1, 2, 3])  # 1-D sequence
-    'matrix'
+    'vectors'
     >>> data_kind(data=[[1, 2, 3], [4, 5, 6]])  # sequence of sequences
-    'matrix'
+    'vectors'
     >>> data_kind(data={"x": [1, 2, 3], "y": [4, 5, 6]})  # dictionary
-    'matrix'
+    'vectors'
     >>> data_kind(data=pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}))  # pd.DataFrame
-    'matrix'
+    'vectors'
     >>> data_kind(data=pd.Series([1, 2, 3], name="x"))  # pd.Series
-    'matrix'
-
-    The "vectors" kind:
-
+    'vectors'
     >>> data_kind(data=None)
     'vectors'
     """
@@ -312,7 +316,7 @@ def data_kind(
             # geopandas.GeoDataFrame or shapely.geometry).
             # Reference: https://gist.github.com/sgillies/2217756
             kind = "geojson"
-        case x if x is not None:  # Any not-None is considered as a matrix.
+        case np.ndarray() if data.ndim == 2:  # A 2-D numpy.ndarray object.
             kind = "matrix"
Member

Two things:

  1. Should we extend kind="matrix" to 3-D numpy.ndarray objects?
  2. I'm hesitant to match based on np.ndarray() class only, because while it's not widely advertised, checking for data is not None would have caught other array types implementing the __array_function__ protocol (see NEP18), but checking for isinstance(data, np.ndarray) would exclude those. That said, I'm wondering if the fallback to kind="vectors" might just work, but need to test this out more extensively.

Member Author

> Should we extend kind="matrix" to 3-D numpy.ndarray objects?

It's unclear whether we can pass a 3-D numpy array yet. Even if we can, it means more work, since virtualfile_from_vectors/virtualfile_from_matrix explicitly check whether ndim is 1 or 2. So it's better to revisit 3-D numpy array support in a separate PR if necessary.

> I'm hesitant to match based on np.ndarray() class only, because while it's not widely advertised, checking for data is not None would have caught other array types implementing the __array_function__ protocol (see NEP18), but checking for isinstance(data, np.ndarray) would exclude those. That said, I'm wondering if the fallback to kind="vectors" might just work, but need to test this out more extensively.

Yes, I was thinking about checking __array_interface__ in #3351 (comment).

Member Author

Done in 4a4f192, although other array-like objects are not tested.
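To make the concern in this thread concrete, here is a minimal sketch of how a class-based check and a protocol-based check can disagree. It is an illustration only: `WrappedArray`, `is_matrix_by_class` and `is_matrix_by_protocol` are hypothetical names, and the protocol check is an assumption inferred from the title of commit 4a4f192 ("2-D array-like that implements '__array_interface__' is matrix"), not a copy of the code in that commit.

```python
import numpy as np


class WrappedArray:
    """Hypothetical array-like that is NOT a numpy.ndarray subclass."""

    def __init__(self, values):
        self._values = np.asarray(values)

    @property
    def ndim(self):
        return self._values.ndim

    @property
    def __array_interface__(self):
        # Re-expose the wrapped array's interface dictionary.
        return self._values.__array_interface__


def is_matrix_by_class(data):
    """Class-based rule: only real 2-D numpy.ndarray objects qualify."""
    return isinstance(data, np.ndarray) and data.ndim == 2


def is_matrix_by_protocol(data):
    """Protocol-based rule (assumed): any 2-D object exposing __array_interface__."""
    return hasattr(data, "__array_interface__") and data.ndim == 2


wrapped = WrappedArray([[1, 2], [3, 4]])
print(is_matrix_by_class(wrapped))
print(is_matrix_by_protocol(wrapped))
print(is_matrix_by_protocol(np.arange(4)))
```

Under the class-based rule the wrapper would fall through to "vectors"; under the protocol-based rule it would be treated as a matrix, while a 1-D array is rejected by both.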

         case _:  # Fall back to "vectors" if data is None and required=True.
             kind = "vectors"
2 changes: 1 addition & 1 deletion pygmt/src/legend.py
@@ -91,7 +91,7 @@ def legend(
             kwargs["F"] = box

     kind = data_kind(spec)
-    if kind not in {"vectors", "file", "stringio"}:  # kind="vectors" means spec is None
+    if spec is not None and kind not in {"file", "stringio"}:
         raise GMTInvalidInput(f"Unrecognized data type: {type(spec)}")
     if kind == "file" and is_nonstr_iter(spec):
         raise GMTInvalidInput("Only one legend specification file is allowed.")
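Why the legend check had to change can be shown with two small predicates. This is a sketch under assumptions: `old_check_rejects` and `new_check_rejects` are hypothetical names that mirror the two conditions in the hunk above and return a boolean instead of raising `GMTInvalidInput`.

```python
def old_check_rejects(kind):
    """Pre-change rule: reject anything that is not one of these three kinds."""
    return kind not in {"vectors", "file", "stringio"}


def new_check_rejects(spec, kind):
    """Post-change rule: any non-None spec must be a file or a StringIO."""
    return spec is not None and kind not in {"file", "stringio"}


# After this PR an unrecognized object such as a list is classified as "vectors".
spec, kind = [1, 2, 3], "vectors"
print(old_check_rejects(kind))
print(new_check_rejects(spec, kind))
print(new_check_rejects(None, "vectors"))
```

With the old condition a list would no longer be rejected once it is classified as "vectors", so the new condition tests `spec is not None` explicitly.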
2 changes: 1 addition & 1 deletion pygmt/src/x2sys_cross.py
@@ -195,7 +195,7 @@ def x2sys_cross(
         match data_kind(track):
             case "file":
                 file_contexts.append(contextlib.nullcontext(track))
-            case "matrix":
+            case "vectors":
Member Author

pandas.DataFrame is now the "vectors" kind.

                 # find suffix (-E) of trackfiles used (e.g. xyz, csv, etc) from
                 # $X2SYS_HOME/TAGNAME/TAGNAME.tag file
                 tagfile = Path(