**Breaking**: data_kind: Now 'matrix' represents a 2-D numpy array and unrecognized data types fall back to 'vectors' #3351

Merged
21 commits merged on Oct 16, 2024

Conversation

seisman
Member

@seisman seisman commented Jul 23, 2024

Need to wait for #3481.

Description of proposed changes

As can be seen in the doctests added in #3480, the matrix and vectors kinds are currently defined as follows:

  • vectors means data=None and required=True, i.e., the input is given via a series of vectors (e.g., x/y/z).
  • matrix represents any data type not recognized as arg/file/stringio/grid/image/geojson. The most common such types include pd.DataFrame, pd.Series, xr.Dataset, np.ndarray, and sequences of sequences.

So matrix here is an inclusive concept. However, in both Python and GMT, a matrix usually means a homogeneous 2-D array. When passing "matrix" data to GMT, we usually treat it as a series of vectors. The only exception is when the data is a strict matrix (a 2-D numpy array) with data.dtype.kind in "iuf".

pygmt/pygmt/clib/session.py

Lines 1742 to 1755 in d7560fa

elif kind == "matrix":  # turn 2-D arrays into list of vectors
    if hasattr(data, "items") and not hasattr(data, "to_frame"):
        # pandas.DataFrame or xarray.Dataset types.
        # pandas.Series will be handled below like a 1-D numpy.ndarray.
        _data = [array for _, array in data.items()]
    elif hasattr(data, "ndim") and data.ndim == 2 and data.dtype.kind in "iuf":
        # Just use virtualfile_from_matrix for 2-D numpy.ndarray
        # which are signed integer (i), unsigned integer (u) or
        # floating point (f) types
        _virtualfile_from = self.virtualfile_from_matrix
        _data = (data,)
    else:
        # Python list, tuple, numpy.ndarray, and pandas.Series types
        _data = np.atleast_2d(np.asanyarray(data).T)
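
To make the three branches above concrete, here is a minimal standalone sketch of the same conversion logic outside of Session. The helper name to_vectors_sketch and the returned labels are assumptions made for illustration and are not part of PyGMT; only the branch conditions are taken from the quoted code.

```python
import numpy as np


def to_vectors_sketch(data):
    """Return (label, payload) mimicking the three branches quoted above.

    Hypothetical helper for illustration only; it is not PyGMT API.
    """
    if hasattr(data, "items") and not hasattr(data, "to_frame"):
        # Mapping-like containers (dict, pandas.DataFrame, xarray.Dataset):
        # one vector per column/variable.
        return "vectors", [array for _, array in data.items()]
    if hasattr(data, "ndim") and data.ndim == 2 and data.dtype.kind in "iuf":
        # Strict numeric 2-D array: passed through as a single matrix.
        return "matrix", (data,)
    # Lists, tuples, 1-D arrays, non-numeric arrays: transpose so that each
    # row of the result is one column of the input.
    return "vectors", np.atleast_2d(np.asanyarray(data).T)


label, payload = to_vectors_sketch({"x": [1, 2, 3], "y": [4, 5, 6]})
print(label, len(payload))      # one vector per dictionary key

label, payload = to_vectors_sketch(np.zeros((3, 2)))
print(label, payload[0].shape)  # the 2-D array is passed unchanged

label, payload = to_vectors_sketch([[1, 2], [3, 4], [5, 6]])
print(label, payload.shape)     # three rows of two become two vectors of three
```

Note that only the second branch keeps the input as a matrix; everything else already ends up as a list of vectors, which is the observation that motivates the proposal.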

I think it makes more sense to use a stricter definition for matrix and let all other data types be vectors. I propose changing the definitions of the data kinds to:

  • matrix: a 2-D homogeneous numpy.ndarray
  • vectors: the fallback kind for any unrecognized data type, including a dictionary, pd.DataFrame, pd.Series, xr.Dataset, and nested lists/tuples.
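
The proposed split can be sketched in a few lines. This is an illustration under stated assumptions, not the actual data_kind implementation: the name proposed_kind_sketch is hypothetical, and the arg/file/stringio/grid/image/geojson and data=None cases that the real function handles first are deliberately left out.

```python
import numpy as np


def proposed_kind_sketch(data):
    """Classify data under the proposed matrix/vectors definitions only.

    Hypothetical helper; all other data kinds are out of scope here.
    """
    if isinstance(data, np.ndarray) and data.ndim == 2:
        return "matrix"
    return "vectors"  # fallback for every other data type


print(proposed_kind_sketch(np.ones((4, 3))))              # matrix
print(proposed_kind_sketch(np.ones(4)))                   # vectors
print(proposed_kind_sketch({"x": [1, 2], "y": [3, 4]}))   # vectors
print(proposed_kind_sketch([[1, 2], [3, 4]]))             # vectors
```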

The refactoring is mainly done in 808755d.

Of course, data=None, required=True is still recognized as "vectors". I also think we should give this case its own special kind, such as "none", which makes more sense and can also simplify the code. [This is done in a separate PR, #3482, but I'm OK with merging #3482 into this PR first before merging this PR into main.]

Other data kinds are unchanged.

This is a breaking change, since some data previously recognized as "matrix" are now recognized as "vectors", and what was previously "vectors" is now "none". But I guess very few users call this function directly, so the breaking change should have minimal effect.

@seisman seisman force-pushed the refactor/data_kind branch from ea90891 to c0a5bdb Compare July 23, 2024 08:31
@seisman seisman force-pushed the refactor/data_kind branch from c0a5bdb to 78066a4 Compare July 23, 2024 08:42
@seisman seisman marked this pull request as ready for review July 23, 2024 08:47
@seisman seisman added maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. labels Jul 23, 2024
@seisman seisman added this to the 0.13.0 milestone Jul 23, 2024
@seisman seisman added the run/benchmark Trigger the benchmark workflow in PRs label Jul 23, 2024
@seisman seisman force-pushed the refactor/data_kind branch 3 times, most recently from f160dd6 to 1c517e6 Compare July 23, 2024 15:37
@seisman seisman changed the title Refactor the data_kind function and improve docstrings [BREAKING CHANGES] Refactor the data_kind function to return "none"/"matrix"/"vectors" data kinds in a more reasonable way Jul 23, 2024
@seisman seisman force-pushed the refactor/data_kind branch from 1c517e6 to 21f581a Compare July 23, 2024 16:41
@seisman seisman force-pushed the refactor/data_kind branch 3 times, most recently from 52b711b to 713d0a2 Compare July 26, 2024 15:32
@seisman seisman mentioned this pull request Aug 1, 2024
39 tasks
@seisman seisman requested a review from a team August 1, 2024 06:20
return "geojson"

# A 2-D numpy.ndarray
if isinstance(data, np.ndarray) and data.ndim == 2:
Member Author

We can probably change this line to:

Suggested change
- if isinstance(data, np.ndarray) and data.ndim == 2:
+ if hasattr(data, "__array_interface__") and len(data.__array_interface__["shape"]) == 2:

to support more array-like objects that implement the __array_interface__ protocol (https://numpy.org/doc/stable/reference/arrays.interface.html#object.__array_interface__), but that can be done in a future PR.
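
As a rough illustration of such a check, the sketch below tests an object's dimensionality through the array interface. Note that __array_interface__ is a dictionary, so the number of dimensions comes from its "shape" entry. The class TinyArray and the function looks_like_2d_array are hypothetical names invented for this example, and TinyArray exposes no data buffer, so it is not a usable array.

```python
class TinyArray:
    """Hypothetical array-like that only advertises its shape."""

    def __init__(self, shape):
        self._shape = tuple(shape)

    @property
    def __array_interface__(self):
        # The protocol is a plain dict. A real implementation would also
        # expose the underlying memory (e.g., via the "data" entry).
        return {"shape": self._shape, "typestr": "<f8", "version": 3}


def looks_like_2d_array(data):
    """Return True if data advertises a 2-D shape via __array_interface__."""
    interface = getattr(data, "__array_interface__", None)
    return isinstance(interface, dict) and len(interface["shape"]) == 2


print(looks_like_2d_array(TinyArray((3, 2))))     # True
print(looks_like_2d_array(TinyArray((3,))))       # False
print(looks_like_2d_array([[1, 2], [3, 4]]))      # False: lists lack the protocol
```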

@seisman seisman removed the needs review This PR has higher priority and needs review. label Aug 4, 2024
@seisman seisman modified the milestones: 0.13.0, 0.14.0 Aug 4, 2024
@seisman seisman removed this from the 0.14.0 milestone Sep 5, 2024
@seisman seisman marked this pull request as draft September 13, 2024 00:30
@seisman seisman changed the base branch from main to data_kind/doctest October 2, 2024 13:28
@seisman seisman removed the run/benchmark Trigger the benchmark workflow in PRs label Oct 2, 2024
@GenericMappingTools GenericMappingTools deleted a comment from codspeed-hq bot Oct 3, 2024
@@ -195,7 +195,7 @@ def x2sys_cross(
         match data_kind(track):
             case "file":
                 file_contexts.append(contextlib.nullcontext(track))
-            case "matrix":
+            case "vectors":
Member Author

pandas.DataFrame is now the "vectors" kind.

@seisman seisman added the needs review This PR has higher priority and needs review. label Oct 7, 2024
@seisman seisman added this to the 0.14.0 milestone Oct 7, 2024
@seisman seisman marked this pull request as ready for review October 7, 2024 05:44
Comment on lines 319 to 320
case np.ndarray() if data.ndim == 2:  # A 2-D numpy.ndarray object.
    kind = "matrix"
Member

Two things:

  1. Should we extend kind="matrix" to 3-D numpy.ndarray objects?
  2. I'm hesitant to match based on np.ndarray() class only, because while it's not widely advertised, checking for data is not None would have caught other array types implementing the __array_function__ protocol (see NEP18), but checking for isinstance(data, np.ndarray) would exclude those. That said, I'm wondering if the fallback to kind="vectors" might just work, but need to test this out more extensively.

Member Author

  • Should we extend kind="matrix" to 3-D numpy.ndarray objects?

It's unclear whether we can pass a 3-D numpy array yet. Even if we can, it means more work, since virtualfile_from_vectors/virtualfile_from_matrix explicitly check whether ndim is 1 or 2. So it is better to revisit 3-D numpy array support in a separate PR if necessary.

  • I'm hesitant to match based on np.ndarray() class only, because while it's not widely advertised, checking for data is not None would have caught other array types implementing the __array_function__ protocol (see NEP18), but checking for isinstance(data, np.ndarray) would exclude those. That said, I'm wondering if the fallback to kind="vectors" might just work, but need to test this out more extensively.

Yes, I was thinking about checking __array_interface__ in #3351 (comment).

Member Author

Done in 4a4f192, although other array-like objects are not tested.

Comment on lines +1822 to +1823
if data.dtype.kind not in "iuf":
    _virtualfile_from = self.virtualfile_from_vectors
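
The dtype check in these two lines can be illustrated in isolation. This is a sketch, not the Session code: the function name pick_virtualfile_method_sketch is hypothetical and it returns the method name as a string instead of the bound method.

```python
import numpy as np


def pick_virtualfile_method_sketch(data):
    """Name the method a 2-D array would be routed to, based on its dtype.

    Hypothetical helper mirroring the dtype.kind check discussed above.
    """
    if data.dtype.kind not in "iuf":
        # Not a signed integer (i), unsigned integer (u) or float (f) array.
        return "virtualfile_from_vectors"
    return "virtualfile_from_matrix"


numeric = np.array([[1.0, 2.0], [3.0, 4.0]])
strings = np.array([["11E", "22N"], ["33W", "44S"]])  # made-up lon/lat strings

print(numeric.dtype.kind, pick_virtualfile_method_sketch(numeric))
print(strings.dtype.kind, pick_virtualfile_method_sketch(strings))
```

The comparison is case-sensitive, so a Unicode string array (kind "U") is not mistaken for an unsigned integer array (kind "u").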
Member

Missing test coverage for these lines?

Member Author

I can add a test for it. Before adding more tests, I'm wondering if we should split the big "test_clib_virtualfiles.py" file (with more than 500 lines) into separate smaller test files, i.e., one test file for each Session method.

  • test_clib_virtualfile_in
  • test_clib_virtualfile_from_vectors
  • test_clib_virtualfile_from_matrix
  • test_clib_open_virtualfile

Member

We've split it before in #2784, so yes, ok to split it again 😆

Member Author

We've split it before in #2784, so yes, ok to split it again 😆

Done in #3512.

Member Author

I've added a test in cfa32ed to cover this line.

@@ -101,3 +103,27 @@ def test_virtualfile_in_fail_non_valid_data(data):
z=data[:, 2],
data=data,
)

Member Author

This test is added to address #3351 (comment).

This new test passes a string-dtype numpy array containing longitude/latitude strings to the GMT C API. The data kind is "matrix", but since the dtype is not in "iuf", PyGMT uses virtualfile_from_vectors rather than virtualfile_from_matrix to pass the data to the GMT C API. Ideally, we should check that virtualfile_from_vectors is called once and virtualfile_from_matrix is not called, but I find that technically complicated with unittest.mock, so I'll leave it untested.
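
For reference, the kind of call-count check described above could look like the following on a toy class. This is only a sketch under strong assumptions: ToySession is hypothetical, its methods are plain functions that merely borrow the names discussed here, and whether the same approach transfers to the real Session methods is untested.

```python
from unittest import mock


class ToySession:
    """Hypothetical stand-in; only the method names echo the discussion."""

    def virtualfile_from_vectors(self, *vectors):
        return f"vectors:{len(vectors)}"

    def virtualfile_from_matrix(self, matrix):
        return "matrix"

    def virtualfile_in(self, data, numeric):
        # Toy dispatch: numeric data takes the matrix path, everything
        # else is passed column by column.
        if numeric:
            return self.virtualfile_from_matrix(data)
        return self.virtualfile_from_vectors(*data)


session = ToySession()
# wraps= keeps the original behaviour while recording the calls.
with mock.patch.object(
    session, "virtualfile_from_vectors", wraps=session.virtualfile_from_vectors
) as from_vectors:
    with mock.patch.object(
        session, "virtualfile_from_matrix", wraps=session.virtualfile_from_matrix
    ) as from_matrix:
        result = session.virtualfile_in(
            [["10E", "20E"], ["30N", "40N"]], numeric=False
        )

from_vectors.assert_called_once()
from_matrix.assert_not_called()
print(result)
```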

The function is marked as xfail with GMT 6.5 due to a newly found upstream bug at GenericMappingTools/gmt#8600.

Member

Good catch, and thanks for adding the test!

@seisman seisman added the run/test-gmt-dev Trigger the GMT Dev Tests workflow in PR label Oct 15, 2024
@seisman seisman force-pushed the refactor/data_kind branch from 6b7a578 to 423e5dc Compare October 15, 2024 08:46
@weiji14 weiji14 added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Oct 16, 2024
@seisman seisman removed final review call This PR requires final review and approval from a second reviewer run/test-gmt-dev Trigger the GMT Dev Tests workflow in PR labels Oct 16, 2024
@seisman seisman merged commit a5c0aa2 into main Oct 16, 2024
23 of 25 checks passed
@seisman seisman deleted the refactor/data_kind branch October 16, 2024 02:02
Labels
maintenance Boring but important stuff for the core devs
2 participants