Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly reserve the grid data dtype by converting ctypes array to numpy array with np.ctypeslib.as_array #3446

Merged
merged 3 commits into from
Sep 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 32 additions & 30 deletions pygmt/datatypes/grid.py
Original file line number Diff line number Diff line change
Expand Up @@ -40,16 +40,15 @@ class _GMT_GRID(ctp.Structure): # noqa: N801
... print(header.pad[:])
... print(header.mem_layout, header.nan_value, header.xy_off)
... # The x and y coordinates
... print(grid.x[: header.n_columns])
... print(grid.y[: header.n_rows])
... x = np.ctypeslib.as_array(grid.x, shape=(header.n_columns,)).copy()
... y = np.ctypeslib.as_array(grid.y, shape=(header.n_rows,)).copy()
... # The data array (with paddings)
... data = np.reshape(
... grid.data[: header.mx * header.my], (header.my, header.mx)
... )
... data = np.ctypeslib.as_array(
... grid.data, shape=(header.my, header.mx)
... ).copy()
... # The data array (without paddings)
... pad = header.pad[:]
... data = data[pad[2] : header.my - pad[3], pad[0] : header.mx - pad[1]]
... print(data)
14 8 1
[-55.0, -47.0, -24.0, -10.0] 190.0 981.0 [1.0, 1.0]
1.0 0.0
Expand All @@ -61,22 +60,26 @@ class _GMT_GRID(ctp.Structure): # noqa: N801
18 1 12 18
[2, 2, 2, 2]
b'' nan 0.5
[-54.5, -53.5, -52.5, -51.5, -50.5, -49.5, -48.5, -47.5]
[-10.5, -11.5, -12.5, -13.5, -14.5, -15.5, ..., -22.5, -23.5]
[[347.5 331.5 309. 282. 190. 208. 299.5 348. ]
[349. 313. 325.5 247. 191. 225. 260. 452.5]
[345.5 320. 335. 292. 207.5 247. 325. 346.5]
[450.5 395.5 366. 248. 250. 354.5 550. 797.5]
[494.5 488.5 357. 254.5 286. 484.5 653.5 930. ]
[601. 526.5 535. 299. 398.5 645. 797.5 964. ]
[308. 595.5 555.5 556. 580. 770. 927. 920. ]
[521.5 682.5 796. 886. 571.5 638.5 739.5 881.5]
[310. 521.5 757. 570.5 538.5 524. 686.5 794. ]
[561.5 539. 446.5 481.5 439.5 553. 726.5 981. ]
[557. 435. 385.5 345.5 413.5 496. 519.5 833.5]
[373. 367.5 349. 352.5 419.5 428. 570. 667.5]
[383. 284.5 344.5 394. 491. 556.5 578.5 618.5]
[347.5 344.5 386. 640.5 617. 579. 646.5 671. ]]
>>> x
array([-54.5, -53.5, -52.5, -51.5, -50.5, -49.5, -48.5, -47.5])
>>> y
array([-10.5, -11.5, -12.5, -13.5, ..., -20.5, -21.5, -22.5, -23.5])
>>> data
array([[347.5, 331.5, 309. , 282. , 190. , 208. , 299.5, 348. ],
[349. , 313. , 325.5, 247. , 191. , 225. , 260. , 452.5],
[345.5, 320. , 335. , 292. , 207.5, 247. , 325. , 346.5],
[450.5, 395.5, 366. , 248. , 250. , 354.5, 550. , 797.5],
[494.5, 488.5, 357. , 254.5, 286. , 484.5, 653.5, 930. ],
[601. , 526.5, 535. , 299. , 398.5, 645. , 797.5, 964. ],
[308. , 595.5, 555.5, 556. , 580. , 770. , 927. , 920. ],
[521.5, 682.5, 796. , 886. , 571.5, 638.5, 739.5, 881.5],
[310. , 521.5, 757. , 570.5, 538.5, 524. , 686.5, 794. ],
[561.5, 539. , 446.5, 481.5, 439.5, 553. , 726.5, 981. ],
[557. , 435. , 385.5, 345.5, 413.5, 496. , 519.5, 833.5],
[373. , 367.5, 349. , 352.5, 419.5, 428. , 570. , 667.5],
[383. , 284.5, 344.5, 394. , 491. , 556.5, 578.5, 618.5],
[347.5, 344.5, 386. , 640.5, 617. , 579. , 646.5, 671. ]],
dtype=float32)
"""

_fields_: ClassVar = [
Expand Down Expand Up @@ -126,7 +129,8 @@ def to_dataarray(self) -> xr.DataArray:
[450.5, 395.5, 366. , 248. , 250. , 354.5, 550. , 797.5],
[345.5, 320. , 335. , 292. , 207.5, 247. , 325. , 346.5],
[349. , 313. , 325.5, 247. , 191. , 225. , 260. , 452.5],
[347.5, 331.5, 309. , 282. , 190. , 208. , 299.5, 348. ]])
[347.5, 331.5, 309. , 282. , 190. , 208. , 299.5, 348. ]],
dtype=float32)
Coordinates:
* lat (lat) float64... -23.5 -22.5 -21.5 -20.5 ... -12.5 -11.5 -10.5
* lon (lon) float64... -54.5 -53.5 -52.5 -51.5 -50.5 -49.5 -48.5 -47.5
Expand Down Expand Up @@ -169,16 +173,14 @@ def to_dataarray(self) -> xr.DataArray:
# Get dimensions and their attributes from the header.
dims, dim_attrs = header.dims, header.dim_attrs
# The coordinates, given as a tuple of the form (dims, data, attrs)
coords = [
(dims[0], self.y[: header.n_rows], dim_attrs[0]),
(dims[1], self.x[: header.n_columns], dim_attrs[1]),
]
x = np.ctypeslib.as_array(self.x, shape=(header.n_columns,)).copy()
y = np.ctypeslib.as_array(self.y, shape=(header.n_rows,)).copy()
coords = [(dims[0], y, dim_attrs[0]), (dims[1], x, dim_attrs[1])]

# The data array without paddings
data = np.ctypeslib.as_array(self.data, shape=(header.my, header.mx)).copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to be cautious that we need to make a copy of the ctypes/numpy array, since the memory of the array is allocated by GMT and will be freed when exiting the session.

Trying to understand this sentence a bit more. So we can't assign ownership of the array data from GMT to numpy/xarray, and need to make a copy to ensure that exiting the GMT session doesn't crash the program? It doesn't look like we needed to explicitly make a copy before, so why is it needed now?

Copy link
Member Author

@seisman seisman Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can't assign ownership of the array data from GMT to numpy/xarray, and need to make a copy to ensure that exiting the GMT session doesn't crash the program?

Yes, the arrays allocated by GMT will be freed when exiting the GMT session. This can be verified by the example below. In the GMT session, we can correctly access the grid.x array, but outside the session, grid.x reports a "NULL pointer access" error.

In [6]: from pygmt.clib import Session

In [7]: with Session() as lib:
   ...:     with lib.virtualfile_out(kind="grid") as voutgrd:
   ...:         lib.call_module("read", ["@static_earth_relief.nc", voutgrd, "-Tg"])
   ...:         grid = lib.read_virtualfile(voutgrd, kind="grid").contents
   ...:         header = grid.header.contents
   ...:         print(grid.x[:header.n_columns])
   ...: print(grid.x[0:5])
[-54.5, -53.5, -52.5, -51.5, -50.5, -49.5, -48.5, -47.5]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 7
      5         header = grid.header.contents
      6         print(grid.x[:header.n_columns])
----> 7 print(grid.x[0:5])

ValueError: NULL pointer access

It doesn't look like we needed to explicitly make a copy before, so why is it needed now?

As explained in the top post, in the previous version, we first access the ctypes array via the list slice syntax (i.e., ctypes_array[:120]) and then convert the list slice to a numpy array by np.reshape. The list slice doesn't make a new copy of the array memory, but the list->ndarray conversion changes the data dtype from float32 (the default dtype for GMT grid) to float64 (the default dtype for numpy). I think in this conversion, numpy makes a new copy of the array:

array1 = np.reshape(ctypes_array[:120], (10, 12))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we needed to explicitly make a copy before, so why is it needed now?

As explained in the top post, in the previous version, we first access the ctypes array via the list slice syntax (i.e., ctypes_array[:120]) and then convert the list slice to a numpy array by np.reshape. The list slice doesn't make a new copy of the array memory, but the list->ndarray conversion changes the data dtype from float32 (the default dtype for GMT grid) to float64 (the default dtype for numpy). I think in this conversion, numpy makes a new copy of the array:

array1 = np.reshape(ctypes_array[:120], (10, 12))

Here is a test script:

In [22]: import ctypes

In [23]: import numpy as np

In [24]: ctypes_array = (ctypes.c_uint8 * 12)(*range(12))

In [25]: array1 = np.reshape(ctypes_array[:12], (3, 4))

In [26]: ctypes_array[:12]
Out[26]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [27]: array1
Out[27]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [28]: ctypes_array[:12] = list(range(100, 112))

In [29]: ctypes_array
Out[29]: <__main__.c_ubyte_Array_12 at 0x7f88d81b1fd0>

In [30]: ctypes_array[:12]
Out[30]: [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111]

In [31]: array1
Out[31]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the list->ndarray conversion changes the data dtype from float32 (the default dtype for GMT grid) to float64 (the default dtype for numpy). I think in this conversion, numpy makes a new copy of the array:

Ok, this part makes sense, so we were implicitly doing a copy of the array before, whereas now it's more explicit via a .copy() call.

the arrays allocated by GMT will be freed when exiting the GMT session.

Thanks for showing the "NULL pointer access" exception. We probably don't want the pointer to remain active after the session is closed since that would be a memory leak, but is there a way for GMT to pass ownership of the array data to NumPy before exiting the session, so that the memory is then managed by Python (and its garbage collector)?

Copy link
Member Author

@seisman seisman Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way for GMT to pass ownership of the array data to NumPy before exiting the session, so that the memory is then managed by Python (and its garbage collector)?

I thought the API function GMT_Set_AllocMode may help, but I just tried it locally and it doesn't work as I expected. Maybe another upstream bug or something else that needs to debug the C codes.

Edit: The API documentation says:

When external programs supply their own allocated memory (e.g., grid arrays, matrices) that are then hooked into the GMT containers data pointer, we need to flag the allocation mode as being external to GMT.
...
This change prevents GMT from trying to free memory it did not allocate.

The API function only tells GMT to not free the memory allocated by Python and can't do the reverse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From this post https://stackoverflow.com/questions/28760273/python-ctypes-and-garbage-collection and some answers from ChatGPT, it seems even if it's possible to tell GMT to not free the memory, the Python side is still responsible to free the memory explicitly. Otherwise, there may be memory leaks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so we either need to:

  1. Allocate the memory space in Python and have GMT C use it with GMT_Set_AllocMode so that memory is still managed by Python
  2. Allocate memory space in GMT C, pass ownership to Python, and ensure that Python handles free-ing the memory when no longer needed

Maybe something to think about longer term if we want to avoid copies 🙂

pad = header.pad[:]
data = np.reshape(self.data[: header.mx * header.my], (header.my, header.mx))[
pad[2] : header.my - pad[3], pad[0] : header.mx - pad[1]
]
data = data[pad[2] : header.my - pad[3], pad[0] : header.mx - pad[1]]

# Create the xarray.DataArray object
grid = xr.DataArray(
Expand Down
19 changes: 9 additions & 10 deletions pygmt/datatypes/image.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,13 +38,12 @@ class _GMT_IMAGE(ctp.Structure): # noqa: N801
... # Image-specific attributes.
... print(image.type, image.n_indexed_colors)
... # The x and y coordinates
... x = image.x[: header.n_columns]
... y = image.y[: header.n_rows]
... x = np.ctypeslib.as_array(image.x, shape=(header.n_columns,)).copy()
... y = np.ctypeslib.as_array(image.y, shape=(header.n_rows,)).copy()
... # The data array (with paddings)
... data = np.reshape(
... image.data[: header.n_bands * header.mx * header.my],
... (header.my, header.mx, header.n_bands),
... )
... data = np.ctypeslib.as_array(
... image.data, shape=(header.my, header.mx, header.n_bands)
... ).copy()
... # The data array (without paddings)
... pad = header.pad[:]
... data = data[pad[2] : header.my - pad[3], pad[0] : header.mx - pad[1], :]
Expand All @@ -60,10 +59,10 @@ class _GMT_IMAGE(ctp.Structure): # noqa: N801
[2, 2, 2, 2]
b'BRPa' 0.5
1 0
>>> x
[-179.5, -178.5, ..., 178.5, 179.5]
>>> y
[89.5, 88.5, ..., -88.5, -89.5]
>>> x # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
array([-179.5, -178.5, ..., 178.5, 179.5])
>>> y # doctest: +NORMALIZE_WHITESPACE, +ELLIPSIS
array([ 89.5, 88.5, ..., -88.5, -89.5])
>>> data.shape
(180, 360, 3)
>>> data.min(), data.max()
Expand Down
2 changes: 1 addition & 1 deletion pygmt/tests/test_sphinterpolate.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,4 +41,4 @@ def test_sphinterpolate_no_outgrid(mars):
npt.assert_allclose(temp_grid.max(), 14628.144)
npt.assert_allclose(temp_grid.min(), -6908.1987)
npt.assert_allclose(temp_grid.median(), 118.96849)
npt.assert_allclose(temp_grid.mean(), 272.60578)
npt.assert_allclose(temp_grid.mean(), 272.60593)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grid mean value is updated from 272.60578 to 272.60593.

To verify which value is correct, I first generate the output grid using the following GMT CLI:

gmt sphinterpolate @mars370d.txt -Rg -I1 -Gout.nc

then get the mean value by:

In [1]: import xarray as xr

In [2]: da = xr.load_dataarray("out.nc")

In [3]: da.mean()
Out[3]:
<xarray.DataArray 'z' ()> Size: 4B
array(272.60593, dtype=float32)

With option 1, the ctypes array was first converted to a Python list and then converted to a numpy array by np.reshape. The dtype of the result numpy array is float64. However, the GMT netCDF grids are stored in float32 by default. So, there are data type conversions with option 1, which can cause the tiny difference in the data mean value.

One thing to note is that, in GMT, the mean value is weighted by spherical area for geographic grids, so it's different from the one reported by xarray.

gmt grdinfo out.nc -L2
out.nc: Title:
out.nc: Command: gmt sphinterpolate /Users/seisman/.gmt/cache/mars370d.txt -Rg -I1 -Gout.nc
out.nc: Remark:
out.nc: Gridline node registration used [Geographic grid]
out.nc: Grid file format: nf = GMT netCDF format (32-bit float), CF-1.7
out.nc: x_min: 0 x_max: 360 x_inc: 1 name: longitude n_columns: 361
out.nc: y_min: -90 y_max: 90 y_inc: 1 name: latitude n_rows: 181
out.nc: v_min: -6908.19873047 v_max: 14628.1435547 name: z
out.nc: scale_factor: 1 add_offset: 0
out.nc: mean: 177.476587475 stdev: 3671.3669124 rms: 3675.62602319
out.nc: format: netCDF-4 chunk_size: 181,181 shuffle: on deflation_level: 3
out.nc: Default CPT: