Support writing object columns with np.nan values #118

theroggy · 2022-06-06T01:00:49Z

Closes #60

jorisvandenbossche

Thanks for working on this!

jorisvandenbossche · 2022-06-06T15:12:36Z

pyogrio/_io.pyx

-                        # this will fail for strings mixed with nans
-                        value_b = field_value.encode("UTF-8")
+                    if isinstance(field_value, float) and isnan(field_value):
+                        OGR_F_SetFieldNull(ogr_feature, field_idx)


Can we combine this with the field_value is None check above to set FieldNull in a single place?
(or move that check here, as currently the only way to get a None is through object dtype)

It seems like the check for isnan should be in the elif field_type == OFTReal: block? The field type for the incoming column should be float32 / float64 rather than object (though I haven't verified this)

@brendan-ward note that the specific case being fixed here is an object dtype column that contains NaN (pandas doesn't really distinguish None vs NaN in object columns).

Sure... if you prefer that... I moved the "is None" check as it is redundant for column types other than string/object.
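For readers following along, here is a minimal plain-Python sketch of the combined check discussed above. It is not the Cython code from `pyogrio/_io.pyx`: the helper names `is_null_like` and `encode_object_column` are hypothetical, and returning `None` only stands in for the place where the real code calls `OGR_F_SetFieldNull`.

```python
import math


def is_null_like(value):
    """Return True for None or a float NaN.

    Hypothetical helper for illustration; not part of pyogrio.
    """
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_object_column(values):
    """Encode an object column holding strings mixed with None/NaN.

    Null-like entries map to None (standing in for OGR_F_SetFieldNull);
    strings are encoded to UTF-8 bytes.
    """
    encoded = []
    for value in values:
        if is_null_like(value):
            encoded.append(None)
        else:
            encoded.append(value.encode("UTF-8"))
    return encoded
```

Checking for null-like values before calling `.encode()` is what avoids the failure the removed comment warned about ("this will fail for strings mixed with nans"), since a float has no `encode` method.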

@brendan-ward in addition to what @jorisvandenbossche wrote: np.nan values in an OFTReal (float) column are automatically treated correctly (as null) by GDAL (OGR_F_SetFieldDouble), so no need to explicitly call OGR_F_SetFieldNull for OFTReal columns. Only for object columns this is needed.

np.nan values in an OFTReal (float) column are automatically treated correctly (as null) by GDAL (OGR_F_SetFieldDouble), so no need to explicitly call OGR_F_SetFieldNull for OFTReal columns. Only for object columns this is needed.

Actually, I don't think that's fully correct. They are written as NaN, not as Null. Since numpy/pandas only support NaN in float arrays, that's not an issue for a correct round trip geopandas->gdal->geopandas, though.

Small illustrations:

    # write file with GDAL + pyogrio
    import geopandas
    import numpy as np
    import pyogrio

    gdf = geopandas.GeoDataFrame(
        {'col': [0.1, np.nan]}, geometry=geopandas.points_from_xy([0, 1], [0, 1])
    )
    pyogrio.write_dataframe(gdf, "test_nulls_pyogrio.arrow", driver="Arrow")

    # write file with pyarrow that includes both NaN and Null (Arrow distinguishes both)
    import pyarrow as pa
    from pyarrow import feather

    feather.write_feather(pa.table({"col": [0.1, np.nan, None]}), "test_nulls_pyarrow.arrow")

And check both using GDAL's ogrinfo:

    (gdal-dev) $ ogrinfo test_nulls_pyogrio.arrow -al
    INFO: Open of `test_nulls_pyogrio.arrow'
          using driver `Arrow' successful.

    Layer name: test_nulls_pyogrio
    Geometry: Point
    Feature Count: 2
    Extent: (0.000000, 0.000000) - (1.000000, 1.000000)
    Layer SRS WKT:
    (unknown)
    Geometry Column = geometry
    col: Real (0.0)
    OGRFeature(test_nulls_pyogrio):0
      col (Real) = 0.1
      POINT (0 0)

    OGRFeature(test_nulls_pyogrio):1
      col (Real) = nan
      POINT (1 1)

    (gdal-dev) $ ogrinfo test_nulls_pyarrow.arrow -al
    INFO: Open of `test_nulls_pyarrow.arrow'
          using driver `Arrow' successful.

    Layer name: test_nulls_pyarrow
    Geometry: None
    Feature Count: 3
    Layer SRS WKT:
    (unknown)
    col: Real (0.0)
    OGRFeature(test_nulls_pyarrow):0
      col (Real) = 0.1

    OGRFeature(test_nulls_pyarrow):1
      col (Real) = nan

    OGRFeature(test_nulls_pyarrow):2
      col (Real) = (null)

Opened #122 to further track this
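To make the open question concrete, here is a small sketch of the two policies for float columns: keep NaN as NaN (what the ogrinfo output above shows pyogrio doing), or map NaN to null. This is plain Python under stated assumptions: `prepare_float_values` and its `nan_as_null` flag are hypothetical names for illustration and are not pyogrio options.

```python
import math


def prepare_float_values(values, nan_as_null=False):
    """Map float values for writing under one of two policies.

    Hypothetical helper; `nan_as_null` is an illustrative flag, not a
    pyogrio option. None always becomes a null (None). NaN is kept as
    NaN unless nan_as_null is True, in which case it also becomes None.
    """
    prepared = []
    for value in values:
        if value is None:
            prepared.append(None)
        elif nan_as_null and math.isnan(value):
            prepared.append(None)
        else:
            prepared.append(value)
    return prepared
```

With the default, `[0.1, nan, None]` keeps the NaN/null distinction that Arrow supports; with `nan_as_null=True`, both missing markers collapse to null, which matches how pandas users usually think about missing data.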

jorisvandenbossche · 2022-06-06T15:16:02Z

pyogrio/tests/test_geopandas_io.py

+    geom = Point(0, 0)
+    test_data = {
+        "geometry": [geom, geom, geom],
+        "float64": [1.0, None, np.nan],


This doesn't have any effect in practice, as pandas converts both None and np.nan to NaN in a float column:

    In [10]: pd.Series([1.0, None, np.nan])
    Out[10]:
    0    1.0
    1    NaN
    2    NaN
    dtype: float64

Yes, that's indeed the case. Nonetheless, I think it is more transparent to have it explicitly in the test?

But obviously if you think that's better I can put 2 times np.nan or whatever...

pyogrio/tests/test_geopandas_io.py

brendan-ward

Thanks for working on this @theroggy ; a few comments to add

CHANGES.md

brendan-ward · 2022-06-06T16:16:35Z

pyogrio/_io.pyx

-                        # this will fail for strings mixed with nans
-                        value_b = field_value.encode("UTF-8")
+                    if isinstance(field_value, float) and isnan(field_value):
+                        OGR_F_SetFieldNull(ogr_feature, field_idx)


It seems like the check for isnan should be in the elif field_type == OFTReal: block? The field type for the incoming column should be float32 / float64 rather than object (though I haven't verified this)

brendan-ward

Thanks @theroggy !

theroggy added 3 commits June 6, 2022 02:48

Add support to write nan values in object columns

da48362

Add to changelog

ce1b68a

Consistent newlines

530b5d4

theroggy mentioned this pull request Jun 6, 2022

Add support to write object columns that contain types different than string #119

Closed

jorisvandenbossche reviewed Jun 6, 2022

brendan-ward reviewed Jun 6, 2022

theroggy and others added 3 commits June 6, 2022 18:34

Move is None check within string/object handling

744a031

Only test on gpkg

6c38520

Update CHANGES.md

cb28a2e

Co-authored-by: Brendan Ward

brendan-ward approved these changes Jun 6, 2022

brendan-ward merged commit 2a7f078 into geopandas:main Jun 7, 2022

theroggy deleted the support-nan-in-object-columns branch June 7, 2022 15:09

theroggy mentioned this pull request Jun 7, 2022

How to handle NaNs in the input when writing? (as NaN or as Null) #122

Closed
