Bugfix: support writing NAT in datetime column #146

Merged Sep 13, 2022 · 9 commits

Changes from 5 commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 31 additions & 25 deletions pyogrio/_io.pyx
@@ -1407,33 +1407,39 @@ def ogr_write(str path, str layer, str driver, geometry, field_data, fields,
                 OGR_F_SetFieldDouble(ogr_feature, field_idx, field_value)

             elif field_type == OFTDate:
-                datetime = field_value.item()
-                OGR_F_SetFieldDateTimeEx(
-                    ogr_feature,
-                    field_idx,
-                    datetime.year,
-                    datetime.month,
-                    datetime.day,
-                    0,
-                    0,
-                    0.0,
-                    0
-                )
+                if field_value is None or np.isnat(field_value):
+                    OGR_F_SetFieldNull(ogr_feature, field_idx)
+                else:
+                    datetime = field_value.item()
+                    OGR_F_SetFieldDateTimeEx(
+                        ogr_feature,
+                        field_idx,
+                        datetime.year,
+                        datetime.month,
+                        datetime.day,
+                        0,
+                        0,
+                        0.0,
+                        0
+                    )

             elif field_type == OFTDateTime:
-                # TODO: add support for timezones
-                datetime = field_value.astype("datetime64[ms]").item()
-                OGR_F_SetFieldDateTimeEx(
-                    ogr_feature,
-                    field_idx,
-                    datetime.year,
-                    datetime.month,
-                    datetime.day,
-                    datetime.hour,
-                    datetime.minute,
-                    datetime.second + datetime.microsecond / 10**6,
-                    0
-                )
+                if field_value is None or np.isnat(field_value):
Member commented:
Can it ever be None?

I am also wondering if we can do np.isnat more efficiently (knowing that NaT is basically the smallest int64 value). It might not matter much for performance, though, since below we convert to a Python datetime object, which will probably be much slower and dominate the runtime anyway (so if we wanted to improve performance here, we should probably look there first).
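
For reference, a quick check of the claim about NaT's int64 representation (illustrative only, not part of the PR):

```python
import numpy as np

nat = np.array(["NaT"], dtype="datetime64[ms]")
print(np.isnat(nat[0]))                                # True
# NaT is stored as the smallest int64 value (INT64_MIN)
print(nat.view("int64")[0] == np.iinfo(np.int64).min)  # True
```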

Member Author @theroggy commented on Aug 18, 2022:
> Can it ever be None?

Not sure. I was also wondering whether a check for NaN would be useful. The only case I could imagine is saving a string column into an existing datetime column (once append is supported), or something like that, but it sounds a bit far-fetched. Possibly the same goes for None: it will probably never happen in a numpy datetime column?

> I am also wondering if we can do np.isnat more efficiently (knowing that NaT is basically the smallest int64 value). It might not matter much for performance, though, since below we convert to a Python datetime object, which will probably be much slower and dominate the runtime anyway (so if we wanted to improve performance here, we should probably look there first).

Indeed. When I implemented the datetime write support I noticed that the performance is quite bad. Back then I experimented a bit with speeding it up by parsing the datetime using only C functions, but I didn't get it working (immediately). So the Python datetime conversion is the big bottleneck; np.isnat itself should normally be fast, I think, because the cdef (cimported) version of numpy is used as well?
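
For context, this is the per-feature conversion referred to above, mirroring the pattern in the diff; each feature ends up as a Python `datetime.datetime` object (a small illustration, not part of the diff):

```python
import numpy as np

field_value = np.datetime64("2002-02-03T13:56:03.072123456", "ns")

# Same pattern as in ogr_write: downcast to ms, then .item() gives datetime.datetime
py_dt = field_value.astype("datetime64[ms]").item()
print(type(py_dt))  # <class 'datetime.datetime'>
print(py_dt.year, py_dt.month, py_dt.day, py_dt.second + py_dt.microsecond / 10**6)
```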

Member commented:
In the pandas cython libraries, we have a bunch of code to convert a datetime64 numpy int to its different fields, but that's quite a lot of code that we can't just copy/paste (and it's not public, so we also can't import it from pandas).

One possible avenue would be to convert the full array up front to the different fields (there is public functionality in pandas for that; see the sketch below). That would help performance, but would increase memory usage.
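
A minimal sketch of that idea, assuming only the public `pd.DatetimeIndex` field accessors; the variable names are illustrative and NaT rows are handled via a separate mask:

```python
import numpy as np
import pandas as pd

arr = np.array(
    ["2001-01-01T12:00", "2002-02-03T13:56:03.072", "NaT"], dtype="datetime64[ms]"
)

# Compute all fields once for the whole column instead of once per feature
dti = pd.DatetimeIndex(arr)
years = np.asarray(dti.year)
months = np.asarray(dti.month)
days = np.asarray(dti.day)
hours = np.asarray(dti.hour)
minutes = np.asarray(dti.minute)
seconds = np.asarray(dti.second) + np.asarray(dti.microsecond) / 10**6

# Rows that should be written as null instead of a datetime
nat_mask = np.isnat(arr)
```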

Member Author @theroggy commented:

I did some more testing regarding the "field_value is None" check, and apparently:

  • if you add the value None to a datetime64 numpy array, it is automatically converted to NaT
  • if you add the value np.nan to a datetime64 numpy array, you get an error

So I removed the "field_value is None" check and added a None case to the unit test, so the test demonstrates this behaviour.
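
A short illustration of that behaviour (assuming a recent numpy; not part of the diff):

```python
import numpy as np

arr = np.array(["2022-01-01"], dtype="datetime64[ms]")

arr[0] = None                    # silently converted to NaT
print(arr[0], np.isnat(arr[0]))  # NaT True

try:
    arr[0] = np.nan              # assigning float NaN is rejected
except (TypeError, ValueError) as exc:
    print(f"error as expected: {type(exc).__name__}")
```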

+                    OGR_F_SetFieldNull(ogr_feature, field_idx)
+                else:
+                    # TODO: add support for timezones
+                    datetime = field_value.astype("datetime64[ms]").item()
+                    OGR_F_SetFieldDateTimeEx(
+                        ogr_feature,
+                        field_idx,
+                        datetime.year,
+                        datetime.month,
+                        datetime.day,
+                        datetime.hour,
+                        datetime.minute,
+                        datetime.second + datetime.microsecond / 10**6,
+                        0
+                    )

             else:
                 raise NotImplementedError(f"OGR field type is not supported for writing: {field_type}")
4 changes: 3 additions & 1 deletion pyogrio/tests/test_raw_io.py
@@ -437,13 +437,15 @@ def test_read_write_datetime(tmp_path):
["2001-01-01T12:00", "2002-02-03T13:56:03.072123456"],
dtype="datetime64[ns]",
),
np.array([np.datetime64("NaT"), np.datetime64("NaT")], dtype="datetime64[ms]"),
]
fields = [
"datetime64_d",
"datetime64_s",
"datetime64_ms",
"datetime64_ns",
"datetime64_precise_ns",
"datetime64_ms_nat",
]

# Point(0, 0)
@@ -460,7 +462,7 @@ def test_read_write_datetime(tmp_path):
             # gdal rounds datetimes to ms
             assert np.array_equal(result[idx], field_data[idx].astype("datetime64[ms]"))
         else:
-            assert np.array_equal(result[idx], field_data[idx])
+            assert np.array_equal(result[idx], field_data[idx], equal_nan=True)


def test_read_data_types_numeric_with_null(test_gpkg_nulls):
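
As an aside on the `equal_nan=True` change above: NaT, like NaN, never compares equal to itself, so the plain array comparison would fail for the new all-NaT column. A quick illustration, assuming a numpy version where `array_equal` supports `equal_nan` (added in numpy 1.19):

```python
import numpy as np

nat_col = np.array(["NaT", "NaT"], dtype="datetime64[ms]")

print(np.datetime64("NaT") == np.datetime64("NaT"))      # False: NaT never equals itself
print(np.array_equal(nat_col, nat_col))                  # False for the same reason
print(np.array_equal(nat_col, nat_col, equal_nan=True))  # True: NaT positions match
```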