Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading an empty series leads to a crash when as_of parameter is used #940

Closed
G-D-Petrov opened this issue Oct 9, 2023 · 10 comments · Fixed by #1049
Closed

Reading an empty series leads to a crash when as_of parameter is used #940

G-D-Petrov opened this issue Oct 9, 2023 · 10 comments · Fixed by #1049
Assignees
Labels
bug Something isn't working
Milestone

Comments

@G-D-Petrov
Copy link
Collaborator

Describe the bug

When using the as_of parameter in reads, if the return data is an empty series, this leads to a crash.
The problem seems to reproduce only on 4.0.0 and only when as_of is used.
See full repro below

Steps/Code to Reproduce

import pandas as pd
from arcticdb import Arctic
ac = Arctic("lmdb://tst")
ac.create_library("1")
lib = ac["1"]
s = pd.Series(dtype="datetime64[ns]")
lib.write("sym", s)
lib.read("sym", as_of=0).data

Expected Results

The read operation should return an empty series.
On 3.0.0, the type of the series is: Series([], dtype: datetime64[ns])

OS, Python Version and ArcticDB Version

ArcticDB: 4.0.0

Backend storage used

Any storage

Additional Context

No response

@G-D-Petrov G-D-Petrov added the bug Something isn't working label Oct 9, 2023
@poodlewars
Copy link
Collaborator

Also note that reads without the as_of with 4.0.0 return the wrong dtype Series([], dtype: float)

@phoebusm
Copy link
Collaborator

phoebusm commented Oct 10, 2023

User reports that if a empty series is written to a library using a 4.0 client, exception ArcticNativeCxxException: std::runtime_error(Invalid dtype 'UNKNOWN' in visit dim) will be thrown if reading it with old client

PR maybe related: #804

@jjerphan
Copy link
Collaborator

I cannot reproduce the crash:

  • on ArcticDB 4.0.0 for Linux on conda-forge (build: py310hea85ef7_3)
  • as of 870a01c in development

yet I get the behavior of #940 (comment).

@jjerphan
Copy link
Collaborator

In #804, we chose to have ArcticDB store empty columns under a dedicated EMPTYVAL type in case new rows of a different type are added, specifying which dtype is to use. For now, we loose the information about the dtype of the original series when it is empty.

To me, we need to finish the discussions of #224 first.

@G-D-Petrov
Copy link
Collaborator Author

Some clarification on reproducing the bug,.
It looks like we need to write a version with an older version of ArcticDB (e.g. arcticdb==3.0.0)
And then, we need to read the exact same same version with ArcticDB 4.0.0.
That is why we had to used as_of on the past.

@DrNickClarke
Copy link
Contributor

DrNickClarke commented Oct 19, 2023

This is the error message from reading in v4.0.0 the data written in v3.0.0 using the above code

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 # this repros the issue, by writing a version in v3.0.0 and reading in v4.0.0
----> 2 lib.read(empty_datetime_sym, as_of=7).data

File /pegasus-next-venv/lib/python3.8/site-packages/arcticdb/version_store/library.py:986, in Library.read(self, symbol, as_of, date_range, columns, query_builder)
    921 def read(
    922     self,
    923     symbol: str,
   (...)
    927     query_builder: Optional[QueryBuilder] = None,
    928 ) -> VersionedItem:
    929     """
    930     Read data for the named symbol.  Returns a VersionedItem object with a data and metadata element (as passed into
    931     write).
   (...)
    984     2       7
    985     """
--> 986     return self._nvs.read(
    987         symbol=symbol, as_of=as_of, date_range=date_range, columns=columns, query_builder=query_builder
    988     )

File /pegasus-next-venv/lib/python3.8/site-packages/arcticdb/version_store/_store.py:1620, in NativeVersionStore.read(self, symbol, as_of, date_range, row_range, columns, query_builder, **kwargs)
   1616 version_query, read_options, read_query = self._get_queries(
   1617     as_of, date_range, row_range, columns, query_builder, **kwargs
   1618 )
   1619 read_result = self._read_dataframe(symbol, version_query, read_query, read_options)
-> 1620 return self._post_process_dataframe(read_result, read_query, query_builder)

File /pegasus-next-venv/lib/python3.8/site-packages/arcticdb/version_store/_store.py:1707, in NativeVersionStore._post_process_dataframe(self, read_result, read_query, query_builder)
   1704         data.append(c[start_idx:end_idx])
   1705     read_result.frame_data = FrameData(data, read_result.frame_data.names, read_result.frame_data.index_columns)
-> 1707 vitem = self._adapt_read_res(read_result)
   1709 # Handle custom normalized data
   1710 if len(read_result.keys) > 0:

File /pegasus-next-venv/lib/python3.8/site-packages/arcticdb/version_store/_store.py:1881, in NativeVersionStore._adapt_read_res(self, read_result)
   1878 frame_data = FrameData.from_cpp(read_result.frame_data)
   1880 meta = denormalize_user_metadata(read_result.udm, self._normalizer)
-> 1881 data = self._normalizer.denormalize(frame_data, read_result.norm)
   1882 if read_result.norm.HasField("custom"):
   1883     data = self._custom_normalizer.denormalize(data, read_result.norm.custom)

File /pegasus-next-venv/lib/python3.8/site-packages/arcticdb/version_store/_normalization.py:1234, in CompositeNormalizer.denormalize(self, item, norm_meta)
   1232     return self.df.denormalize(item, norm_meta.df)
   1233 elif input_type == "series":
-> 1234     return self.series.denormalize(item, norm_meta.series)
   1235 elif input_type == "ts":
   1236     return self.tf.denormalize(item, norm_meta.ts)

File /pegasus-next-venv/lib/python3.8/site-packages/arcticdb/version_store/_normalization.py:598, in SeriesNormalizer.denormalize(self, item, norm_meta)
    591     s.name = None
    593 if s.empty:
    594     # Before Pandas 2.0, empty series' dtype was float, but as of Pandas 2.0. empty series' dtype became object.
    595     # See: [https://github.com/pandas-dev/pandas/issues/17261](https://github.com/pandas-dev/pandas/issues/17261%3C/span%3E)
    596     # We want to maintain consistent behaviour, so we return empty series as containing objects
    597     # when the Pandas version is >= 2.0
--> 598     s = s.astype("object") if IS_PANDAS_TWO else s.astype("float")
    600 return s

File /pegasus-next-venv/lib/python3.8/site-packages/pandas/core/generic.py:5698, in NDFrame.astype(self, dtype, copy, errors)
   5691     results = [
   5692         self.iloc[:, i].astype(dtype, copy=copy)
   5693         for i in range(len(self.columns))
   5694     ]
   5696 else:
   5697     # else, only a single dtype is given
-> 5698     new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
   5699     return self._constructor(new_data).__finalize__(self)
   5701 # GH 19920: retain column metadata after concat

File /pegasus-next-venv/lib/python3.8/site-packages/pandas/core/internals/managers.py:582, in BlockManager.astype(self, dtype, copy, errors)
    581 def astype(self, dtype, copy: bool = False, errors: str = "raise"):
--> 582     return self.apply("astype", dtype=dtype, copy=copy, errors=errors)

File /pegasus-next-venv/lib/python3.8/site-packages/pandas/core/internals/managers.py:442, in BlockManager.apply(self, f, filter, **kwargs)
    440         applied = b.apply(f, **kwargs)
    441     else:
--> 442         applied = getattr(b, f)(**kwargs)
    443     result_blocks = _extend_blocks(applied, result_blocks)
    445 if len(result_blocks) == 0:

File /pegasus-next-venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py:2227, in DatetimeBlock.astype(self, dtype, copy, errors)
   2224     return self.make_block(values)
   2226 # delegate
-> 2227 return super().astype(dtype=dtype, copy=copy, errors=errors)

File /pegasus-next-venv/lib/python3.8/site-packages/pandas/core/internals/blocks.py:625, in Block.astype(self, dtype, copy, errors)
    623 vals1d = values.ravel()
    624 try:
--> 625     values = astype_nansafe(vals1d, dtype, copy=True)
    626 except (ValueError, TypeError):
    627     # e.g. astype_nansafe can fail on object-dtype of strings
    628     #  trying to convert to float
    629     if errors == "raise":

File /pegasus-next-venv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py:841, in astype_nansafe(arr, dtype, copy, skipna)
    838     if dtype.kind == "M":
    839         return arr.astype(dtype)
--> 841     raise TypeError(f"cannot astype a datetimelike from [{arr.dtype}] to [{dtype}]")
    843 elif is_timedelta64_dtype(arr):
    844     if is_object_dtype(dtype):

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]

@DrNickClarke
Copy link
Contributor

Here is a summary of the current behaviour

  • s round-trips correctly in v3.0.0
  • s reads as float64 dtype when written and read in v4.0.0
  • both versions return False when s is passed to lib._nvs.will_item_be_pickled(s)
  • reading in v4.0.0 a version written in v3.0.0 gives the error message above

@DrNickClarke
Copy link
Contributor

Confirmed with the app owners who raised the issue that the app is doing a read of an empty dataframe with an as_of on the first run after an upgrade to v4.0.0, which fits this repro case closely. Also the error message is exactly the same.

@alexowens90 alexowens90 added this to the 4.2.0 milestone Oct 24, 2023
G-D-Petrov added a commit that referenced this issue Nov 1, 2023
… api methods (#989)

#### Reference Issues/PRs
Fixes #744 

#### What does this implement or fix?
Add new tests and extend already existing ones to test against the
persistent storages

There is some commented code, that needs to be uncommented once there is
a fix for issue #940

---------

Co-authored-by: Georgi Petrov <[email protected]>
@G-D-Petrov
Copy link
Collaborator Author

I have some tests that address this issue as part of issue #744 which are commented out at the moment.
You can check what errors they are raising in this run.
When this issue is fixed, please enable them, they are in this file:

# TODO: Fix me when the cast bug is fixed #940

# TODO: Fix me when the cast bug is fixed #940

@jjerphan
Copy link
Collaborator

jjerphan commented Nov 3, 2023

This issue is an edge case of #987.

jjerphan added a commit that referenced this issue Nov 13, 2023
Signed-off-by: Julien Jerphanion <[email protected]>
grusev pushed a commit that referenced this issue Nov 25, 2024
… api methods (#989)

#### Reference Issues/PRs
Fixes #744 

#### What does this implement or fix?
Add new tests and extend already existing ones to test against the
persistent storages

There is some commented code, that needs to be uncommented once there is
a fix for issue #940

---------

Co-authored-by: Georgi Petrov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants