Regarding __index_level_0__, I have seen a case where this column gets created when transcoding from pandas to Spark with Parquet. By default pandas.CSVDataset saves with index=False, but this is not consistent across the other pandas datasets (ParquetDataset, etc.).
Description
I had some problems with pandas.DeltaTableDataset when my nodes were returning a DataFrame. E.g., running the code below results in the error:
"name 'sepal_width' present in the specified schema is not found in the columns or index"
even with the column sepal_width defined as nullable.
I also had some problems related to the __index_level_0__ column when no schema was specified (see this issue).
Using pyarrow.Table.from_pandas(df) as the node return fixed both problems. Could this function be embedded into pandas.DeltaTableDataset in the next release of kedro-datasets?
Possible Implementation
Embed pyarrow.Table.from_pandas() inside the pandas.DeltaTableDataset.save() function.
Possible Alternatives
Use the pyarrow.Table.from_pandas() function in every node return.