use_arrow=True vs False: different empty dataframe result when no columns fetched #263

theroggy · 2023-08-10T15:21:47Z

Reading no columns from a file with read_dataframe gives an empty dataframe with no columns, but the result differs depending on use_arrow:

if False: you get a DataFrame with 0 rows
if True: you get a DataFrame with the number of rows of the input file, containing only the index.

Not sure what is the expected behaviour. For what it is worth, fiona seems to do the same as use_arrow=False.

Script to reproduce:

import pyogrio

url = "https://github.com/theroggy/pysnippets/raw/main/pysnippets/pyogrio/polygon-parcel_31370.zip"
for use_arrow in [True, False]:
    gdf = pyogrio.read_dataframe(url, use_arrow=use_arrow, columns=[], read_geometry=False)
    print(f"\nuse_arrow={use_arrow}, len(gdf): {len(gdf)}, gdf:\n{gdf}")

relevant output:

use_arrow=True, len(gdf): 46, gdf:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]

use_arrow=False, len(gdf): 0, gdf:
Empty DataFrame
Columns: []
Index: []
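For reference, the two shapes can be reproduced with pandas alone, without reading any file. This is only a sketch: the names rows_only and truly_empty are made up for illustration, and the row count of 46 is taken from the output above. It shows that pandas treats both frames as "empty" even though their lengths differ:

```python
import pandas as pd

# Same shape as the use_arrow=True result: 46 index entries, no columns.
rows_only = pd.DataFrame(index=range(46))

# Same shape as the use_arrow=False result: no rows, no columns.
truly_empty = pd.DataFrame()

for name, df in [("rows_only", rows_only), ("truly_empty", truly_empty)]:
    # DataFrame.empty is True when any axis has length 0, so it cannot
    # tell these two cases apart; len() and shape can.
    print(f"{name}: len={len(df)}, shape={df.shape}, empty={df.empty}")
```

So code that checks len(gdf) behaves differently between the two settings, while code that checks gdf.empty does not.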

brendan-ward · 2023-08-14T18:29:29Z

I wonder if we should instead raise an exception if read_geometry=False, columns=[], fid_as_index=False? Because in that case, there is nothing meaningful to return. If fid_as_index=True, then I suppose we should at least return an empty data frame with a non-empty index.

Alternatively, at the end of read_arrow we could detect if the arrow table has no columns after attempting to read it, and then return a truly empty table:

if table.num_columns == 0:
    # return empty pyarrow.Table if there were no columns read
    return pyarrow.Table.from_pylist([])

We could probably even do that preemptively before even attempting to read the data source (which may be remote and slower), but raising an exception during parameter validation seems perhaps a bit better to me. Otherwise, if we just preemptively return an empty data frame / table, then it may be out of sync with a read against the actual data source, which may fail for any number of reasons. I.e., it might be better that we don't succeed when reading no geometry / columns / FID when we'd fail otherwise if any of those are true / non-empty.
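A minimal sketch of what such a parameter check could look like. The helper name check_read_request and the error message are hypothetical, not pyogrio's actual code, and it assumes columns=None means "read all columns":

```python
def check_read_request(columns=None, read_geometry=True, fid_as_index=False):
    """Hypothetical helper: reject a read that requests nothing at all."""
    # Assumption: columns=None means "all columns", so only an explicitly
    # empty list counts as "no columns requested".
    no_columns = columns is not None and len(columns) == 0
    if no_columns and not read_geometry and not fid_as_index:
        raise ValueError(
            "at least one of read_geometry, fid_as_index, or a non-empty "
            "columns list must be requested"
        )


# Requesting only the FID index is still allowed.
check_read_request(columns=[], read_geometry=False, fid_as_index=True)

# Requesting nothing is rejected before any data source is opened.
try:
    check_read_request(columns=[], read_geometry=False)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because the check only looks at the arguments, it behaves the same for use_arrow=True and use_arrow=False.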

Thoughts?

kylebarron · 2023-08-14T18:57:32Z

> Otherwise, if we just preemptively return an empty data frame / table, then it may be out of sync with a read against the actual data source, which may fail for any number of reasons. I.e., it might be better that we don't succeed when reading no geometry / columns / FID when we'd fail otherwise if any of those are true / non-empty

IMO it would make sense to raise an error instead of returning an empty table, since an empty columns list more likely indicates a user error

jorisvandenbossche · 2023-08-14T21:45:22Z

> Alternatively, at the end of read_arrow we could detect if the arrow table has no columns after attempting to read it, and then return a truly empty table:

I personally wouldn't do this, as one could argue that the current result with use_arrow=True is more correct than a "truly" empty table, because you did read all rows, just no columns.

But certainly fine with detecting this case and raising an error. I think it will typically indeed be a user error.

theroggy · 2023-08-15T08:37:30Z

I also can't really think of use cases where reading no columns nor a geometry from a file would lead to something useful... so I agree it smells like a user error...

When fid_as_index=True you could return some data even though neither columns nor geometry are requested, but I still can't really think of a use for this either...

theroggy changed the title ~~use_arrow=True vs False: empty dataframe result~~ use_arrow=True vs False: different empty dataframe result when no columns fetched Aug 10, 2023

theroggy mentioned this issue Aug 16, 2023

ENH: support writing dataframes without geometry column #267

Merged

This was referenced Sep 2, 2023

use arrow by default? #278

Open

Raise error if function is used with parameters to read no geometry, columns, or fids #280

Merged

brendan-ward closed this as completed in #280 Sep 6, 2023
