Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use_arrow=True vs False: different empty dataframe result when no columns fetched #263

Closed
theroggy opened this issue Aug 10, 2023 · 4 comments · Fixed by #280
Closed

use_arrow=True vs False: different empty dataframe result when no columns fetched #263

theroggy opened this issue Aug 10, 2023 · 4 comments · Fixed by #280

Comments

@theroggy
Copy link
Member

When reading no columns from a file, this results in an empty dataframe with no columns using read_dataframe, you get a different result depending on use_arrow:

  • if False: you get a DataFrame with 0 rows
  • if True: you get a DataFrame with the number of rows of the input file, containing only the index.

Not sure what is the expected behaviour. For what it is worth, fiona seems to do the same as use_arrow=False.

Script to reproduce:

import pyogrio

url = "https://github.com/theroggy/pysnippets/raw/main/pysnippets/pyogrio/polygon-parcel_31370.zip"
for use_arrow in [True, False]:
    gdf = pyogrio.read_dataframe(url, use_arrow=use_arrow, columns=[], read_geometry=False)
    print(f"\nuse_arrow={use_arrow}, len(gdf): {len(gdf)}, gdf:\n{gdf}")

relevant output:

use_arrow=True, len(gdf): 46, gdf:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]

use_arrow=False, len(gdf): 0, gdf:
Empty DataFrame
Columns: []
Index: []
@theroggy theroggy changed the title use_arrow=True vs False: empty dataframe result use_arrow=True vs False: different empty dataframe result when no columns fetched Aug 10, 2023
@brendan-ward
Copy link
Member

I wonder if we should instead raise an exception if read_geometry=False, columns=[], fid_as_index=False? Because in that case, there is nothing meaningful to return. If fid_as_index=True, then I suppose we should at least return an empty data frame with a non-empty index.

Alternatively, at the end of read_arrow we could detect if the arrow table has no columns after attempting to read it, and then return a truly empty table:

if table.num_columns == 0:
    # return empty pyarrow.Table if there were no columns read
    return pyarrow.Table.from_pylist([])

We could probably even do that preemptively before even attempting to read the data source (which may be remote and slower), but raising an exception during parameter validation seems perhaps a bit better to me. Otherwise, if we just preemptively return an empty data frame / table, then it may out of sync with a read against the actual data source, which may fail for any number of reasons. I.e., it might be better that we don't succeed when reading no geometry / columns / FID when we'd fail otherwise if any of those are true / non-empty.

Thoughts?

@kylebarron
Copy link
Contributor

Otherwise, if we just preemptively return an empty data frame / table, then it may out of sync with a read against the actual data source, which may fail for any number of reasons. I.e., it might be better that we don't succeed when reading no geometry / columns / FID when we'd fail otherwise if any of those are true / non-empty

IMO it would make sense to error instead of returning an empty table, as that's more likely to imply user error that columns was empty

@jorisvandenbossche
Copy link
Member

Alternatively, at the end of read_arrow we could detect if the arrow table has no columns after attempting to read it, and then return a truly empty table:

I personally wouldn't do this, as one could argue that the the current result with use_arrow=True is more correct than a "truly" empty table. Because you did read all rows, just no columns ..

But certainly fine with detecting this case and raising an error. I think it will typically indeed be a user error.

@theroggy
Copy link
Member Author

I also can't really think of use cases where reading no columns nor a geometry from a file would lead to something useful... so I agree it smells like a user error...

When fid_as_index=True you could return some data even though no columns nor geometry is asked, but it still can't really think of a use for this either...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants