Handle Empty/Small Data DataFrames as a separate case #4605

Closed
naren-ponder opened this issue Jun 27, 2022 · 3 comments · Fixed by #7259 · May be fixed by #5113
Labels
empty dataframes and series 🚫 (Bugs having to do with empty dataframes and series), Epic, P1 (Important tasks that we should complete soon), pandas concordance 🐼 (Functionality that does not match pandas)

Comments

@naren-ponder
Collaborator

Is your feature request related to a problem? Please describe.

In our current approach, we default empty dataframes to pandas at the query compiler level, which adds some overhead and has led to bugs with empty dataframes (#4306, #4307). Ideally, we would default to pandas not only for empty dataframes but also for dataframes with so little data that distributing them costs more than it is worth.

@mvashishtha
Collaborator

For reference, #4191 and #4060 are also bugs coming from improper treatment of empty dataframes.

@billiam-wang
Collaborator

@modin-project/modin-core @modin-project/modin-contributors @RehanSD @vnlitvinov @anmyachev Currently, indexes are computed asynchronously, which makes it difficult to determine whether a dataframe is empty without waiting for the index computation to complete. Does anybody have suggestions on how to approach this problem?

Some ideas we have include making changes at the query compiler level, the API level, or the Modin core level wherever columns or rows may be added or removed.

@vnlitvinov
Collaborator

In most cases, axes are known, and I'm pretty sure most operations can be analyzed to see what effect they have on the axes, so in a typical case both axes would be known. We can simply assume that either we know the axes (and can use their sizes to decide which compiler to apply) or the dataframe is big.
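
To make that assumption concrete, here is a minimal sketch in plain Python. It is not Modin code: `AxisInfo`, `choose_backend`, and `SMALL_CELL_THRESHOLD` are hypothetical names, and the threshold value is arbitrary. The only point is that unknown axes fall through to the "big" path instead of blocking on the index.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff; a real value would have to come from benchmarking.
SMALL_CELL_THRESHOLD = 10_000


@dataclass(frozen=True)
class AxisInfo:
    """Axis lengths that may not be known yet (None = still being computed)."""

    n_rows: Optional[int] = None
    n_cols: Optional[int] = None


def choose_backend(axes: AxisInfo) -> str:
    """Return "pandas" for known-small frames and "distributed" otherwise."""
    if axes.n_rows is None or axes.n_cols is None:
        # Unknown axes: assume the frame is big rather than wait on the index.
        return "distributed"
    if axes.n_rows * axes.n_cols <= SMALL_CELL_THRESHOLD:
        # Empty frames land here too, since zero cells is below any threshold.
        return "pandas"
    return "distributed"
```

With this shape, an empty frame such as `AxisInfo(0, 5)` is routed to pandas, while `AxisInfo(None, 5)` is treated as big without any waiting.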

There are only a few operations whose resulting axes are unpredictable: filtering by some user-defined condition (like df[df.a == b]), running groupby operations, etc. All other operations could be analyzed in advance.
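
As a rough illustration of analyzing operations in advance, the sketch below propagates axis lengths through a handful of operations without touching any data. It is plain Python rather than Modin internals; `shape_after` and the operation names are hypothetical, `head` assumes a non-negative `n`, and `select_columns` assumes the requested columns exist and are unique. `None` marks a length that can only be learned by running the computation.

```python
from typing import Optional, Tuple

# (n_rows, n_cols); None means the length is not known without computing.
Shape = Tuple[Optional[int], Optional[int]]


def shape_after(op: str, shape: Shape, **kwargs) -> Shape:
    """Predict the output shape of `op` from the input shape alone."""
    n_rows, n_cols = shape
    if op == "head":
        n = kwargs["n"]  # assumed non-negative
        return (None if n_rows is None else min(n_rows, n), n_cols)
    if op == "select_columns":
        # Column count follows from the request; rows are unchanged.
        return (n_rows, len(kwargs["columns"]))
    if op == "transpose":
        return (n_cols, n_rows)
    if op == "filter":
        # A boolean mask keeps the columns, but the row count depends on data.
        return (None, n_cols)
    if op == "groupby":
        # Group count and output columns both depend on the data and aggregation.
        return (None, None)
    raise ValueError(f"unknown operation: {op}")
```

Chaining these calls keeps both axes known through most of a pipeline; only after a `filter` or `groupby` step does a length become `None`, at which point the frame would be treated as big.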

5 participants