-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4580: Fix access by row label in query and eval #6488
FIX-#4580: Fix access by row label in query and eval #6488
Conversation
Signed-off-by: mvashishtha <[email protected]>
Signed-off-by: mvashishtha <[email protected]>
Signed-off-by: mvashishtha <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @mvashishtha!
Signed-off-by: mvashishtha <[email protected]>
Signed-off-by: mvashishtha <[email protected]>
Signed-off-by: mvashishtha <[email protected]>
def query(self, expr, **kwargs): | ||
def query_builder(df, **modin_internal_kwargs): | ||
return df.query(expr, inplace=False, **kwargs, **modin_internal_kwargs) | ||
|
||
return self.__constructor__(self._modin_frame.filter(1, query_builder)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this PR we see about ~1.6x perf degradation for one of our internal workloads that uses simple queries a lot.
@mvashishtha can you please check #6499?
What do these changes do?
query and eval raise exceptions as in #4580 on partitioned execution engines when given expressions that access elements by row label. I don't see a way to do distributed execution for query and eval other than doing what pandas does when using the
python
engine, which is to evaluate the expression as python (and then pass it intodf.loc[result, :]
forquery
). That's what I do in this PR.So far I have been able to use
pandas.DataFrame.query
andpandas.DataFrame.eval
directly without copying code. I prefer that for now because we would have to copy many methods.Supporting the
numexpr
engine this way will take more work.flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date