-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#6552: avoid FutureWarning
s in groupby
unless necessary
#6595
Changes from all commits
525556a
aa9fbcc
390ed5e
5ea7f2b
b1d6e2a
b36c35f
df557a9
0c83014
97b0661
e9dae20
50667d1
997b2ea
2322d9b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,7 @@ | |
Memory, | ||
NPartitions, | ||
) | ||
from modin.core.execution.utils import set_env | ||
from modin.error_message import ErrorMessage | ||
|
||
|
||
|
@@ -32,6 +33,14 @@ def initialize_dask(): | |
|
||
try: | ||
client = default_client() | ||
|
||
def _disable_warnings(): | ||
import warnings | ||
|
||
warnings.simplefilter("ignore", category=FutureWarning) | ||
|
||
client.run(_disable_warnings) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need to disable the warnings here, whereas we use the context manager below to enable the warnings? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this section we connect to the already created Dask cluster. Using an environment variable will not help here, since the processes are already running. The context manager below does not enable warnings, but rather disables a certain category of warnings using an environment variable. |
||
|
||
except ValueError: | ||
from distributed import Client | ||
|
||
|
@@ -47,7 +56,11 @@ def initialize_dask(): | |
num_cpus = CpuCount.get() | ||
memory_limit = Memory.get() | ||
worker_memory_limit = memory_limit // num_cpus if memory_limit else "auto" | ||
client = Client(n_workers=num_cpus, memory_limit=worker_memory_limit) | ||
|
||
# when the client is initialized, environment variables are inherited | ||
with set_env(PYTHONWARNINGS="ignore::FutureWarning"): | ||
client = Client(n_workers=num_cpus, memory_limit=worker_memory_limit) | ||
|
||
if GithubCI.get(): | ||
# set these keys to run tests that write to the mock s3 service. this seems | ||
# to be the way to pass environment variables to the workers: | ||
|
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -36,6 +36,7 @@ | |||
StorageFormat, | ||||
ValueSource, | ||||
) | ||||
from modin.core.execution.utils import set_env | ||||
from modin.error_message import ErrorMessage | ||||
|
||||
from .engine_wrapper import RayWrapper | ||||
|
@@ -82,7 +83,10 @@ def initialize_ray( | |||
# the `pandas` module has been fully imported inside of each process before | ||||
# any execution begins: | ||||
# https://github.com/modin-project/modin/pull/4603 | ||||
env_vars = {"__MODIN_AUTOIMPORT_PANDAS__": "1"} | ||||
env_vars = { | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What if we connect to an already running Ray cluster? Will we see FutureWarnings? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unfortunately, yes, I don’t know of a way to change the state of an already running cluster on Ray. |
||||
"__MODIN_AUTOIMPORT_PANDAS__": "1", | ||||
"PYTHONWARNINGS": "ignore::FutureWarning", | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm wondering why modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py Line 379 in ea8088a
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In the case of distribution engines, the environment variable approach is simpler and more reliable (we do not need to keep track of all the call points of pandas functions). Therefore, the question here is why we should not use this approach where it is possible. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll try to do this for the unidist, while setting variables does not work without There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||||
} | ||||
if GithubCI.get(): | ||||
# need these to write parquet to the moto service mocking s3. | ||||
env_vars.update( | ||||
|
@@ -143,9 +147,8 @@ def initialize_ray( | |||
# time and doesn't enforce us with any overhead that Ray's native `runtime_env` | ||||
# is usually causing. You can visit this gh-issue for more info: | ||||
# https://github.com/modin-project/modin/issues/5157#issuecomment-1500225150 | ||||
for key, value in env_vars.items(): | ||||
os.environ[key] = value | ||||
ray.init(**ray_init_kwargs) | ||||
with set_env(**env_vars): | ||||
ray.init(**ray_init_kwargs) | ||||
|
||||
if StorageFormat.get() == "Cudf": | ||||
from modin.core.execution.ray.implementations.cudf_on_ray.partitioning import ( | ||||
|
@@ -163,12 +166,7 @@ def initialize_ray( | |||
runtime_env_vars = ray.get_runtime_context().runtime_env.get("env_vars", {}) | ||||
for varname, varvalue in env_vars.items(): | ||||
if str(runtime_env_vars.get(varname, "")) != str(varvalue): | ||||
if is_cluster or ( | ||||
# Here we relax our requirements for a non-cluster case allowing for the `env_vars` | ||||
# to be set at least as a process environment variable | ||||
not is_cluster | ||||
and os.environ.get(varname, "") != str(varvalue) | ||||
Comment on lines
-167
to
-170
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dchigarev Now all environment variables are set temporarily before Ray is initialized (in case of a local cluster), so this check will not work. This seems to be enough for Ray's processes to start using these variables. I don't see any particular need to use these environment variables for the main process either. What do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. agree |
||||
): | ||||
if is_cluster: | ||||
ErrorMessage.single_warning( | ||||
"When using a pre-initialized Ray cluster, please ensure that the runtime env " | ||||
+ f"sets environment variable {varname} to {varvalue}" | ||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -45,7 +45,8 @@ def initialize_unidist(): | |
unidist.init() | ||
""", | ||
) | ||
|
||
# TODO: allow unidist to inherit env variables on initialization | ||
# with set_env(PYTHONWARNINGS="ignore::FutureWarning"): | ||
Comment on lines
+48
to
+49
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @YarShev is it possible to make that change? I don't want to force users to set environment variables manually. In addition, in the case of warnings, this environment variable is only needed for worker processes, and not for main one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change should work when running unidist on MPI with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
unidist.init() | ||
|
||
num_cpus = sum(v["CPU"] for v in unidist.cluster_resources().values()) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Licensed to Modin Development Team under one or more contributor license agreements. | ||
# See the NOTICE file distributed with this work for additional information regarding | ||
# copyright ownership. The Modin Development Team licenses this file to you under the | ||
# Apache License, Version 2.0 (the "License"); you may not use this file except in | ||
# compliance with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software distributed under | ||
# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF | ||
# ANY KIND, either express or implied. See the License for the specific language | ||
# governing permissions and limitations under the License. | ||
|
||
"""General utils for execution module.""" | ||
|
||
import contextlib | ||
import os | ||
|
||
|
||
@contextlib.contextmanager | ||
def set_env(**environ): | ||
""" | ||
Temporarily set the process environment variables. | ||
""" | ||
old_environ = os.environ.copy() | ||
os.environ.update(environ) | ||
try: | ||
yield | ||
finally: | ||
os.environ.clear() | ||
os.environ.update(old_environ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid: