[data][bug] Dataset.context not being sealed after creation #41573
raulchen added a commit that referenced this issue on Dec 4, 2023:
(#41569)

`Dataset.context` should be sealed the first time the Dataset is created. But if a new operator is applied to the dataset, the new global DataContext is saved to the Dataset again. This bug prevents using different DataContexts for the training and validation datasets in a training job.

Note that this PR only fixes the issue when multiple datasets are created in the same process but run in different processes. If they run in the same process, it's still a bug; see #41573.

---------

Signed-off-by: Hao Chen <[email protected]>
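A minimal sketch of the pre-fix failure mode this commit message describes, assuming a Ray installation; the block-size values are illustrative:

```python
import ray

ctx = ray.data.DataContext.get_current()
ctx.target_max_block_size = 100 * 1024**2
ds = ray.data.range(10)  # ds seals the 100 MiB context at creation

ctx.target_max_block_size = 1 * 1024**2
# Before the fix, applying a new operator re-captured the (now mutated)
# global DataContext, so ds.context silently dropped to 1 MiB.
ds = ds.map(lambda row: row)
```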
After we fix this issue, #40116 should be reverted.
raulchen added a commit that referenced this issue on Dec 6, 2024:
## Why are these changes needed?

When users use multiple datasets and want to set different DataContext configurations, the recommended way is to configure `DataContext.get_current()` before creating a Dataset. The DataContext is supposed to be captured and sealed by a Dataset when it is created. For example:

```python
import ray

context = ray.data.DataContext.get_current()

context.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)

context.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)

# ds1's target_max_block_size will be 100MB
ds1.take_all()

# ds2's target_max_block_size will be 1MB
ds2.take_all()
```

However, in Ray Data internal code, `DataContext.get_current()` has been widely used in an incorrect way. This PR fixes most outstanding issues (but not all) by explicitly passing the captured DataContext object as an argument to each component.

## Related issue number

#41573

---------

Signed-off-by: Hao Chen <[email protected]>
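A hedged sketch of the "pass the captured context explicitly" pattern this PR describes. The names here (the `DataContext` stand-in, `PhysicalOperator`, `data_context`) are illustrative assumptions, not the actual Ray internals:

```python
from dataclasses import dataclass


# Illustrative stand-in for ray.data.DataContext.
@dataclass
class DataContext:
    target_max_block_size: int = 128 * 1024**2


# Hypothetical component, named after Ray Data's operator concept.
class PhysicalOperator:
    def __init__(self, data_context: DataContext):
        # The sealed context travels with the operator, so later
        # mutations of the global context cannot leak into this
        # dataset's execution.
        self._data_context = data_context

    def run(self) -> int:
        # Reads the captured context, never DataContext.get_current().
        return self._data_context.target_max_block_size


ctx = DataContext(target_max_block_size=100 * 1024**2)
op = PhysicalOperator(data_context=ctx)
assert op.run() == 100 * 1024**2
```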
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue on Dec 17, 2024:
Ideally, a dataset should capture the global DataContext when the dataset is created for the first time. However, a lot of Ray Data code uses `DataContext.get_current()` instead of the captured DataContext. This prevents using multiple datasets with different DataContexts.

#41569 is the first attempt to mitigate this issue, but it only fixes the problem for training jobs, where multiple datasets are propagated to different SplitCoordinator actors for execution. If multiple Datasets run in the same process, the bug still exists.
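A minimal sketch of the same-process case, adapted from the example in the commit message above; the values are illustrative and the comments describe pre-fix behavior:

```python
import ray

context = ray.data.DataContext.get_current()

context.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)

context.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)

# Intended: ds1 executes with 100 MiB blocks, ds2 with 1 MiB blocks.
# Pre-fix: internal code that calls DataContext.get_current() at
# execution time sees the latest global value (1 MiB) for both
# datasets, because both execute in the same driver process.
ds1.take_all()
ds2.take_all()
```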