
[data][bug] Dataset.context not being sealed after creation #41573

Open

raulchen opened this issue Dec 2, 2023 · 1 comment
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), ray-2.10, size-small, stability

Comments

raulchen (Contributor) commented Dec 2, 2023

Ideally, a dataset should capture the global DataContext when it is first created.
However, a lot of Ray Data code uses DataContext.get_current() instead of the captured DataContext.
This prevents using multiple datasets with different DataContexts.

#41569 is a first attempt to mitigate this issue, but it only fixes the issue for training jobs, where multiple datasets are propagated to different SplitCoordinator actors for execution.

If multiple Datasets run in the same process, this bug still exists.
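A minimal sketch of the failure mode, assuming execution reads the live global context (the helper name is hypothetical and stands in for Ray Data internals; block sizes are illustrative):

```python
import ray
from ray.data import DataContext

def block_size_seen_by_internal_code() -> int:
    # Buggy pattern: read the live global context instead of the
    # context the dataset captured at creation.
    return DataContext.get_current().target_max_block_size

ctx = DataContext.get_current()
ctx.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)  # ds1 should seal the 100MB context
ctx.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)  # ds2 seals the 1MB context

# When either dataset executes in this process, internal code calling
# DataContext.get_current() sees 1MB, so ds1's sealed value is ignored.
print(block_size_seen_by_internal_code())  # -> 1048576
```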

@raulchen raulchen added P1 Issue that should be fixed within a few weeks data Ray Data-related issues ray-2.10 labels Dec 2, 2023
raulchen added a commit that referenced this issue Dec 4, 2023
(#41569)

`Dataset.context` should be sealed the first time the Dataset is created. But if a new operator is applied to the dataset, the new global DataContext is saved to the Dataset again.

This bug prevents using different DataContexts for training and validation datasets in a training job.

Note that this PR only fixes the issue when multiple datasets are created in the same process but executed in different processes. If they run in the same process, the bug remains; see #41573.
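A minimal sketch of the re-capture behavior this PR describes (block sizes illustrative):

```python
import ray

context = ray.data.DataContext.get_current()
context.target_max_block_size = 100 * 1024 ** 2
ds = ray.data.range(1)  # ds seals the 100MB context here

context.target_max_block_size = 1 * 1024 ** 2
# Before this fix, applying a new operator saved the new global
# context to the resulting dataset, so ds2 saw 1MB instead of 100MB.
ds2 = ds.map(lambda row: row)
```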

---------

Signed-off-by: Hao Chen <[email protected]>
raulchen (Contributor, Author) commented Dec 6, 2023

After we fix this issue, #40116 should be reverted.

When iterating over a dataset with `iter_batches`, execution happens on the streaming split coordinator actor; that case should be fixed by #41569.
However, some APIs may trigger execution on the local training worker process. For example, `to_tf` calls `schema`, which triggers local execution. In that case, the DataContext being used is still incorrect.
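A hedged sketch of that local-execution path (assumes TensorFlow is installed; the column names are illustrative):

```python
import ray

context = ray.data.DataContext.get_current()
context.target_max_block_size = 100 * 1024 ** 2
ds = ray.data.from_items([{"x": i, "y": i % 2} for i in range(8)])

context.target_max_block_size = 1 * 1024 ** 2  # changed after ds was created

# `to_tf` calls `schema`, which can trigger execution locally in this
# process; that execution reads the current global DataContext (1MB)
# instead of the context `ds` captured at creation (100MB).
tf_ds = ds.to_tf(feature_columns="x", label_columns="y")
```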

@c21 c21 assigned raulchen and unassigned raulchen Dec 11, 2023
@anyscalesam anyscalesam added bug Something that is supposed to be working; but isn't stability labels Mar 5, 2024
raulchen added a commit that referenced this issue Dec 6, 2024
## Why are these changes needed?

When users use multiple datasets, they may want to set different DataContext
configurations for each. The recommended way is to modify
`DataContext.get_current()` before creating a Dataset; the DataContext is
supposed to be captured and sealed by the Dataset when it's created. For
example:

```python
import ray

context = ray.data.DataContext.get_current()

context.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)
context.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)

# ds1's target_max_block_size will be 100MB
ds1.take_all()
# ds2's target_max_block_size will be 1MB
ds2.take_all()
```

However, in Ray Data's internal code, `DataContext.get_current()` has been
widely used in an incorrect way. This PR fixes most outstanding issues
(but not all) by explicitly passing the captured DataContext object as an
argument to each component.
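A hedged sketch of the fix pattern (hypothetical function names, not Ray's actual internals):

```python
from ray.data import DataContext

# Buggy pattern: a component re-reads the global context at call time,
# which may have changed since the dataset was created.
def target_block_size_bad() -> int:
    return DataContext.get_current().target_max_block_size

# Fixed pattern: the caller passes the DataContext the dataset captured
# at creation (e.g. Dataset.context), so later global changes don't leak in.
def target_block_size_good(ctx: DataContext) -> int:
    return ctx.target_max_block_size
```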

## Related issue number
#41573

---------

Signed-off-by: Hao Chen <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue Dec 17, 2024