
[data][bug] Dataset.context not being sealed after creation #41573

Open

raulchen opened this issue Dec 2, 2023 · 1 comment
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), ray-2.10, size-small, stability

Comments

raulchen (Contributor) commented Dec 2, 2023

Ideally, a dataset should capture the global DataContext when it is first created.
However, a lot of Ray Data code uses DataContext.get_current() instead of the captured DataContext.
This prevents using multiple datasets with different DataContexts.

#41569 is a first attempt to mitigate this issue, but it only fixes the issue for training jobs, where multiple datasets are propagated to different SplitCoordinator actors for execution.

If multiple Datasets run in the same process, this bug still exists.
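A minimal sketch of the failure mode, assuming execution reads the live global context (the helper name is hypothetical and stands in for Ray Data internals; block sizes are illustrative):

```python
import ray
from ray.data import DataContext

def block_size_seen_by_internal_code() -> int:
    # Buggy pattern: read the live global context instead of the
    # context the dataset captured at creation.
    return DataContext.get_current().target_max_block_size

ctx = DataContext.get_current()
ctx.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)  # ds1 should seal the 100MB context
ctx.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)  # ds2 seals the 1MB context

# When either dataset executes in this process, internal code calling
# DataContext.get_current() sees 1MB, so ds1's sealed value is ignored.
print(block_size_seen_by_internal_code())  # -> 1048576
```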

@raulchen raulchen added P1 Issue that should be fixed within a few weeks data Ray Data-related issues ray-2.10 labels Dec 2, 2023
raulchen added a commit that referenced this issue Dec 4, 2023
(#41569)

`Dataset.context` should be sealed the first time the Dataset is created. But if a new operator is applied to the dataset, the new global DataContext is saved to the Dataset again.

This bug prevents using different DataContexts for training and validation datasets in a training job.

Note that this PR only fixes the issue when multiple datasets are created in the same process but executed in different processes. If they run in the same process, the bug remains; see #41573.
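A minimal sketch of the re-capture behavior this PR describes (block sizes illustrative):

```python
import ray

context = ray.data.DataContext.get_current()
context.target_max_block_size = 100 * 1024 ** 2
ds = ray.data.range(1)  # ds seals the 100MB context here

context.target_max_block_size = 1 * 1024 ** 2
# Before this fix, applying a new operator saved the new global
# context to the resulting dataset, so ds2 saw 1MB instead of 100MB.
ds2 = ds.map(lambda row: row)
```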

---------

Signed-off-by: Hao Chen <[email protected]>
raulchen (Contributor, Author) commented Dec 6, 2023

After we fix this issue, #40116 should be reverted.

When iterating over a dataset with `iter_batches`, execution happens on the streaming split coordinator actor; that case should be fixed by #41569.
However, some APIs may trigger execution on the local training worker process. For example, `to_tf` calls `schema`, which triggers local execution. In that case, the DataContext being used is still incorrect.
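A hedged sketch of that local-execution path (assumes TensorFlow is installed; the column names are illustrative):

```python
import ray

context = ray.data.DataContext.get_current()
context.target_max_block_size = 100 * 1024 ** 2
ds = ray.data.from_items([{"x": i, "y": i % 2} for i in range(8)])

context.target_max_block_size = 1 * 1024 ** 2  # changed after ds was created

# `to_tf` calls `schema`, which can trigger execution locally in this
# process; that execution reads the current global DataContext (1MB)
# instead of the context `ds` captured at creation (100MB).
tf_ds = ds.to_tf(feature_columns="x", label_columns="y")
```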

@c21 c21 assigned raulchen and unassigned raulchen Dec 11, 2023
@anyscalesam anyscalesam added bug Something that is supposed to be working; but isn't stability labels Mar 5, 2024
raulchen added a commit that referenced this issue Dec 6, 2024
## Why are these changes needed?

When users use multiple datasets, they may want to set different DataContext
configurations for each. The recommended way is to modify
`DataContext.get_current()` before creating a Dataset; the DataContext is
supposed to be captured and sealed by the Dataset when it's created. For
example:

```python
import ray

context = ray.data.DataContext.get_current()

context.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)
context.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)

# ds1's target_max_block_size will be 100MB
ds1.take_all()
# ds2's target_max_block_size will be 1MB
ds2.take_all()
```

However, in Ray Data's internal code, `DataContext.get_current()` has been
widely used in an incorrect way. This PR fixes most outstanding issues
(but not all) by explicitly passing the captured DataContext object as an
argument to each component.
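A hedged sketch of the fix pattern (hypothetical function names, not Ray's actual internals):

```python
from ray.data import DataContext

# Buggy pattern: a component re-reads the global context at call time,
# which may have changed since the dataset was created.
def target_block_size_bad() -> int:
    return DataContext.get_current().target_max_block_size

# Fixed pattern: the caller passes the DataContext the dataset captured
# at creation (e.g. Dataset.context), so later global changes don't leak in.
def target_block_size_good(ctx: DataContext) -> int:
    return ctx.target_max_block_size
```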

## Related issue number
#41573

---------

Signed-off-by: Hao Chen <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this issue Dec 17, 2024