Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune](deps): Bump accelerate from 0.3.0 to 0.5.1 in /python/requirements/tune #2

Merged

Conversation

dependabot[bot]
Copy link

@dependabot dependabot bot commented on behalf of github Nov 13, 2021

Bumps accelerate from 0.3.0 to 0.5.1.

Release notes

Sourced from accelerate's releases.

v0.5.1: Patch release

Fix the two following bugs:

  • convert_to_fp32 returned booleans instead of tensors #173
  • wrong dataloader lenght when dispatch_batches=True #175

v0.5.0 Dispatch batches from main DataLoader

This release introduces support for iterating through a DataLoader only on the main process, that then dispatches the batches to all processes.

Dispatch batches from main DataLoader

The motivation behind this come from dataset streaming which introduces two difficulties:

  • there might be some timeouts for some elements of the dataset, which might then be different in each process launched, thus it's impossible to make sure the data is iterated though the same way on each process
  • when using IterableDataset, each process goes through the dataset, thus applies the preprocessing on all elements. This can yield to the training being slowed down by this preprocessing.

This new feature is activated by default for all IterableDataset.

Various fixes

v0.4.0 Experimental DeepSpeed and multi-node CPU support

v0.4.0 Experimental DeepSpeed support

This release adds support for DeepSpeed. While the basics are there to support ZeRO-2, ZeRo-3, as well a CPU and NVME offload, the API might evolve a little bit as we polish it in the near future.

It also adds support for multi-node CPU. In both cases, just filling the questionnaire outputted by accelerate config and then launching your script with accelerate launch is enough, there are no changes in the main API.

DeepSpeed support

Multinode CPU support

  • Add distributed multi-node cpu only support (MULTI_CPU) #63 (@​ddkalamk)

Various fixes

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [accelerate](https://github.com/huggingface/accelerate) from 0.3.0 to 0.5.1.
- [Release notes](https://github.com/huggingface/accelerate/releases)
- [Commits](huggingface/accelerate@v0.3.0...v0.5.1)

---
updated-dependencies:
- dependency-name: accelerate
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies Pull requests that update a dependency file label Nov 13, 2021
simonsays1980 pushed a commit that referenced this pull request Nov 14, 2021
simonsays1980 pushed a commit that referenced this pull request Feb 5, 2022
@dependabot @github
Copy link
Author

dependabot bot commented on behalf of github Feb 5, 2022

Dependabot tried to update this pull request, but something went wrong. We're looking into it, but in the meantime you can retry the update by commenting @dependabot rebase.

@simonsays1980
Copy link
Owner

@dependabot rebase

@dependabot @github
Copy link
Author

dependabot bot commented on behalf of github Feb 6, 2022

Dependabot tried to update this pull request, but something went wrong. We're looking into it, but in the meantime you can retry the update by commenting @dependabot rebase.

simonsays1980 pushed a commit that referenced this pull request Feb 27, 2022
…roject#22317)

Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be:
```
  // The task is waiting for its dependencies to be created.
  WAITING_FOR_DEPENDENCIES = 1;
  // All dependencies have been created and the task is scheduled to execute.
  SCHEDULED = 2;
  // The task finished successfully.
  FINISHED = 3;
```

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:
```
IP Address    | PID      | Type    | Call Site | Status    | Size     | Reference Type | Object Ref
192.168.4.22  | 279475   | Driver  | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000


```
@dependabot @github
Copy link
Author

dependabot bot commented on behalf of github Feb 27, 2022

Dependabot tried to update this pull request, but something went wrong. We're looking into it, but in the meantime you can retry the update by commenting @dependabot rebase.

1 similar comment
@dependabot @github
Copy link
Author

dependabot bot commented on behalf of github Feb 27, 2022

Dependabot tried to update this pull request, but something went wrong. We're looking into it, but in the meantime you can retry the update by commenting @dependabot rebase.

Copy link
Owner

@simonsays1980 simonsays1980 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@simonsays1980
Copy link
Owner

@dependabot rebase

@dependabot @github
Copy link
Author

dependabot bot commented on behalf of github Feb 27, 2022

Dependabot tried to update this pull request, but something went wrong. We're looking into it, but in the meantime you can retry the update by commenting @dependabot rebase.

@simonsays1980 simonsays1980 merged commit 56efc5b into master Feb 27, 2022
@dependabot dependabot bot deleted the dependabot/pip/python/requirements/tune/accelerate-0.5.1 branch February 27, 2022 23:27
simonsays1980 pushed a commit that referenced this pull request Feb 27, 2022
…roject#22317)

Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be:
```
  // The task is waiting for its dependencies to be created.
  WAITING_FOR_DEPENDENCIES = 1;
  // All dependencies have been created and the task is scheduled to execute.
  SCHEDULED = 2;
  // The task finished successfully.
  FINISHED = 3;
```

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:
```
IP Address    | PID      | Type    | Call Site | Status    | Size     | Reference Type | Object Ref
192.168.4.22  | 279475   | Driver  | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000


```
simonsays1980 pushed a commit that referenced this pull request Apr 20, 2022
…ray-project#23821)

This PR refactors `LazyBlockList` in service of out-of-band serialization (see [mono-PR](ray-project#22616)) and is a precursor to an execution plan refactor (PR #2) and adding the actual out-of-band serialization APIs (PR #3). The following is included in this refactor:
1. `ReadTask`s are now a first-class concept, replacing calls;
2. read stage progress tracking is consolidated into `LazyBlockList._get_blocks_with_metadta()` and more of the read task complexity, e.g. the read remote function, was pushed into `LazyBlockList` to make `ray.data.read_datasource()` simpler;
3. we are a bit smarter with how we progressively launch tasks and fetch and cache metadata, including fetching the metadata for read tasks in `.iter_blocks_with_metadata()` instead of relying on the pre-read task metadata (which will be less accurate), and we also fix some small bugs in the lazy ramp-up around progressive metadata fetching.

(1) is the most important item for supporting out-of-band serialization and fundamentally changes the `LazyBlockList` data model. This is required since we need to be able to reference the underlying read tasks when rewriting read stages during optimization and when serializing the lineage of the Dataset. See the [mono-PR](ray-project#22616) for more context.

Other changes:
1. Changed stats actor to a global named actor singleton in order to obviate the need for serializing the actor handle with the Dataset stats; without this, we were encountering serialization failures.
simonsays1980 pushed a commit that referenced this pull request Aug 21, 2022
We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe.
simonsays1980 pushed a commit that referenced this pull request Jul 14, 2023
Why are these changes needed?

Right now the theory is as follow.

pubsub io service is created and run inside the GcsServer. That means if pubsub io service is accessed after GCSServer GC'ed, it will segfault.
Right now, upon teardown, when we call rpc::DrainAndResetExecutor, this will recreate the Executor thread pool.
Upon teardown, If DrainAndResetExecutor -> GcsServer's internal pubsub posts new SendReply to the newly created threadpool -> GcsServer.reset -> pubsub io service GC'ed -> SendReply invoked from the newly created thread pool, it will segfault.
NOTE: the segfault is from pubsub service if you see the failure

#2 0x7f92034d9129 in ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberPollRequest, ray::rpc::GcsSubscriberPollReply>::HandleRequestImpl()::'lambda'(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)::operator()(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>) const::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/grpc_common_lib/ray/rpc/server_call.h:212:48
As a fix, I only drain the thread pool. And then reset it after all operations are fully cleaned up (only from tests). I think there's no need to reset for regular proc termination like raylet, gcs, core workers.

Related issue number

Closes ray-project#34344

Signed-off-by: SangBin Cho <[email protected]>
simonsays1980 pushed a commit that referenced this pull request Nov 8, 2024
missing comma related to ray-project#48564
Had to try again because @can-anyscale suspects a rare ReadTheDocs
failure and there's no easy way to kick off a rebuild from the browser.

Signed-off-by: angelinalg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant