
Ray launcher WIP #518

Merged · 85 commits · Oct 28, 2020

Conversation

jieru-hu
Contributor

@jieru-hu jieru-hu commented Apr 8, 2020

Update 10/14
Follow up items:

  • add custom resolver to get lib versions.
  • clean up orphaned instances.

Update 09/30

A few items I want to follow up separately (will create PR for these)

  1. Cannot run the plugin on python 3.6 (cloudpickle issue, same error message as _pickle.PicklingError: Can't pickle typing.Union[str, NoneType]: it's not the same object as typing.Union #428; failed run here https://app.circleci.com/pipelines/github/jieru-hu/hydra/611/workflows/0d99e3b0-1442-446f-857b-f476a7707b6d/jobs/6648). I tried a few things (updating the cloudpickle version, etc.) but was not able to resolve it.
  2. Nightly builds test AMIs (right now the AMI id is set as an env variable, which is annoying every time we need to update it; I want to automate this).
  3. Doc update. The doc needs a bit of a refresh; I think it is easier to do this in a separate PR.
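As a minimal illustration of item 1 (not the plugin's actual code): typing generics only became reliably picklable in python 3.7+, which is roughly what the 3.6 failure above is about. On 3.8 this round-trip succeeds; under 3.6, cloudpickle hits the PicklingError quoted above.

```python
import pickle
import typing

# typing.Optional[str] is typing.Union[str, None] -- the exact type named
# in the PicklingError above. On python >= 3.7 it round-trips through
# pickle; on 3.6, cloudpickle fails with "it's not the same object".
ann = typing.Optional[str]
restored = pickle.loads(pickle.dumps(ann))
assert restored == ann
```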

Update 09/28

Summary of the changes:

1. Address omry's comments.

2. Changes to integration test:

The goal is "no outbound traffic for the test instances." The barrier is the pip install and conda create we need to run while setting up the instance, which requires opening port 443 to all outbound traffic.
To get around this: 1) for conda and the dependency packages needed to start the cluster, I created a base AMI with everything pre-installed; 2) for Hydra-related packages, we build the wheels at test time and install them on the instance.
The upside is that we achieve "no outbound traffic for the instance"; the downside is that we need to update the AMI whenever dependencies change. To help with that I created a script
(create_ami.py) to automate building the AMI.

It would be good to build nightly AMIs and wheels, that's something I want to work on soon.

Output from running `create_ami.py`:
$ AWS_PROFILE=jieru python create_ami.py 
2020-09-28 16:23:56.268051 - Running: aws ec2 authorize-security-group-egress --group-id sg-0a1 --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]
2020-09-28 16:23:57.464861 - 
2020-09-28 16:23:57.487688 - Running: ray up /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpjw6lihef.yaml -y
2020-09-28 16:25:57.400029 - 2020-09-28 16:23:58,540    INFO cli_logger.py:388 -- Using cached config at /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/ray-config-d951a214f8602b878335411b5df6e84af463922b

2020-09-28 16:25:57.462567 - Running: ray rsync_up /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpjw6lihef.yaml './setup_ami.py' '/home/ubuntu/' 
2020-09-28 16:25:59.210039 - 2020-09-28 16:25:58,320    INFO cli_logger.py:388 -- Using cached config at /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/ray-config-d951a214f8602b878335411b5df6e84af463922b
2020-09-28 16:25:58,736 INFO cli_logger.py:388 -- NodeUpdater: i-0d82abc901a725abc: Syncing ./setup_ami.py to /home/ubuntu/...
2020-09-28 16:25:58,922 INFO log_timer.py:25 -- NodeUpdater: i-0d82abc901a725abc: Got IP  [LogTimer=186ms]
building file list ... done
setup_ami.py

sent 692 bytes  received 42 bytes  489.33 bytes/sec
total size is 1345  speedup is 1.83
Installing dependencies now, this may take a while...
2020-09-28 16:25:59.210121 - Running: ray exec /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpjw6lihef.yaml 'python ./setup_ami.py' 
...
Installing collected packages: typing-extensions, omegaconf
Successfully installed omegaconf-2.0.2 typing-extensions-3.7.4.3
2020-09-28 23:45:09.147903 - OUT: /home/ubuntu/anaconda3/envs/hydra_3.8.5/bin/pip install antlr4-python3-runtime==4.8
2020-09-28 23:45:09.927774 - OUT: Processing ./.cache/pip/wheels/c8/d0/ab/d43c02eaddc5b9004db86950802442ad9a26f279c619e28da0/antlr4_python3_runtime-4.8-py3-none-any.whl
Installing collected packages: antlr4-python3-runtime
Successfully installed antlr4-python3-runtime-4.8
2020-09-28 23:45:09.927847 - OUT: /home/ubuntu/anaconda3/envs/hydra_3.8.5/bin/pip install --ignore-installed PyYAML
2020-09-28 23:45:10.798501 - OUT: Processing ./.cache/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c/PyYAML-5.3.1-cp38-cp38-linux_x86_64.whl
Installing collected packages: PyYAML
Successfully installed PyYAML-5.3.1
Shared connection to 34.221.119.106 closed.
2020-09-28 16:45:11.007294 - Running: aws ec2 revoke-security-group-egress --group-id sg-0a1 --ip-permissions IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]
2020-09-28 16:45:12.047395 - 
2020-09-28 16:45:12.047395 - 
ec2.Image(id='ami-0c46') current state pending
ec2.Image(id='ami-0c46') current state pending
...
ami-0c46 ready for use now.
3. Skip the `-Werror` flag for the ray launcher. The tests fail with the flag enabled; see the stack trace below (this is caused by ray, not the plugin itself). The solution is to add a pytest.ini in ray's tests dir to suppress the warnings.
Stack trace
test_ray_local_launcher.py .[2020-09-28 21:52:46,553][HYDRA] Ray Launcher is launching 1 jobs, sweep output dir: /private/var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/pytest-of-jieru/pytest-25/test_sweep_1_job_ray_local_ove0
[2020-09-28 21:52:46,553][HYDRA] Initializing ray with config: {'num_cpus': 1, 'num_gpus': 0}
2020-09-28 21:52:46,564 INFO resource_spec.py:223 -- Starting Ray with 8.79 GiB memory available for workers and up to 4.41 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis-shard_0.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/redis.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
    self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
    self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/gcs_server.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
    self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
    self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-46_557592_90415/logs/monitor.err' mode='a' encoding='utf-8'>
F[2020-09-28 21:52:47,565][HYDRA] Ray Launcher is launching 2 jobs, sweep output dir: /private/var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/pytest-of-jieru/pytest-25/test_sweep_2_jobs_ray_local_ov0
[2020-09-28 21:52:47,565][HYDRA] Initializing ray with config: {'num_cpus': 1, 'num_gpus': 0}
2020-09-28 21:52:47,573 INFO resource_spec.py:223 -- Starting Ray with 8.74 GiB memory available for workers and up to 4.38 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis-shard_0.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 746, in start_head_processes
    self.start_redis()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/redis.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
    self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 748, in start_head_processes
    self.start_gcs_server()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/gcs_server.err' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.out' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
    self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.out' mode='a' encoding='utf-8'>
Exception ignored in: <_io.FileIO name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.err' mode='ab' closefd=True>
Traceback (most recent call last):
  File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/node.py", line 750, in start_head_processes
    self.start_monitor()
ResourceWarning: unclosed file <_io.TextIOWrapper name='/tmp/ray/session_2020-09-28_21-52-47_566018_90415/logs/monitor.err' mode='a' encoding='utf-8'>

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/main.py", line 191, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/main.py", line 247, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR>     self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/main.py", line 272, in pytest_runtestloop
INTERNALERROR>     item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR>     self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 85, in pytest_runtest_protocol
INTERNALERROR>     runtestprotocol(item, nextitem=nextitem)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 100, in runtestprotocol
INTERNALERROR>     reports.append(call_and_report(item, "call", log))
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 188, in call_and_report
INTERNALERROR>     report = hook.pytest_runtest_makereport(item=item, call=call)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR>     self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 203, in _multicall
INTERNALERROR>     gen.send(outcome)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/skipping.py", line 129, in pytest_runtest_makereport
INTERNALERROR>     rep = outcome.get_result()
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/runner.py", line 260, in pytest_runtest_makereport
INTERNALERROR>     return TestReport.from_item_and_call(item, call)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/reports.py", line 294, in from_item_and_call
INTERNALERROR>     longrepr = item.repr_failure(excinfo)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/python.py", line 1511, in repr_failure
INTERNALERROR>     return self._repr_failure_py(excinfo, style=style)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/nodes.py", line 355, in _repr_failure_py
INTERNALERROR>     return excinfo.getrepr(
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 635, in getrepr
INTERNALERROR>     return fmt.repr_excinfo(self)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 880, in repr_excinfo
INTERNALERROR>     reprtraceback = self.repr_traceback(excinfo_)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 824, in repr_traceback
INTERNALERROR>     reprentry = self.repr_traceback_entry(entry, einfo)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 774, in repr_traceback_entry
INTERNALERROR>     source = self._getentrysource(entry)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 685, in _getentrysource
INTERNALERROR>     source = entry.getsource(self.astcache)
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/code.py", line 246, in getsource
INTERNALERROR>     astnode, _, end = getstatementrange_ast(
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/_pytest/_code/source.py", line 384, in getstatementrange_ast
INTERNALERROR>     astnode = ast.parse(content, "source", "exec")
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/ast.py", line 47, in parse
INTERNALERROR>     return compile(source, filename, mode, flags,
INTERNALERROR>   File "/Users/jieru/opt/anaconda3/envs/pytest38/lib/python3.8/site-packages/ray/worker.py", line 869, in sigterm_handler
INTERNALERROR>     sys.exit(signum)
INTERNALERROR> SystemExit: 15
mainloop: caught unexpected SystemExit!
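For reference, the pytest.ini mentioned in item 3 could look something like this (the exact filter expression is an assumption; the warnings in the trace above are ResourceWarnings raised from ray's file handling):

```ini
[pytest]
filterwarnings =
    ignore::ResourceWarning
```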

4. Install conda in the circleCI linux docker image.

Previously we pinned the circleCI linux docker image to python:3.8. However, that image runs python 3.8.6, which is not yet available in conda. As a result, the ray launcher tests fail (cloudpickle requires the exact same python version on the pickling and unpickling sides).
To be consistent with how tests are run on macOS and Windows, I added the miniconda installation for linux machines as well. The installation takes a few seconds, so I didn't add caching for it.
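A tiny sketch of the constraint described above (the helper name is illustrative, not part of the plugin): the launcher side and the instance side have to run the exact same interpreter before any cloudpickle payload is exchanged.

```python
import sys

def check_pickle_compat(remote_version):
    # cloudpickle payloads are only safe to load on the exact same python
    # version that produced them (e.g. a 3.8.5 <-> 3.8.6 mismatch breaks,
    # which is what the conda pin above works around).
    local = sys.version_info[:3]
    if tuple(remote_version) != local:
        raise RuntimeError(
            f"python mismatch: local {local} vs remote {tuple(remote_version)}"
        )

check_pickle_compat(sys.version_info[:3])  # same interpreter: no error
```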

Update 09/21

Now that #815 has finally landed, this PR is unblocked! This is my priority this week.
TODO items.

  • rebase onto latest master
  • upload latest wheels to S3 for installation during integration tests.
    In order to upload and install the latest wheels in the integration tests, I want to:
  1. list all the plugins that are going to be tested (by reading the PLUGINS env variable), build wheels, and scp them all to the ec2 instance.
  2. install all the wheels on the ec2 instance.
    This way, we can remove all outbound traffic from the testing ec2 instances.
  • Address omry's comments
  • Update circleCI test user creds and finish all the TODO items outlined in the proposal quip.

Update 09/08
edit: moved the TODO items to the latest update.

Update 09/01

This PR has been blocked by #815. Now that we've figured out a good solution there, I will go ahead and get #815 in first and then circle back here.

Also, I'm going to create data classes for the ray init, ray remote, and boto configs (we will only add typing for common boto fields; the boto config will extend Dict[str, Any]).
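A hedged sketch of what those data classes might look like (field names are illustrative, not the plugin's actual schema): typed fields for the common knobs, with everything else carried in an untyped dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class RayInitConf:
    # common ray.init() knobs get real types...
    num_cpus: Optional[int] = None
    num_gpus: Optional[int] = None

@dataclass
class BotoConf:
    # ...while boto only types the common fields; anything else rides
    # along untyped, playing the role of "extends Dict[str, Any]".
    region_name: Optional[str] = None
    extra: Dict[str, Any] = field(default_factory=dict)

conf = BotoConf(region_name="us-west-2", extra={"profile_name": "test"})
```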

For the integration tests to work on the latest code, we are going to build wheels and upload them to S3 with each integration test run. This will likely be a separate PR.


Motivation

This is built on #515; sorry, I had to open a new pull request. I still need to figure out a better workflow for forking & syncing.

This PR addresses some comments from #515:

  1. Add local mode for ray launcher
  2. Add docker options
  3. Refactor the Launcher class, grouping all file syncing together.

Plan to add in next PR(s):

  1. Add Integration test
  2. Add an option for users to update the cluster if they need.
  3. Remote cluster return results to laptop.
  4. make _dump_func_params better/less hard coded.

Update 04/10

  • Added integration tests.
  • `ray up` automatically updates the cluster, so there's no need for us to provide an update option.
  • JobReturns are now copied back to the laptop.
  • Refactored _dump_func_params a bit more.

Next:

  1. Fix LOCAL mode to run RAY directly.
  2. Add integration tests for both LOCAL and AWS mode.

Have you read the Contributing Guidelines on pull requests?

Yes/No

Test Plan

Integration tests
Run launcher in both local and AWS mode

Run the

Related Issues and PRs


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 8, 2020
@jieru-hu jieru-hu mentioned this pull request Apr 8, 2020
@lgtm-com
Contributor

lgtm-com bot commented Apr 8, 2020

This pull request introduces 1 alert when merging 90ff502 into 13d663c - view on LGTM.com

new alerts:

  • 1 for Unused import

@jan-matthis
Contributor

jan-matthis commented Apr 9, 2020

Great to see a launcher for ray forthcoming!

Two brief comments:

  1. What I'm unclear about is why you dump a pickle with task_function and the sweep parameters to disk, then rsync, then ssh into the head node to execute, rather than just invoking ray.remote directly in the main launch function and letting ray handle the rest.
  2. Perhaps you have seen that ray has a joblib backend and Hydra has a joblib plugin. I haven't tested the ray backend and its limitations, but possibly the joblib plugin can work with a ray cluster with some tiny modifications detailed in ray's docs. What's very nice about your plugin is that it provides options to configure a cluster and bring it up and down, which wouldn't be covered by the joblib plugin at all. But perhaps there can be synergy here.

@omry
Collaborator

omry commented Apr 9, 2020

Great to see a launcher for ray forthcoming!

Yup, I am excited too!

Two brief comments:

What I'm unclear about is why you dump a pickle with task_function and the sweep parameters to disk, then rsync, then ssh into the head node to execute, rather than just invoking ray.remote directly in the main launch function and letting ray handle the rest.

Unfortunately, Ray's support for remote cluster execution is not transparent.
Currently it only works transparently when you call it from the head node (a special machine on your AWS cluster).
They use ray submit to launch to a remote cluster, which does something similar to what this plugin is doing, and then call ray.remote on the head node.
This is something they may improve based on our feedback, but for now we have to do it the hard way.
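To make that concrete, here is a toy, local-only sketch of the dump-and-ship pattern described above (all names are illustrative; the real plugin uses cloudpickle, rsync, and ray exec — here a subprocess stands in for the head node, and only data is pickled, since stdlib pickle can't ship functions defined in __main__):

```python
import os
import pickle
import subprocess
import sys
import tempfile

# 1. Serialize the sweep payload locally (the plugin also pickles the
#    task function itself, via cloudpickle).
payload = {"overrides": ["a=1", "b=2"]}
with tempfile.NamedTemporaryFile("wb", suffix=".pkl", delete=False) as f:
    pickle.dump(payload, f)
    path = f.name

# 2. "Ship" it and 3. execute remotely -- a local subprocess stands in
#    for `ray rsync_up` + `ray exec` on the head node.
runner = (
    "import pickle;"
    f"d = pickle.load(open({path!r}, 'rb'));"
    "print(','.join(sorted(d['overrides'])))"
)
out = subprocess.run([sys.executable, "-c", runner], capture_output=True, text=True)
os.unlink(path)
assert out.stdout.strip() == "a=1,b=2"
```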

Perhaps you have seen that ray has a joblib backend and Hydra has a joblib plugin. I haven't tested the ray backend and its limitations, but possibly the joblib plugin can work with a ray cluster with some tiny modifications detailed in ray's docs. What's very nice about your plugin is that it provides options to configure a cluster and bring it up and down, which wouldn't be covered by the joblib plugin at all. But perhaps there can be synergy here.

Based on my answer to 1, I think this is only going to work if you are running from the head node.
I am not seeing a lot of value in using joblib here instead of using Ray more directly.
We can consider possible synergy with the joblib launcher once we have a solid Ray solution.

@jieru-hu
Contributor Author

Updated the pull request with support for returning JobReturns. I'm currently working on enabling the integration test suite. There are so many failures right now 😅

@lgtm-com
Contributor

lgtm-com bot commented Apr 13, 2020

This pull request introduces 2 alerts when merging ad3d76c into 079c325 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@lgtm-com
Contributor

lgtm-com bot commented Apr 13, 2020

This pull request introduces 2 alerts when merging ee7f500 into 079c325 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@jieru-hu jieru-hu requested a review from omry April 15, 2020 16:47
@jieru-hu jieru-hu marked this pull request as draft April 15, 2020 16:47
@jieru-hu jieru-hu marked this pull request as ready for review April 15, 2020 21:00
@jieru-hu
Contributor Author

Major changes for review:

  1. Separate the launcher into LOCAL & AWS modes/classes
  2. Add structured config
  3. Add a github action for updating *.whl (temporary, pending a longer-term solution)
  4. Clean up.

@jieru-hu jieru-hu marked this pull request as draft April 16, 2020 15:33
@jieru-hu jieru-hu marked this pull request as ready for review April 20, 2020 20:56
@jieru-hu
Contributor Author

Major changes for review:

  1. Add a script to upload local wheels to an S3 bucket
  2. Use structured config as a schema along with yaml files for configuring
  3. Clean up all temp files created, both local & remote
  4. Add the option to delete the cluster instead of just stopping it
  5. Populate job_id on the Ray cluster instead of locally
  6. Simplify the logic for rsync-up

@lgtm-com
Contributor

lgtm-com bot commented Apr 20, 2020

This pull request introduces 1 alert when merging d16e454 into 3206dc0 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@jieru-hu jieru-hu requested a review from omry April 21, 2020 17:29
Collaborator

@omry omry left a comment


good progress.

@jieru-hu jieru-hu marked this pull request as draft April 22, 2020 18:33
@jieru-hu jieru-hu requested a review from omry April 23, 2020 01:42
@jieru-hu jieru-hu marked this pull request as ready for review April 23, 2020 01:43
Review comments (outdated/resolved) on: utils/upload_to_s3/config.yaml, utils/upload_to_s3/upload_file_to_s3.py, website/docs/plugins/ray_launcher.md
@jieru-hu jieru-hu marked this pull request as draft April 23, 2020 20:55
@lgtm-com
Contributor

lgtm-com bot commented Apr 27, 2020

This pull request introduces 1 alert when merging 70d0564 into 52b6da8 - view on LGTM.com

new alerts:

  • 1 for Unused import

@jieru-hu
Contributor Author

Rebased onto master again and response to omry's comments:

I think we attempted it initially, but what do you think about factoring your config to allow something similar to how submitit is handling it?
hydra/launcher=ray_aws
hydra/launcher=ray_local (or maybe just ray?)

I'm not sure what you meant here; this is how the launcher config looks now :)
I like the submitit example app. I'm planning on splitting the example app into 2 apps:
one is the simple form, where you only need to override the launcher config to run (just like the submitit example);
the other demonstrates the download/upload function and will only be run for the aws launcher.
I will address this as part of #666

I am guessing ray is a dependency of the ray launcher, right?
What is the total size of the wheels you are uploading like that?

everything combined is 76MB; it doesn't actually download all of ray's dependencies.

Can you create a minimal standalone example of the problem (Ideally with just OmegaConf and Cloudpickle)?

Yes, I will follow up on this. I just created #1097. (I thought I had already created an issue to track this, but it looks like I forgot.)

@lgtm-com
Contributor

lgtm-com bot commented Oct 28, 2020

This pull request introduces 1 alert when merging 4d020e8 into e98f518 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@omry
Collaborator

omry commented Oct 28, 2020

Rebased onto master again and response to omry's comments:

I think we attempted it initially, but what do you think about factoring your config to allow something similar to how submitit is handling it?
hydra/launcher=ray_aws
hydra/launcher=ray_local (or maybe just ray?)

I'm not sure what you meant here; this is how the launcher config looks now :)
I like the submitit example app. I'm planning on splitting the example app into 2 apps:
one is the simple form, where you only need to override the launcher config to run (just like the submitit example);
the other demonstrates the download/upload function and will only be run for the aws launcher.
I will address this as part of #666

roger.

I am guessing ray is a dependency of the ray launcher, right?
What is the total size of the wheels you are uploading like that?

everything combined is 76MB; it doesn't actually download all of ray's dependencies.

76MB is still large; it can easily take 2-3 minutes to upload.
(Hydra itself is under 100KB.)
Let's see if this becomes a pain.

Can you create a minimal standalone example of the problem (Ideally with just OmegaConf and Cloudpickle)?

Yes, I will follow up on this. I just created #1097. (I thought I already created an issue to track but looks like I forgot to. )

roger.

Collaborator

@omry omry left a comment


It's time to merge this to master!

@omry omry merged commit 6244a3e into facebookresearch:master Oct 28, 2020
@jieru-hu
Contributor Author

jieru-hu commented Oct 28, 2020

It's time to merge this to master!

me right now: 🙀

I will keep an eye on the integration tests. I have yet to update the env variable for Hydra's circleCI.

jieru-hu added a commit to jieru-hu/hydra that referenced this pull request Nov 13, 2020
jieru-hu added a commit to jieru-hu/hydra that referenced this pull request Nov 16, 2020
jieru-hu added a commit to jieru-hu/hydra that referenced this pull request Nov 16, 2020
jieru-hu added a commit that referenced this pull request Nov 16, 2020
@jieru-hu jieru-hu deleted the ray-launcher-v2 branch July 20, 2021 21:56