Relax version match constraints #637

yadudoc · 2021-12-03T02:16:42Z

This PR removes the constraints enforced by the endpoint that the workers must have the same python3 version as itself.
In addition, there are some minimal tests that allow launching endpoints and workers with different conda envs such that the two components could have different versions. This allows us to run some tests against the backend components running versions from a test matrix.

Here are some selected lines from the results (EP_PY_V is endpoint python version):

TEST PASSED, EP_PY_V:3.6 WORKER_PY_V:3.6
TEST PASSED, EP_PY_V:3.6 WORKER_PY_V:3.7
TEST PASSED, EP_PY_V:3.6 WORKER_PY_V:3.8
TEST PASSED, EP_PY_V:3.6 WORKER_PY_V:3.9
TEST PASSED, EP_PY_V:3.7 WORKER_PY_V:3.6
TEST PASSED, EP_PY_V:3.7 WORKER_PY_V:3.7
TEST PASSED, EP_PY_V:3.7 WORKER_PY_V:3.8
TEST PASSED, EP_PY_V:3.7 WORKER_PY_V:3.9
TEST PASSED, EP_PY_V:3.8 WORKER_PY_V:3.6
TEST PASSED, EP_PY_V:3.8 WORKER_PY_V:3.7
TEST PASSED, EP_PY_V:3.8 WORKER_PY_V:3.8
TEST PASSED, EP_PY_V:3.8 WORKER_PY_V:3.9
TEST PASSED, EP_PY_V:3.9 WORKER_PY_V:3.6
TEST PASSED, EP_PY_V:3.9 WORKER_PY_V:3.7
TEST PASSED, EP_PY_V:3.9 WORKER_PY_V:3.8
TEST PASSED, EP_PY_V:3.9 WORKER_PY_V:3.9

Type of change

Choose which options apply, and delete the ones which do not apply.

New feature (non-breaking change that adds functionality)
Code maintentance/cleanup

…thon version for tests.

yadudoc · 2021-12-03T03:35:27Z

[sc-9610]

shortcut-integration · 2021-12-03T03:35:29Z

This pull request has been linked to Shortcut Story #9610: Don't require python version match between endpoint and workers.

benclifford

made some usability comments.

I massively encourage you to put the effort into runnings tests in CI rather than tossing them in the dustbin of time.

benclifford · 2021-12-03T12:52:21Z

funcx_endpoint/funcx_endpoint/executors/high_throughput/interchange.py

@@ -127,6 +127,7 @@ def __init__(self,
                 logging_level=logging.INFO,
                 endpoint_id=None,
                 suppress_failure=False,
+                 relax_manager_version_match=True,


i don't like naming of this: relax compared to what? merely to historic behaviour which will soon disappear into the mists of time

it's also a "don't do something = True" flag which is confusing when there are double negatives: eg false means: "don't not compare versions"

this is frustrating from a usability perspective.

a more positive expression of enabling behaviour would be something like: require_manager_endpoint_python_match (or something much shorter?) with inverted bool behaviour compared to this PR.

What do you think about removing this option entirely? I think having this option adds little value.

yeah I think thats probably fine

benclifford · 2021-12-03T12:55:07Z

funcx_endpoint/funcx_endpoint/executors/high_throughput/interchange.py

@@ -687,7 +689,9 @@ def start(self, poll_period=None):
                            msg['parsl_v'] != self.current_platform['parsl_v']):
                            logger.warn("[MAIN] Manager {} has incompatible version info with the interchange".format(manager))

-                            if self.suppress_failure is False:
+                            if self.relax_manager_version_match is True:
+                                logger.info("[MAIN] Ignoring version match requirements and continuing")


This logging (here and line 690) needs tidying: if we're expecting a user to be allowed to run different manager and interchange versions, we shouldn't be warning them that they're incompatible and talking about a version match "requirement" that is not a requirement.
If this is a supported feature, this is debug or info level information and should not be phrased this way.

Use "different" rather than "incompatible" ?

In line 691 below tell the user kill event is being set because actually in this code path it is a requirement that version match, unlike the default behaviour.

I've updated the text here to read:

logger.info(f"[MAIN] Manager:{manager} version:{msg['python_v']} does not match the interchange")

benclifford · 2021-12-03T12:58:39Z

funcx_endpoint/tests/version_mismatch_tests/config.py

+            init_blocks=1,
+            min_blocks=0,
+            max_blocks=1,
+            worker_init=("source ~/anaconda3/etc/profile.d/conda.sh; "


this is probably something specific to your development environment? - I don't have an ~/anaconda3

I'll add a note to the README that this bit needs to be updated. There was no intent from the start that would be part of regular testing, so I'm thinking this should be okay.

yadudoc · 2021-12-03T15:55:41Z

I just had a quick discussion with @benclifford about this, and we are agreed that the removal of the version checks need not be user-configurable. So I'll remove that option entirely. The tests are a slightly more complex story. Ideally, we want the tests to be run in CI to ensure that future changes continue to not require tight version match requirements. However, switching conda envs from a script, and then ensuring that conda envs are reliably started from the (grand) child processes, ie workers is a bit difficult. From past experience, virtualenvs are not great at this either. So, we might have to have a separate issue for making the version_mismatch_tests reliable for running under CI.

The interchange will note an info line if there's some version mismatch and continue. This can later be problematic if the funcx versions do not match.

codecov · 2021-12-03T17:06:37Z

Codecov Report

Merging #637 (dcd1935) into main (f6b7605) will increase coverage by 0.14%.
The diff coverage is 50.00%.

❗ Current head dcd1935 differs from pull request most recent head 8f72158. Consider uploading reports for the commit 8f72158 to get more accurate results

@@            Coverage Diff             @@
##             main     #637      +/-   ##
==========================================
+ Coverage   42.22%   42.36%   +0.14%     
==========================================
  Files          33       33              
  Lines        3313     3302      -11     
==========================================
  Hits         1399     1399              
+ Misses       1914     1903      -11

Impacted Files	Coverage Δ
..._endpoint/executors/high_throughput/interchange.py	`11.48% <50.00%> (+0.23%)`	⬆️
...ncx_endpoint/executors/high_throughput/executor.py	`32.63% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66620bf...8f72158. Read the comment docs.

benclifford · 2021-12-06T14:30:48Z

I have attempted one run of the manual tests on my laptop. Several combinations of tests passed, but it has hung without completion here:

Testing EP:python=3.8 against Worker:python=3.9
/home/benc/.conda/envs/funcx_version_mismatch_py3.8/bin/funcx-endpoint
Running ep with funcx_version_mismatch_py3.9 with python 3.9
Using conda env:funcx_version_mismatch_py3.9 for worker_init
2021-12-06 14:16:25 endpoint.endpoint_manager:193 [INFO]  Starting endpoint with uuid: b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
2021-12-06 14:16:26 endpoint.endpoint_manager:309 [INFO]  Launching endpoint daemon process
Running a version check against endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7f727e49b790 state=pending>
[BENC: some delay of several minutes here]
WebSocket connection closed unexpectedly

In interchange log (which is shared between all tests) I see this near the end:

2021-12-06 14:16:33.909 interchange:460 [INFO]  [TASK_PULL_THREAD] Starting
2021-12-06 14:16:33.913 interchange:719 [INFO]  Starting strategy.
2021-12-06 14:16:33.915 interchange.strategy.simple:48 [INFO]  Task breakdown [('interchange', 0, True)]
2021-12-06 14:16:33.944 interchange:763 [WARNING]  [MAIN] Got a non-json registration message from manager:b'dfff6da5f5c0'
2021-12-06 14:16:33.947 interchange:1020 [INFO]  Processed 0 tasks in 0.0390472412109375 seconds
2021-12-06 14:16:33.947 interchange:1021 [WARNING]  Exiting

I don't know what the non-json registration message is.

I will rerun these tests and see if the behaviour is reproducible.

benclifford · 2021-12-06T14:49:11Z

I'm not sure what the general behaviour is meant to be in this case to converge to a final result on the user side (outside of these tests) rather than what looks like a hang forever.

benclifford · 2021-12-06T14:55:31Z

On a restart of the tests, the first test, 3.6 vs 3.6 hangs

Testing EP:python=3.6 against Worker:python=3.6
/home/benc/.conda/envs/funcx_version_mismatch_py3.6/bin/funcx-endpoint
Running ep with funcx_version_mismatch_py3.6 with python 3.6
Using conda env:funcx_version_mismatch_py3.6 for worker_init
2021-12-06 14:49:56 endpoint.endpoint_manager:193 [INFO]  Starting endpoint with uuid: b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
2021-12-06 14:49:56 endpoint.endpoint_manager:202 [INFO]  Endpoint is already active
Running a version check against endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7f14a49919b0 state=pending>

endpoint.log reports:

2021-12-06 14:50:05.356 funcx_endpoint:359 [INFO]  [TASK_PULL_THREAD] Received task:b9cfed3e-955c-412a-9bea-35c4bd80f2dd
2021-12-06 14:50:05.357 funcx_endpoint:628 [ERROR]  [MAIN] Unhandled issue while waiting for pending tasks
Traceback (most recent call last):
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.8/lib/python3.8/site-packages/funcx_endpoint/endpoint/interchange.py", line 624, in _main_loop
    executor.submit_raw(task.pack())
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.8/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/executor.py", line 777, in submit_raw
    raise self._executor_exception
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.8/lib/python3.8/site-packages/funcx_endpoint/endpoint/interchange.py", line 624, in _main_loop
    executor.submit_raw(task.pack())
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.8/lib/python3.8/site-packages/funcx_endpoint/executors/high_throughput/executor.py", line 777, in submit_raw
    raise self._executor_exception
funcx_endpoint.executors.high_throughput.interchange.BadRegistration: Manager b'dfff6da5f5c0' attempted to register with a bad registration message. Caused a critical failure
2021-12-06 14:50:26 endpoint.results_ack:62 [INFO]  Unacked count: 0, Acked results since last check 0

and interchange.log contains no entries since the last run hung.

yadudoc · 2021-12-06T20:36:32Z

@benclifford, Could you try adding say a 3 min delay between tests? I'm a bit worried that managers that did not shut down somehow are connecting to the next test and messing up results. I just want to make sure that we are not blocking this functionality because the shutdown behavior is messing with tests.

… into relax_version_match_constraints

…akes a test failure more obvious

benclifford · 2021-12-16T13:46:35Z

After @yadudoc 's most recent changes which test exceptions end-to-end from worker to client, my main remaining concern was if serialised exceptions are deserialised inside the "service version" parts of the code (eg the interchange), rather than the "user version" parts of the code (eg the worker or the submitting process); or where exceptions are created and serialised in the "service version" parts of the code and expected to be deserialised in the "user version" code.

The most obvious case where this happens is the WorkerLost exception, which is injected as a result which does not come from a worker, by the interchange.

I added a test which kills the worker inside an app to the existing set of cross version tests, and ran it.

It successfully (after some time waiting to time out) returns a WorkerLost for 3.6 v 3.6.

The subsequent version pairing, 3.6 vs 3.7, hung on first (Running a version check against endpoint) test for 40 minutes or so, even though the future.result() call in that test has a 10 second timeout: future.result(timeout=10)

The output of this is pasted below.

Testing EP:python=3.6 against Worker:python=3.6
/home/benc/.conda/envs/funcx_version_mismatch_py3.6/bin/funcx-endpoint
Running ep with funcx_version_mismatch_py3.6 with python 3.6
Using conda env:funcx_version_mismatch_py3.6 for worker_init
2021-12-16 12:59:51 endpoint.endpoint_manager:193 [INFO]  Starting endpoint with uuid: b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
2021-12-16 12:59:51 endpoint.endpoint_manager:309 [INFO]  Launching endpoint daemon process
Running a version check against endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7f67e258d6d8 state=pending>
Expected worker_version : 3.6, actual: 3.6
Worker returned the expected version:3.6
Checking exceptions from app on endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7f67e258ddd8 state=pending>
Worker returned the correct exception
Testing manager kill to hopefully provoke a ManagerLost
Future launched with future:<Future at 0x7f67e258d320 state=pending>
Worker returned the correct exception
RC=0
TEST PASSED, EP_PY_V:3.6 WORKER_PY_V:3.6
Stopping endpoint in 2s
2021-12-16 13:02:11 endpoint.endpoint_manager:397 [INFO]  Endpoint <mismatched> is now stopped
Testing EP:python=3.6 against Worker:python=3.7
/home/benc/.conda/envs/funcx_version_mismatch_py3.6/bin/funcx-endpoint
Running ep with funcx_version_mismatch_py3.7 with python 3.7
Using conda env:funcx_version_mismatch_py3.7 for worker_init
2021-12-16 13:02:17 endpoint.endpoint_manager:193 [INFO]  Starting endpoint with uuid: b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
2021-12-16 13:02:18 endpoint.endpoint_manager:309 [INFO]  Launching endpoint daemon process
Running a version check against endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7f527fb6e748 state=pending>
WebSocket connection closed unexpectedly

The added test is this:

def kill_managers():
    import os

    os.system("killall funcx-manager")


def test_kill_manager(fx, ep_id, ep_version, version):
    import sys

    print("Testing manager kill to hopefully provoke a ManagerLost")

    future = fx.submit(kill_managers, endpoint_id=ep_id)
    print(f"Future launched with future:{future}")
    try:
        future.result(timeout=600) # leave a longish time for this timeout...
    except ManagerLost:
        print("Worker returned the correct exception")
    except Exception as e:
        print(f"Expected ValueError, actual: {repr(e)}")
        print("Exiting...")
        sys.exit(1)
    else:
        print("No exception, expected ValueError")
        sys.exit(1)

benclifford · 2021-12-16T14:44:38Z

In a subsequent run, many combinations passed ok, but the tests eventually hung with something similar here - the last line of output about 10 mins after the last timestamp in the output

Testing EP:python=3.8 against Worker:python=3.9
/home/benc/.conda/envs/funcx_version_mismatch_py3.8/bin/funcx-endpoint
Running ep with funcx_version_mismatch_py3.9 with python 3.9
Using conda env:funcx_version_mismatch_py3.9 for worker_init
2021-12-16 14:31:23 endpoint.endpoint_manager:193 [INFO]  Starting endpoint with uuid: b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
2021-12-16 14:31:23 endpoint.endpoint_manager:309 [INFO]  Launching endpoint daemon process
Running a version check against endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7f11c1dd1880 state=pending>


WebSocket connection closed unexpectedly

benclifford · 2021-12-16T14:51:03Z

trying all of these again with a longer delay between tests in run_test_matrix

benclifford · 2021-12-16T15:39:41Z

I think those failures are unrelated to this PR.

I'm happy to approve this PR if the tests are clearly labelled as broken.

benclifford · 2021-12-16T16:05:08Z

also got this on another run, which is probably(?) unrelated to this change:

Testing EP:python=3.7 against Worker:python=3.7
/home/benc/.conda/envs/funcx_version_mismatch_py3.7/bin/funcx-endpoint
Running ep with funcx_version_mismatch_py3.7 with python 3.7
Using conda env:funcx_version_mismatch_py3.7 for worker_init
2021-12-16 16:00:22 endpoint.endpoint_manager:193 [INFO]  Starting endpoint with uuid: b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
2021-12-16 16:00:22 endpoint.endpoint_manager:309 [INFO]  Launching endpoint daemon process
Running a version check against endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7fcb9664a490 state=pending>
Expected worker_version : 3.7, actual: 3.7
Worker returned the expected version:3.7
Checking exceptions from app on endpoint:b5b9fbdf-33cd-4cdf-b04e-016a21ccacea
Future launched with future:<Future at 0x7fcb9664a3d0 state=pending>
Worker returned the correct exception
Testing manager kill to hopefully provoke a ManagerLost
Error in registering kill_managers
Traceback (most recent call last):
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/http/client.py", line 1373, in getresponse
    response.begin()
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api2.funcx.org', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/globus_sdk/base.py", line 463, in send_request
    timeout=self._http_timeout,
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api2.funcx.org', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/benc/parsl/funcx/src/funcX/funcx_endpoint/tests/version_mismatch_tests/test_mismatched.py", line 112, in <module>
    test_kill_manager(fx, args.endpoint_id, args.ep_version, args.worker_version)
  File "/home/benc/parsl/funcx/src/funcX/funcx_endpoint/tests/version_mismatch_tests/test_mismatched.py", line 71, in test_kill_manager
    future = fx.submit(kill_managers, endpoint_id=ep_id)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/funcx/sdk/executor.py", line 153, in submit
    container_uuid=container_uuid,
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/funcx/sdk/client.py", line 667, in register_function
    r = self.post("/functions", json_body=data)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/funcx/sdk/error_handling_client.py", line 33, in post
    r = ThrottledBaseClient.post(self, path, **kwargs)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/globus_sdk/base.py", line 263, in post
    retry_401=retry_401,
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/funcx/sdk/utils/throttling.py", line 84, in _request
    return super()._request(*args, **kwargs)
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/globus_sdk/base.py", line 470, in _request
    r = send_request()
  File "/home/benc/.conda/envs/funcx_version_mismatch_py3.7/lib/python3.7/site-packages/globus_sdk/base.py", line 467, in send_request
    raise exc.convert_request_exception(e)
globus_sdk.exc.GlobusTimeoutError: TimeoutError on request
RC=1
TEST FAILED, EP_PY_V:3.7 WORKER_PY_V:3.7

benclifford

approving this with the proviso that the tests don't reliably work for me but I am led to believe that's because funcx master is broken, rather than the changes from this PR.

benclifford · 2021-12-18T16:08:39Z

There is still a warning message in the interchange:

                            logger.warning("[MTHREAD] Executor shutting down due to version mismatch in interchange")

which is incorrectly phrased I think: that warning will never fire in the presence of version mismatches (as there are none now), but will fire in the case of other interchange exceptions.

This was fixed in the Parsl fork of the HighThroughputExecutor in Parsl/parsl#861 and you should consider porting that PR to funcx.

* Fix linting on funcx_sdk/funcx/tests/ * Simplify instructions for installing endpoint secrets to cluster * Fix CI and the way dependencies are specified test and dev are extras for installing tools, no multiple requirements files scattered across the repo. CI workflows should not be installing multiple distinct applications into a single venv and then trying to test. This muddies the waters and makes the purpose and requirements of each CI step unclear. Therefore, in addition to putting the requirement data where it really belongs, this fixes the workflows to specify a better isolated test configuration. Workflow jobs are used more thoughtfully to produce isolated VMs for distinct build and testing processes. New daily tests run Monday-through-Friday in the AM for US timezones, doing `safety check` and little else. Safety checking on dependencies is done separately for funcx-sdk and funcx-endpoint. The CI workflow for testing PRs and branches also does `safety check`, but importantly the hourly test does not. Notifications from the daily and hourly workflows are done in a separate dedicated job. This means that new jobs can be added upstream without needing distinct or rewritten notification logic. The absence of any pytest run for the funcx-sdk package is more obvious and identified properly as a problem. It is paired with the "import test" so that when enabled, the import test removal will be localized to the same area of the workflow file. Linting is a dedicated job which runs as part of the CI workflow and uses pre-commit on the full repo. No additional lint steps (e.g. one-off flake8 runs) are used or should be needed. Testing documentation is updated minimally to refer to the installation of extras, rather than requirements files. The doc site requirements for some (undocumented) reason are installing funcx-sdk test requirements. This has been removed. Any documentation requirements can continue to be specified in the doc directory, or under a new `[docs]` extra if necessary. It should not "inherit" the requirements of the testing process. The codecov integration does not deliver value in the absence of good unit testing and a test matrix of size greater than 1. Therefore, until such a time as codecov is useful and a worthwhile configuration can be identified, the integration has been removed. The funcx-endpoint setup.py file was autoformatted with `black` to simplify the formatting of the requirements data. No workflow steps are allowed to use `actions/checkout@master`. This is unnecessary and dangerous usage. `checkout@v2` is used -- as the GitHub Action's own documentation instructs users to do -- instead. * Port parsl BadRegistration error message clarification to funcx fork This clarification was introduced as part of Parsl/parsl#980 I have not examined the rest of that PR for relevance to this fork. * Remove unused BadRegistration exception class This looks like it was excessive copy-paste from Parsl source code which has had an unused BadRegistration class since parsl PR Parsl/parsl#1671 * Broaden htex interchange->executor error message phrasing Previously the error message claimed that the fatal error was due to version mismatch. This is not true: this same fatal error may occur in the case of a BadRegistration. This PR rephrases the error to be less misleading. I have encountered this repeatedly during review of #909, and at least I think Yadu was getting it while hacking on this code. * Expand daily build to all days * Remove internal RedisQueue object and dependents RedisQueue is now provided by funcx-common if needed. 1. Remove `funcx_endpoint.queues` 2. Remove `funcx_endpoint/tests/integration/test_redis.py` (this just tests the removed subpackage) 3. Remove `funcx_endpoint.mock_broker`, which relies on RedisQueue (3) is the most questionable of these three. However, `mock_broker` does not appear to be used anywhere. Furthermore, even if a mocked Forwarder object is needed for some reason, there doesn't appear to be any reason that it should be provided by the `funcx-endpoint` package. * Update smoke test config with version, dev env version=0.3.5 add a "dev" block to the config for ease of use. black and isort on the file * Make smoke tests check min version, not exact Check against the API response with a loose version comparison, not an equality check. This means that the smoke tests set a "floor" for what versions are allowed, but do not fail when a release is done. * Add a Polaris config example * Flake8 * Removing comment # edtb-02 * Update linting config and fix Remove flake8 ignores for various rules: only ignore the rules which must be ignored to be black-compatible. Remove exclusions from pre-commit config. * Fix subject line in hourly smoke test * Remove async testing until it can be done safely * Adding a minimal s3 smoke test * Adding funcx_endpoint to install list * Marking test for large args to skip * Adding fixme note to gh workflow, and minor fixes to tests. * Remove all use of async code from smoke tests Do not use the executor interface until it can be repaired. Use a simple polling loop to wait for results. Turn off throttling on the client so that we don't spuriously fail a job. * Begin login config rewrite Structure logging config more tightly, removing odd naming and highly dynamic, imperative configuration of loggers. The new structure is that there is a single method which can be used to reconfigure logging on an as-needed basis (e.g. to turn off logging to stdio). Logging is configured at each entrypoint, and in one special case when a daemon is being launched. All loggers are initialized in module scope as log = logging.getLogger(__name__) Any logger creation not matching this exact pattern is considered incorrect and is fixed to match. * Update logging in Interchange - Use setup_logging() to replace global logger object configured via set_file_logger() - Cleanup CLI argument passing, as it relates to logging arguments - Remove logging attributes from the Interchange object This should, for the most part, not change the semantics of any code. However, it is worth noting that it reduces the maxbytes on the logfile from 256MB to 100MB. This could be adjusted back up by allowing `setup_logging()` to take a parameter for maxbytes, but for now this is not being done. * Fix logging configuration in worker Also cleanup use of loggers in various modules where impact is low. * Fix various issues in logging refactor log_max_bytes is no longer supported and needs to be removed. logging_level similarly needs to be removed. console logging needs to be disabled within the daemon. * Remove logging helpers; fix "asyncio" logger These were originally going to be left untouched, but the fact that they were configuring logging for "asyncio" makes it hard to see a way to salvage them. The logger for "asyncio" _does not_ belong to funcx. We absolutely should not be using that logger internally for funcx-related activities. Fix the references to the "asyncio" logger to use `__name__` as is normal, and remove the imperative logging configuration helpers. * Install funcx_endpoint for smoke tests * Patch default logfile in tests to handle CI In GitHub actions, the logfile handler fails to configure. The exact cause is unclear, but it may be that there's an issue with access to the default logfile or the default log directory. Put the default log path into a tmpdir to fix this. * Raise result size limit to 10MB And fix `sys.getsizeof` on a bytestring to use `len` instead. * WSPollingTask now has a close method and handle_incoming return False on remote disconnect. The executor closes and reconnects WS on remote disconnect. * Adding a long duration test. * Relax version match constraints (#637) * Relax version requirements * Adding test scripts that launch endpoint and workers of mismatched python version for tests. * Removing the `relax_manager_version_match` option. The interchange will note an info line if there's some version mismatch and continue. This can later be problematic if the funcx versions do not match. * Updating configs and README with note to update paths to user conda. * Module import ordering fixes for isort * Black auto-fixes * Be explicit about how to create conda environments * Make update_all fail when a component fails * Adding a test that raises an exception on the worker side * use unixy exit codes * abort tests on failure rather than continuing fairly quietly - this makes a test failure more obvious * Add another test to kill worker * Minor format fix * Remove unnecessary f-string * logger -> log Co-authored-by: Ben Clifford <[email protected]> * Fix broken link to parsl docs * (Re)Apply all autofixers black, isort, and pyupgrade are all (re)converged on the files modified under the task-cancel branch. * Apply changes from 'main' to task-cancel This is the result of - `git checkout --patch main` - manual edits to make each file agree with flake8 and other linting Co-authored-by: Ben Galewsky <[email protected]> Co-authored-by: Ben Clifford <[email protected]> Co-authored-by: Yadu Nand Babuji <[email protected]> Co-authored-by: Ryan Chard <[email protected]> Co-authored-by: Wes Brewer <[email protected]> Co-authored-by: Daniel S. Katz <[email protected]>

* Adding basic interfaces for task cancel * Adding a BadCommand message type to report unrecognized commands * FuncX worker will log a SIGTERM and exit * Removing dead code in _cancel and adding better logging when a task cancel returns a redundant result * Support for cancelling task on the manager * Adding support for task_cancel in the interchange * Adding a FuncXFuture that has a cancel() method * Updating the __str__ and removing the __repr__ * Cleaner task_cancel on manager * Replace worker when task is cancelled via terminating worker * Remove task from internal tracking * Minor refactoring * Remove bad print statements * New mechanism to trap cancelled tasks before they are dispatched. * Adding tests for task_cancel * Collection of minor fixes * Fixed bad doc string from TaskCancelled exception * Updated signal handler in funcx_worker to exit with log message on receiving SIGTERM * Updated FuncXFuture to use cancel from parent class. * Tighten exception handling * Updating tests for task_cancel * Fixed bad docstring * Clarify log message. A task_id is put on the task_cancel_pending_trap when it not found on a manager. It is possible that the task is already complete, in-flight, or waiting dispatch in the interchange's queues. * Minor restructuring of an awkward conditional ladder * Fixing whitespace issues from github conflict resolution * Using sys.exit instead of exit * Rename FuncXFuture -> HTEXFuture * Updating logic in task dispatch logic to ensure that cancel messages follow task dispatch. Updating log lines * best_effort_cancel added as a replacement for cancel. Added docstrings for best_effort_cancel. cancel will raise a NotImplemented error nudging the user to try best_effort_cancel * Changing log level. * Best effort cancel delegates cancellation to the executor. Executor relies on the futures base cancel method Added exception handling around executors result setting to avoid cancelled tasks. * Removed logging and moved test-cancel to funcx-endpoint * Defaulting suppress_failure=True and better exception handling. * Setting suppress_failure=True as default to ignore incoming connections from un-registered managers. This avoid lingering managers from a previous session causing a shutdown * Added exception handling for potential race-condition between user invoked task cancellation, and task failure from system shutdown. * Add a new start_remove_worker method that pairs with remove_worker * Fix race-condition, and use start_remove_worker * Both task cancel and task completion can call remove_task on the manager, this is now protected by a lock. * Remove_task now updates self.task_done_counter * Removed unnecessary logging * Adding some logging to better track tasks that are stuck * Updating log lines to use Task:<task_id> for easy grepping. * Renaming loop var for clarity and updating logline for clarity * Task cancel reconverge linting (#656) * Fix linting on funcx_sdk/funcx/tests/ * Simplify instructions for installing endpoint secrets to cluster * Fix CI and the way dependencies are specified test and dev are extras for installing tools, no multiple requirements files scattered across the repo. CI workflows should not be installing multiple distinct applications into a single venv and then trying to test. This muddies the waters and makes the purpose and requirements of each CI step unclear. Therefore, in addition to putting the requirement data where it really belongs, this fixes the workflows to specify a better isolated test configuration. Workflow jobs are used more thoughtfully to produce isolated VMs for distinct build and testing processes. New daily tests run Monday-through-Friday in the AM for US timezones, doing `safety check` and little else. Safety checking on dependencies is done separately for funcx-sdk and funcx-endpoint. The CI workflow for testing PRs and branches also does `safety check`, but importantly the hourly test does not. Notifications from the daily and hourly workflows are done in a separate dedicated job. This means that new jobs can be added upstream without needing distinct or rewritten notification logic. The absence of any pytest run for the funcx-sdk package is more obvious and identified properly as a problem. It is paired with the "import test" so that when enabled, the import test removal will be localized to the same area of the workflow file. Linting is a dedicated job which runs as part of the CI workflow and uses pre-commit on the full repo. No additional lint steps (e.g. one-off flake8 runs) are used or should be needed. Testing documentation is updated minimally to refer to the installation of extras, rather than requirements files. The doc site requirements for some (undocumented) reason are installing funcx-sdk test requirements. This has been removed. Any documentation requirements can continue to be specified in the doc directory, or under a new `[docs]` extra if necessary. It should not "inherit" the requirements of the testing process. The codecov integration does not deliver value in the absence of good unit testing and a test matrix of size greater than 1. Therefore, until such a time as codecov is useful and a worthwhile configuration can be identified, the integration has been removed. The funcx-endpoint setup.py file was autoformatted with `black` to simplify the formatting of the requirements data. No workflow steps are allowed to use `actions/checkout@master`. This is unnecessary and dangerous usage. `checkout@v2` is used -- as the GitHub Action's own documentation instructs users to do -- instead. * Port parsl BadRegistration error message clarification to funcx fork This clarification was introduced as part of Parsl/parsl#980 I have not examined the rest of that PR for relevance to this fork. * Remove unused BadRegistration exception class This looks like it was excessive copy-paste from Parsl source code which has had an unused BadRegistration class since parsl PR Parsl/parsl#1671 * Broaden htex interchange->executor error message phrasing Previously the error message claimed that the fatal error was due to version mismatch. This is not true: this same fatal error may occur in the case of a BadRegistration. This PR rephrases the error to be less misleading. I have encountered this repeatedly during review of #909, and at least I think Yadu was getting it while hacking on this code. * Expand daily build to all days * Remove internal RedisQueue object and dependents RedisQueue is now provided by funcx-common if needed. 1. Remove `funcx_endpoint.queues` 2. Remove `funcx_endpoint/tests/integration/test_redis.py` (this just tests the removed subpackage) 3. Remove `funcx_endpoint.mock_broker`, which relies on RedisQueue (3) is the most questionable of these three. However, `mock_broker` does not appear to be used anywhere. Furthermore, even if a mocked Forwarder object is needed for some reason, there doesn't appear to be any reason that it should be provided by the `funcx-endpoint` package. * Update smoke test config with version, dev env version=0.3.5 add a "dev" block to the config for ease of use. black and isort on the file * Make smoke tests check min version, not exact Check against the API response with a loose version comparison, not an equality check. This means that the smoke tests set a "floor" for what versions are allowed, but do not fail when a release is done. * Add a Polaris config example * Flake8 * Removing comment # edtb-02 * Update linting config and fix Remove flake8 ignores for various rules: only ignore the rules which must be ignored to be black-compatible. Remove exclusions from pre-commit config. * Fix subject line in hourly smoke test * Remove async testing until it can be done safely * Adding a minimal s3 smoke test * Adding funcx_endpoint to install list * Marking test for large args to skip * Adding fixme note to gh workflow, and minor fixes to tests. * Remove all use of async code from smoke tests Do not use the executor interface until it can be repaired. Use a simple polling loop to wait for results. Turn off throttling on the client so that we don't spuriously fail a job. * Begin login config rewrite Structure logging config more tightly, removing odd naming and highly dynamic, imperative configuration of loggers. The new structure is that there is a single method which can be used to reconfigure logging on an as-needed basis (e.g. to turn off logging to stdio). Logging is configured at each entrypoint, and in one special case when a daemon is being launched. All loggers are initialized in module scope as log = logging.getLogger(__name__) Any logger creation not matching this exact pattern is considered incorrect and is fixed to match. * Update logging in Interchange - Use setup_logging() to replace global logger object configured via set_file_logger() - Cleanup CLI argument passing, as it relates to logging arguments - Remove logging attributes from the Interchange object This should, for the most part, not change the semantics of any code. However, it is worth noting that it reduces the maxbytes on the logfile from 256MB to 100MB. This could be adjusted back up by allowing `setup_logging()` to take a parameter for maxbytes, but for now this is not being done. * Fix logging configuration in worker Also cleanup use of loggers in various modules where impact is low. * Fix various issues in logging refactor log_max_bytes is no longer supported and needs to be removed. logging_level similarly needs to be removed. console logging needs to be disabled within the daemon. * Remove logging helpers; fix "asyncio" logger These were originally going to be left untouched, but the fact that they were configuring logging for "asyncio" makes it hard to see a way to salvage them. The logger for "asyncio" _does not_ belong to funcx. We absolutely should not be using that logger internally for funcx-related activities. Fix the references to the "asyncio" logger to use `__name__` as is normal, and remove the imperative logging configuration helpers. * Install funcx_endpoint for smoke tests * Patch default logfile in tests to handle CI In GitHub actions, the logfile handler fails to configure. The exact cause is unclear, but it may be that there's an issue with access to the default logfile or the default log directory. Put the default log path into a tmpdir to fix this. * Raise result size limit to 10MB And fix `sys.getsizeof` on a bytestring to use `len` instead. * WSPollingTask now has a close method and handle_incoming return False on remote disconnect. The executor closes and reconnects WS on remote disconnect. * Adding a long duration test. * Relax version match constraints (#637) * Relax version requirements * Adding test scripts that launch endpoint and workers of mismatched python version for tests. * Removing the `relax_manager_version_match` option. The interchange will note an info line if there's some version mismatch and continue. This can later be problematic if the funcx versions do not match. * Updating configs and README with note to update paths to user conda. * Module import ordering fixes for isort * Black auto-fixes * Be explicit about how to create conda environments * Make update_all fail when a component fails * Adding a test that raises an exception on the worker side * use unixy exit codes * abort tests on failure rather than continuing fairly quietly - this makes a test failure more obvious * Add another test to kill worker * Minor format fix * Remove unnecessary f-string * logger -> log Co-authored-by: Ben Clifford <[email protected]> * Fix broken link to parsl docs * (Re)Apply all autofixers black, isort, and pyupgrade are all (re)converged on the files modified under the task-cancel branch. * Apply changes from 'main' to task-cancel This is the result of - `git checkout --patch main` - manual edits to make each file agree with flake8 and other linting Co-authored-by: Ben Galewsky <[email protected]> Co-authored-by: Ben Clifford <[email protected]> Co-authored-by: Yadu Nand Babuji <[email protected]> Co-authored-by: Ryan Chard <[email protected]> Co-authored-by: Wes Brewer <[email protected]> Co-authored-by: Daniel S. Katz <[email protected]> * isort fixes Co-authored-by: Stephen Rosen <[email protected]> Co-authored-by: Ben Galewsky <[email protected]> Co-authored-by: Ben Clifford <[email protected]> Co-authored-by: Ryan Chard <[email protected]> Co-authored-by: Wes Brewer <[email protected]> Co-authored-by: Daniel S. Katz <[email protected]>

yadudoc added 2 commits October 25, 2021 11:41

Relax version requirements

ba2b275

Adding test scripts that launch endpoint and workers of mismatched py…

b526987

…thon version for tests.

yadudoc requested review from benclifford, sirosen and joshbryan-globus December 3, 2021 02:16

benclifford reviewed Dec 3, 2021

View reviewed changes

yadudoc added 2 commits December 3, 2021 10:54

Removing the relax_manager_version_match option.

4b8a2ae

The interchange will note an info line if there's some version mismatch and continue. This can later be problematic if the funcx versions do not match.

Updating configs and README with note to update paths to user conda.

dcd1935

This was referenced Dec 3, 2021

funcX version match requirements between interchange and workers #638

Open

Support for running version mismatch tests in CI #639

Open

yadudoc and others added 5 commits December 3, 2021 11:19

Merge branch 'main' into relax_version_match_constraints

c90b793

Module import ordering fixes for isort

90789a7

Black auto-fixes

9dc5133

Be explicit about how to create conda environments

299a4b8

Make update_all fail when a component fails

d925be3

yadudoc and others added 4 commits December 15, 2021 15:15

Adding a test that raises an exception on the worker side

4a290f9

Merge remote-tracking branch 'origin/relax_version_match_constraints'…

9f20b7d

… into relax_version_match_constraints

use unixy exit codes

55f5e2a

abort tests on failure rather than continuing fairly quietly - this m…

1939969

…akes a test failure more obvious

Add another test to kill worker

c399695

benclifford approved these changes Dec 17, 2021

View reviewed changes

benclifford mentioned this pull request Dec 18, 2021

Support for cancelling tasks in HighThroughputExecutor #606

Merged

joshbryan-globus approved these changes Dec 20, 2021

View reviewed changes

yadudoc added 4 commits December 20, 2021 16:24

Minor format fix

c937133

Remove unnecessary f-string

89e92c5

Merge branch 'main' into relax_version_match_constraints

6418fde

logger -> log

8f72158

yadudoc merged commit 68dfe73 into main Dec 20, 2021

yadudoc deleted the relax_version_match_constraints branch December 20, 2021 23:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Relax version match constraints #637

Relax version match constraints #637

yadudoc commented Dec 3, 2021

yadudoc commented Dec 3, 2021

shortcut-integration bot commented Dec 3, 2021

benclifford left a comment

benclifford Dec 3, 2021

yadudoc Dec 3, 2021

benclifford Dec 3, 2021

benclifford Dec 3, 2021

yadudoc Dec 3, 2021

benclifford Dec 3, 2021 •

edited

Loading

yadudoc Dec 3, 2021

yadudoc commented Dec 3, 2021

codecov bot commented Dec 3, 2021 •

edited

Loading

benclifford commented Dec 6, 2021

benclifford commented Dec 6, 2021

benclifford commented Dec 6, 2021

yadudoc commented Dec 6, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford left a comment

benclifford commented Dec 18, 2021

Relax version match constraints #637

Relax version match constraints #637

Conversation

yadudoc commented Dec 3, 2021

Type of change

yadudoc commented Dec 3, 2021

shortcut-integration bot commented Dec 3, 2021

benclifford left a comment

Choose a reason for hiding this comment

benclifford Dec 3, 2021

Choose a reason for hiding this comment

yadudoc Dec 3, 2021

Choose a reason for hiding this comment

benclifford Dec 3, 2021

Choose a reason for hiding this comment

benclifford Dec 3, 2021

Choose a reason for hiding this comment

yadudoc Dec 3, 2021

Choose a reason for hiding this comment

benclifford Dec 3, 2021 • edited Loading

Choose a reason for hiding this comment

yadudoc Dec 3, 2021

Choose a reason for hiding this comment

yadudoc commented Dec 3, 2021

codecov bot commented Dec 3, 2021 • edited Loading

Codecov Report

benclifford commented Dec 6, 2021

benclifford commented Dec 6, 2021

benclifford commented Dec 6, 2021

yadudoc commented Dec 6, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford commented Dec 16, 2021

benclifford left a comment

Choose a reason for hiding this comment

benclifford commented Dec 18, 2021

benclifford Dec 3, 2021 •

edited

Loading

codecov bot commented Dec 3, 2021 •

edited

Loading