Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manager errors #698 #980

Merged
merged 6 commits into from
May 29, 2019
Merged

Manager errors #698 #980

merged 6 commits into from
May 29, 2019

Conversation

yadudoc
Copy link
Member

@yadudoc yadudoc commented May 28, 2019

Adds str magic methods missing from exceptions so that there's cleaner exception reporting in the event of system failures like: Manager dying, managers connecting with malformed registration messages etc. Fixes #698.

Fixes a KeyError raised due to a missing value in a dict.

@yadudoc yadudoc added this to the Parsl-0.8.0 milestone May 28, 2019
@yadudoc yadudoc requested review from annawoodard and ZhuozhaoLi May 28, 2019 20:52
@yadudoc yadudoc self-assigned this May 28, 2019
annawoodard
annawoodard previously approved these changes May 28, 2019
parsl/executors/high_throughput/interchange.py Outdated Show resolved Hide resolved
@yadudoc yadudoc merged commit 730cb89 into master May 29, 2019
@yadudoc yadudoc deleted the manager_errors_#698 branch May 29, 2019 17:46
benclifford added a commit to globus/globus-compute that referenced this pull request Nov 2, 2021
This clarification was introduced as part of
Parsl/parsl#980

I have not examined the rest of that PR for relevance to this fork.
yadudoc added a commit to globus/globus-compute that referenced this pull request Jan 7, 2022
* Fix linting on funcx_sdk/funcx/tests/

* Simplify instructions for installing endpoint secrets to cluster

* Fix CI and the way dependencies are specified

test and dev are extras for installing tools, no multiple requirements
files scattered across the repo.

CI workflows should not be installing multiple distinct applications
into a single venv and then trying to test. This muddies the waters
and makes the purpose and requirements of each CI step unclear.

Therefore, in addition to putting the requirement data where it really
belongs, this fixes the workflows to specify a better isolated test
configuration.

Workflow jobs are used more thoughtfully to produce isolated VMs for
distinct build and testing processes.

New daily tests run Monday-through-Friday in the AM for US timezones,
doing `safety check` and little else. Safety checking on dependencies
is done separately for funcx-sdk and funcx-endpoint.

The CI workflow for testing PRs and branches also does `safety check`,
but importantly the hourly test does not.

Notifications from the daily and hourly workflows are done in a
separate dedicated job. This means that new jobs can be added upstream
without needing distinct or rewritten notification logic.

The absence of any pytest run for the funcx-sdk package is more
obvious and identified properly as a problem. It is paired with the
"import test" so that when enabled, the import test removal will be
localized to the same area of the workflow file.

Linting is a dedicated job which runs as part of the CI workflow and
uses pre-commit on the full repo. No additional lint steps (e.g.
one-off flake8 runs) are used or should be needed.

Testing documentation is updated minimally to refer to the
installation of extras, rather than requirements files.

The doc site requirements for some (undocumented) reason are
installing funcx-sdk test requirements. This has been removed. Any
documentation requirements can continue to be specified in the doc
directory, or under a new `[docs]` extra if necessary. It should not
"inherit" the requirements of the testing process.

The codecov integration does not deliver value in the absence of good
unit testing and a test matrix of size greater than 1. Therefore,
until such a time as codecov is useful and a worthwhile configuration
can be identified, the integration has been removed.

The funcx-endpoint setup.py file was autoformatted with `black` to
simplify the formatting of the requirements data.

No workflow steps are allowed to use `actions/checkout@master`. This
is unnecessary and dangerous usage. `checkout@v2` is used -- as the
GitHub Action's own documentation instructs users to do -- instead.

* Port parsl BadRegistration error message clarification to funcx fork

This clarification was introduced as part of
Parsl/parsl#980

I have not examined the rest of that PR for relevance to this fork.

* Remove unused BadRegistration exception class

This looks like it was excessive copy-paste from Parsl source code which
has had an unused BadRegistration class since parsl PR
Parsl/parsl#1671

* Broaden htex interchange->executor error message phrasing

Previously the error message claimed that the fatal error was due to
version mismatch.

This is not true: this same fatal error may occur in the case of a
BadRegistration.

This PR rephrases the error to be less misleading.

I have encountered this repeatedly during review of #909, and at
least I think Yadu was getting it while hacking on this code.

* Expand daily build to all days

* Remove internal RedisQueue object and dependents

RedisQueue is now provided by funcx-common if needed.

1. Remove `funcx_endpoint.queues`
2. Remove `funcx_endpoint/tests/integration/test_redis.py`
   (this just tests the removed subpackage)
3. Remove `funcx_endpoint.mock_broker`, which relies on RedisQueue

(3) is the most questionable of these three. However, `mock_broker`
does not appear to be used anywhere. Furthermore, even if a mocked
Forwarder object is needed for some reason, there doesn't appear to be
any reason that it should be provided by the `funcx-endpoint` package.

* Update smoke test config with version, dev env

version=0.3.5

add a "dev" block to the config for ease of use.

black and isort on the file

* Make smoke tests check min version, not exact

Check against the API response with a loose version comparison, not an
equality check. This means that the smoke tests set a "floor" for what
versions are allowed, but do not fail when a release is done.

* Add a Polaris config example

* Flake8

* Removing comment # edtb-02

* Update linting config and fix

Remove flake8 ignores for various rules: only ignore the rules which
must be ignored to be black-compatible.

Remove exclusions from pre-commit config.

* Fix subject line in hourly smoke test

* Remove async testing until it can be done safely

* Adding a minimal s3 smoke test

* Adding funcx_endpoint to install list

* Marking test for large args to skip

* Adding fixme note to gh workflow, and minor fixes to tests.

* Remove all use of async code from smoke tests

Do not use the executor interface until it can be repaired. Use a
simple polling loop to wait for results.

Turn off throttling on the client so that we don't spuriously fail a
job.

* Begin login config rewrite

Structure logging config more tightly, removing odd naming and highly
dynamic, imperative configuration of loggers.

The new structure is that there is a single method which can be used
to reconfigure logging on an as-needed basis (e.g. to turn off logging
to stdio).

Logging is configured at each entrypoint, and in one special case when
a daemon is being launched.

All loggers are initialized in module scope as

    log = logging.getLogger(__name__)

Any logger creation not matching this exact pattern is considered
incorrect and is fixed to match.

* Update logging in Interchange

- Use setup_logging() to replace global logger object configured via
  set_file_logger()
- Cleanup CLI argument passing, as it relates to logging arguments
- Remove logging attributes from the Interchange object

This should, for the most part, not change the semantics of any code.
However, it is worth noting that it reduces the maxbytes on the
logfile from 256MB to 100MB. This could be adjusted back up by
allowing `setup_logging()` to take a parameter for maxbytes, but for
now this is not being done.

* Fix logging configuration in worker

Also cleanup use of loggers in various modules where impact is low.

* Fix various issues in logging refactor

log_max_bytes is no longer supported and needs to be removed.
logging_level similarly needs to be removed. console logging needs to
be disabled within the daemon.

* Remove logging helpers; fix "asyncio" logger

These were originally going to be left untouched, but the fact that
they were configuring logging for "asyncio" makes it hard to see a way
to salvage them.

The logger for "asyncio" _does not_ belong to funcx. We absolutely
should not be using that logger internally for funcx-related
activities.

Fix the references to the "asyncio" logger to use `__name__` as is
normal, and remove the imperative logging configuration helpers.

* Install funcx_endpoint for smoke tests

* Patch default logfile in tests to handle CI

In GitHub actions, the logfile handler fails to configure. The exact
cause is unclear, but it may be that there's an issue with
access to the default logfile or the default log directory. Put the
default log path into a tmpdir to fix this.

* Raise result size limit to 10MB

And fix `sys.getsizeof` on a bytestring to use `len` instead.

* WSPollingTask now has a close method and handle_incoming return False on remote disconnect.

The executor closes and reconnects WS on remote disconnect.

* Adding a long duration test.

* Relax version match constraints (#637)

* Relax version requirements

* Adding test scripts that launch endpoint and workers of mismatched python version for tests.

* Removing the `relax_manager_version_match` option.

The interchange will note an info line if there's some version mismatch and continue. This can later be problematic if the funcx versions do not match.

* Updating configs and README with note to update paths to user conda.

* Module import ordering fixes for isort

* Black auto-fixes

* Be explicit about how to create conda environments

* Make update_all fail when a component fails

* Adding a test that raises an exception on the worker side

* use unixy exit codes

* abort tests on failure rather than continuing fairly quietly - this makes a test failure more obvious

* Add another test to kill worker

* Minor format fix

* Remove unnecessary f-string

* logger -> log

Co-authored-by: Ben Clifford <[email protected]>

* Fix broken link to parsl docs

* (Re)Apply all autofixers

black, isort, and pyupgrade are all (re)converged on the files
modified under the task-cancel branch.

* Apply changes from 'main' to task-cancel

This is the result of
- `git checkout --patch main`
- manual edits to make each file agree with flake8 and other linting

Co-authored-by: Ben Galewsky <[email protected]>
Co-authored-by: Ben Clifford <[email protected]>
Co-authored-by: Yadu Nand Babuji <[email protected]>
Co-authored-by: Ryan Chard <[email protected]>
Co-authored-by: Wes Brewer <[email protected]>
Co-authored-by: Daniel S. Katz <[email protected]>
yadudoc added a commit to globus/globus-compute that referenced this pull request Jan 7, 2022
* Adding basic interfaces for task cancel

* Adding a BadCommand message type to report unrecognized commands

* FuncX worker will log a SIGTERM and exit

* Removing dead code in _cancel and adding better logging when a task cancel returns a redundant result

* Support for cancelling task on the manager

* Adding support for task_cancel in the interchange

* Adding a FuncXFuture that has a cancel() method

* Updating the __str__ and removing the __repr__

* Cleaner task_cancel on manager
* Replace worker when task is cancelled via terminating worker
* Remove task from internal tracking
* Minor refactoring

* Remove bad print statements

* New mechanism to trap cancelled tasks before they are dispatched.

* Adding tests for task_cancel

* Collection of minor fixes

* Fixed bad doc string from TaskCancelled exception
* Updated signal handler in funcx_worker to exit with log message on receiving SIGTERM
* Updated FuncXFuture to use cancel from parent class.

* Tighten exception handling

* Updating tests for task_cancel

* Fixed bad docstring

* Clarify log message.

A task_id is put on the task_cancel_pending_trap when it not found on a manager. It is possible that the task is already complete, in-flight, or waiting dispatch in the interchange's queues.

* Minor restructuring of an awkward conditional ladder

* Fixing whitespace issues from github conflict resolution

* Using sys.exit instead of exit

* Rename FuncXFuture -> HTEXFuture

* Updating logic in task dispatch logic to ensure that cancel messages follow task dispatch.
Updating log lines

* best_effort_cancel added as a replacement for cancel.

Added docstrings for best_effort_cancel.
cancel will raise a NotImplemented error nudging the user to try best_effort_cancel

* Changing log level.

* Best effort cancel delegates cancellation to the executor.

Executor relies on the futures base cancel method
Added exception handling around executors result setting to avoid cancelled tasks.

* Removed logging and moved test-cancel to funcx-endpoint

* Defaulting suppress_failure=True and better exception handling.

* Setting suppress_failure=True as default to ignore incoming connections from un-registered managers. This avoid lingering managers from a previous session causing a shutdown
* Added exception handling for potential race-condition between user invoked task cancellation, and task failure from system shutdown.

* Add a new start_remove_worker method that pairs with remove_worker

* Fix race-condition, and use start_remove_worker

* Both task cancel and task completion can call remove_task on the manager, this is now protected by a lock.

* Remove_task now updates self.task_done_counter

* Removed unnecessary logging

* Adding some logging to better track tasks that are stuck

* Updating log lines to use Task:<task_id> for easy grepping.

* Renaming loop var for clarity and updating logline for clarity

* Task cancel reconverge linting (#656)

* Fix linting on funcx_sdk/funcx/tests/

* Simplify instructions for installing endpoint secrets to cluster

* Fix CI and the way dependencies are specified

test and dev are extras for installing tools, no multiple requirements
files scattered across the repo.

CI workflows should not be installing multiple distinct applications
into a single venv and then trying to test. This muddies the waters
and makes the purpose and requirements of each CI step unclear.

Therefore, in addition to putting the requirement data where it really
belongs, this fixes the workflows to specify a better isolated test
configuration.

Workflow jobs are used more thoughtfully to produce isolated VMs for
distinct build and testing processes.

New daily tests run Monday-through-Friday in the AM for US timezones,
doing `safety check` and little else. Safety checking on dependencies
is done separately for funcx-sdk and funcx-endpoint.

The CI workflow for testing PRs and branches also does `safety check`,
but importantly the hourly test does not.

Notifications from the daily and hourly workflows are done in a
separate dedicated job. This means that new jobs can be added upstream
without needing distinct or rewritten notification logic.

The absence of any pytest run for the funcx-sdk package is more
obvious and identified properly as a problem. It is paired with the
"import test" so that when enabled, the import test removal will be
localized to the same area of the workflow file.

Linting is a dedicated job which runs as part of the CI workflow and
uses pre-commit on the full repo. No additional lint steps (e.g.
one-off flake8 runs) are used or should be needed.

Testing documentation is updated minimally to refer to the
installation of extras, rather than requirements files.

The doc site requirements for some (undocumented) reason are
installing funcx-sdk test requirements. This has been removed. Any
documentation requirements can continue to be specified in the doc
directory, or under a new `[docs]` extra if necessary. It should not
"inherit" the requirements of the testing process.

The codecov integration does not deliver value in the absence of good
unit testing and a test matrix of size greater than 1. Therefore,
until such a time as codecov is useful and a worthwhile configuration
can be identified, the integration has been removed.

The funcx-endpoint setup.py file was autoformatted with `black` to
simplify the formatting of the requirements data.

No workflow steps are allowed to use `actions/checkout@master`. This
is unnecessary and dangerous usage. `checkout@v2` is used -- as the
GitHub Action's own documentation instructs users to do -- instead.

* Port parsl BadRegistration error message clarification to funcx fork

This clarification was introduced as part of
Parsl/parsl#980

I have not examined the rest of that PR for relevance to this fork.

* Remove unused BadRegistration exception class

This looks like it was excessive copy-paste from Parsl source code which
has had an unused BadRegistration class since parsl PR
Parsl/parsl#1671

* Broaden htex interchange->executor error message phrasing

Previously the error message claimed that the fatal error was due to
version mismatch.

This is not true: this same fatal error may occur in the case of a
BadRegistration.

This PR rephrases the error to be less misleading.

I have encountered this repeatedly during review of #909, and at
least I think Yadu was getting it while hacking on this code.

* Expand daily build to all days

* Remove internal RedisQueue object and dependents

RedisQueue is now provided by funcx-common if needed.

1. Remove `funcx_endpoint.queues`
2. Remove `funcx_endpoint/tests/integration/test_redis.py`
   (this just tests the removed subpackage)
3. Remove `funcx_endpoint.mock_broker`, which relies on RedisQueue

(3) is the most questionable of these three. However, `mock_broker`
does not appear to be used anywhere. Furthermore, even if a mocked
Forwarder object is needed for some reason, there doesn't appear to be
any reason that it should be provided by the `funcx-endpoint` package.

* Update smoke test config with version, dev env

version=0.3.5

add a "dev" block to the config for ease of use.

black and isort on the file

* Make smoke tests check min version, not exact

Check against the API response with a loose version comparison, not an
equality check. This means that the smoke tests set a "floor" for what
versions are allowed, but do not fail when a release is done.

* Add a Polaris config example

* Flake8

* Removing comment # edtb-02

* Update linting config and fix

Remove flake8 ignores for various rules: only ignore the rules which
must be ignored to be black-compatible.

Remove exclusions from pre-commit config.

* Fix subject line in hourly smoke test

* Remove async testing until it can be done safely

* Adding a minimal s3 smoke test

* Adding funcx_endpoint to install list

* Marking test for large args to skip

* Adding fixme note to gh workflow, and minor fixes to tests.

* Remove all use of async code from smoke tests

Do not use the executor interface until it can be repaired. Use a
simple polling loop to wait for results.

Turn off throttling on the client so that we don't spuriously fail a
job.

* Begin login config rewrite

Structure logging config more tightly, removing odd naming and highly
dynamic, imperative configuration of loggers.

The new structure is that there is a single method which can be used
to reconfigure logging on an as-needed basis (e.g. to turn off logging
to stdio).

Logging is configured at each entrypoint, and in one special case when
a daemon is being launched.

All loggers are initialized in module scope as

    log = logging.getLogger(__name__)

Any logger creation not matching this exact pattern is considered
incorrect and is fixed to match.

* Update logging in Interchange

- Use setup_logging() to replace global logger object configured via
  set_file_logger()
- Cleanup CLI argument passing, as it relates to logging arguments
- Remove logging attributes from the Interchange object

This should, for the most part, not change the semantics of any code.
However, it is worth noting that it reduces the maxbytes on the
logfile from 256MB to 100MB. This could be adjusted back up by
allowing `setup_logging()` to take a parameter for maxbytes, but for
now this is not being done.

* Fix logging configuration in worker

Also cleanup use of loggers in various modules where impact is low.

* Fix various issues in logging refactor

log_max_bytes is no longer supported and needs to be removed.
logging_level similarly needs to be removed. console logging needs to
be disabled within the daemon.

* Remove logging helpers; fix "asyncio" logger

These were originally going to be left untouched, but the fact that
they were configuring logging for "asyncio" makes it hard to see a way
to salvage them.

The logger for "asyncio" _does not_ belong to funcx. We absolutely
should not be using that logger internally for funcx-related
activities.

Fix the references to the "asyncio" logger to use `__name__` as is
normal, and remove the imperative logging configuration helpers.

* Install funcx_endpoint for smoke tests

* Patch default logfile in tests to handle CI

In GitHub actions, the logfile handler fails to configure. The exact
cause is unclear, but it may be that there's an issue with
access to the default logfile or the default log directory. Put the
default log path into a tmpdir to fix this.

* Raise result size limit to 10MB

And fix `sys.getsizeof` on a bytestring to use `len` instead.

* WSPollingTask now has a close method and handle_incoming return False on remote disconnect.

The executor closes and reconnects WS on remote disconnect.

* Adding a long duration test.

* Relax version match constraints (#637)

* Relax version requirements

* Adding test scripts that launch endpoint and workers of mismatched python version for tests.

* Removing the `relax_manager_version_match` option.

The interchange will note an info line if there's some version mismatch and continue. This can later be problematic if the funcx versions do not match.

* Updating configs and README with note to update paths to user conda.

* Module import ordering fixes for isort

* Black auto-fixes

* Be explicit about how to create conda environments

* Make update_all fail when a component fails

* Adding a test that raises an exception on the worker side

* use unixy exit codes

* abort tests on failure rather than continuing fairly quietly - this makes a test failure more obvious

* Add another test to kill worker

* Minor format fix

* Remove unnecessary f-string

* logger -> log

Co-authored-by: Ben Clifford <[email protected]>

* Fix broken link to parsl docs

* (Re)Apply all autofixers

black, isort, and pyupgrade are all (re)converged on the files
modified under the task-cancel branch.

* Apply changes from 'main' to task-cancel

This is the result of
- `git checkout --patch main`
- manual edits to make each file agree with flake8 and other linting

Co-authored-by: Ben Galewsky <[email protected]>
Co-authored-by: Ben Clifford <[email protected]>
Co-authored-by: Yadu Nand Babuji <[email protected]>
Co-authored-by: Ryan Chard <[email protected]>
Co-authored-by: Wes Brewer <[email protected]>
Co-authored-by: Daniel S. Katz <[email protected]>

* isort fixes

Co-authored-by: Stephen Rosen <[email protected]>
Co-authored-by: Ben Galewsky <[email protected]>
Co-authored-by: Ben Clifford <[email protected]>
Co-authored-by: Ryan Chard <[email protected]>
Co-authored-by: Wes Brewer <[email protected]>
Co-authored-by: Daniel S. Katz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarify warning message when a manager goes missing
4 participants