
[not for merge] DESC/LSST combined feature branch #1993

Closed
wants to merge 75 commits into from

Conversation

@benclifford (Collaborator) commented Mar 17, 2021

This branch is ongoing work with @TomGlanzman on using parsl with the LSST DESC DR2 workflow. This PR exists to get it tested in parsl CI.

This branch is made from a stgit patch stack with regular merges to master. Don't push changes to this branch - rather, get them merged to master and poke @benclifford for an update.

Over time I'm migrating patches from this stack onto master and adding new material on the end - hopefully, eventually, the patch stack, and this branch with it, will evaporate entirely.

There was one at the parsl top level, and one in providers, but
these were not used uniformly:

* the monitoring system used the provider version even though
it has nothing to do with the provider system.

* the top level error was not listed in the API documentation
and then [B] lots of "tried to update NNNN rows, only did MMMM" errors.

hypothesis:

i) we need to retry on db locked? [yes we do] (a minimal retry sketch follows after this list)

At the sqlite3 command line that retry needs to happen, at least, so
maybe also in code? This would only really show up if you're looking in the db logs, and probably
only if you're also running queries against the DB (or some other access) rather than letting the
db manager be the only user.

ii) if those locked errors are causing rows to not be INSERTed (eg for tasks), then that means that
row cannot be UPDATEd later, which might then be causing symptom B. That symptom probably means
that all updates in that batch are discarded. This could happen because of symptom A, or because
other malformed data could not be inserted into the database.
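As an illustration of hypothesis i), here is a minimal retry-on-locked sketch. The retry count and backoff are made up for illustration and this is not the code actually used in the branch:

```python
# Illustrative only: retry a SQLite operation when it reports "database is locked".
import sqlite3
import time

def execute_with_retry(conn: sqlite3.Connection, sql: str, retries: int = 5) -> None:
    """Run a statement, retrying if SQLite reports the database as locked."""
    for attempt in range(retries):
        try:
            conn.execute(sql)
            conn.commit()
            return
        except sqlite3.OperationalError as e:
            if "locked" in str(e) and attempt < retries - 1:
                time.sleep(0.5 * (attempt + 1))  # back off, then try again
            else:
                raise
```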

concerns:
 i) cross-DB portability (probably a lost cause without further abstraction, because
    that isn't really a thing SQL does)

 ii) what happens if I'm looping on some more genuine error that is always present, and that
     causes some internal structures to overflow?

For the purposes of lsst-parsl, this patch is allowed to be a bit of a messy hack, to try to understand
how this problem can be fixed.


Problems with this patch:
  I'm seeing database lock errors when running CREATE TABLE fairly often
  (but awkward enough to reproduce that I can't do so reliably at the moment).
  This looks like sqlalchemy doing those ops somewhere that doesn't have the retry
  logic I added for inserts and updates. So this would manifest, I guess, if
  you were poking at the database before it had been properly initialised by the parsl
  monitoring code. I think that is probably a situation I care about.
__repr__ should be quasi-machine-readable, and __str__ human readable

See PR #1966, commit a423955, for
further context.

Before:

>>> str(e)
"(['mymod'], 'this test needs demonstrating')"
>>> repr(e)
"The functionality requested requires a missing optional module:['mymod'],  Reason:this test needs demonstrating"

After:

>>> str(e)
"The functionality requested requires missing optional modules ['mymod'], because: this test needs demonstrating"
>>> repr(e)
"OptionalModuleMissing(['mymod'], 'this test needs demonstrating')"
This PR adds type annotations, asserts and comments to better
describe the existing formats used for monitoring.

This is intended to make future simplification easier.

This PR should not change any behaviour except error handling when
a format is violated.
I was seeing delays of sometimes 30 seconds when running fresh_config.

I suspect this was automatic address detection, so I've removed that and used localhost,
which is fine for this test config.

This brings test run time to a consistent 11s against master on my laptop.
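A hedged sketch of the idea: pin the test executor's address to localhost instead of relying on automatic address detection. The exact parameters of the real fresh_config differ; only the address change is the point here.

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

def fresh_config():
    return Config(
        executors=[
            HighThroughputExecutor(
                label="htex_local",
                address="127.0.0.1",  # localhost: avoids slow automatic address detection
            )
        ]
    )
```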
…when running --config local tests. exex isn't needed for LSST, so disable testing of it.
…it isn't high enough priority for me to fix at the moment
This makes RESOURCE_INFO messages be sent around in the same format as other messages, which makes upcoming rearrangement of monitoring message delivery easier
This is groundwork for sending monitoring messages in on any queue,
and so via different protocols.
…atus delivery

TODO: needs to be switchable? At the moment, it is hard coded into the definition of an executor - which maybe makes sense when using HTEX, but not so much when a parsl user wants to choose between mechanisms (eg UDP vs FS). That switching also needs to be per executor.

for lsst purposes this needs to scale up to the scale of
1000 nodes running many workers (10s of) on each node.

so eg 50000 workers as a target?

50000 zmq TCP connections is A Lot

I'd like something that is not tied into the
htex codebase (which at the moment, monitoring is a bit,
and putting a messaging concentrator in there would
make it more tied in).

But a messaging concentrator per-node is what I need.
Perhaps I can re-use the existing htex queue, even,
using message types... "here's a message I want you to
forward to the messaging system in the interchange" ?

That would at least be a bit orthogonal? Or perhaps the
messaging engine (eg the UDPRadio bit) can be replaced
by an htex specific one, so you get UDPRadio in UDP
circumstances, or HTEXRadio in the htex case, if you
so configure it? That would be an ok plugin thing
to try out, and would at least allow switching
between UDP and HTEX while allowing very HTEX-specific
stuff to be implemented.

(and specifically, HTEX provides this kind of hierarchical
feature that parsl-core does not)
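A hedged sketch of that pluggable "radio" shape: a common send() interface, with a UDP implementation and a placeholder HTEX one. Class and method names here are illustrative, not an existing parsl API.

```python
import pickle
import socket
from abc import ABC, abstractmethod

class MonitoringRadio(ABC):
    @abstractmethod
    def send(self, message: object) -> None:
        ...

class UDPRadio(MonitoringRadio):
    def __init__(self, address: str, port: int) -> None:
        self.address = address
        self.port = port

    def send(self, message: object) -> None:
        # serialise the message and fire it at the monitoring hub as a datagram
        payload = pickle.dumps(message)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (self.address, self.port))

class HTEXRadio(MonitoringRadio):
    def send(self, message: object) -> None:
        # would hand the message to the local htex worker machinery, which
        # forwards it via the manager and interchange to monitoring
        raise NotImplementedError
```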

So additional communication this needs:

  i) monitoring worker wrapper: ability to communicate with the htex worker enough to send monitoring messages
        - needs some "examine my context" mechanism that is probably provided by globals: if you're using "htex" mode monitoring, inspect a global that you expect to exist (a minimal sketch of that global follows after this list)
        - and the htex worker should set up that global.
        - perhaps that also provides some scope for *inside apps* reporting state changes, which is something I was interested in

  ii) interchange<->manager<->htex worker: add in the ability to send monitoring messages (perhaps as a broader "new tag to deliver")

  iii) interchange -> monitoring: deliver those messages to monitoring
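A minimal sketch of the "inspect a global" idea from point i). The names here (_radio, set_radio, send_monitoring_message) are assumptions for illustration, not an existing parsl interface.

```python
_radio = None  # set by the htex worker process at startup, if configured

def set_radio(radio) -> None:
    global _radio
    _radio = radio

def send_monitoring_message(message) -> None:
    # silently do nothing if this process has no radio configured
    if _radio is not None:
        _radio.send(message)
```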
…sages

the file creation technique is similar to Maildir:
  create a file in tmp/ without any need to write atomically
  then move to new/ atomically

This means that the only messages appearing in new/ will be complete, and can be deleted by the receiver without conflicting with the sender.
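A hedged sketch of that Maildir-style handoff; the directory layout and the pickle encoding are assumptions for illustration.

```python
import os
import pickle
import uuid

def deliver(base_dir: str, message: object) -> None:
    os.makedirs(os.path.join(base_dir, "tmp"), exist_ok=True)
    os.makedirs(os.path.join(base_dir, "new"), exist_ok=True)
    name = uuid.uuid4().hex
    tmp_path = os.path.join(base_dir, "tmp", name)
    new_path = os.path.join(base_dir, "new", name)
    with open(tmp_path, "wb") as f:
        pickle.dump(message, f)          # no atomicity needed while writing
    os.rename(tmp_path, new_path)        # atomic: new/ only ever holds complete files
```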

I expect this will present interesting file system loads because of all the directory operations,
especially if everything is happening inside a single directory. The load will perhaps be comparable to each job creating a few stdout/stderr files inside one directory - unless resource monitoring is on, in which case it will be much higher.

so maybe it will be slower but more reliable than UDP.

I expect that there will be more out-of-order message delivery than with UDP - and I expect monitoring will deal badly with that (especially the notion of "first" RESOURCE_INFO messages).

the intention is to use someone else's infrastructure to deliver messages rather than implementing a new htex-style comms network just for monitoring
…e WQ

this is to check for me that wq+fs monitoring is ok

I'm not clear how it should be integrated into the CI test suite.
the ExtremeScale case will never fire in the current extreme scale implementation,
because an extreme scale executor is also a high throughput executor, and so the
earlier htex case will fire.

It is possible that extreme scale scaling was broken because of this case. This
patch should not make it either better or worse, because it only eliminates dead
code.

when an executor is not an htex instance, no cases match, but no error is raised
here, and so tasks_per_node is never assigned. Later on (line 206), the use of
tasks_per_node is an error.

this entire case is removed, and executor.workers_per_node is always used.
…nforced/documented in the executor base classes.

This patch makes it obligatory for StatusHandlingExecutors to have that, on the assumption
that StatusHandlingExecutor will generally become a scaling-capable base class.
Broadly, this involves adding in enough helpers that the scaling strategy knows
what's going on.

I think the status handling class probably also needs to include descriptions
(ideally machine enforceable) of the methods that I've had to add here.

Or the simple strategy should be made more resilient to their absence rather than
crashing.

PERHAPS OPEN AN ISSUE: "different kinds of executors, from a scaling perspective"

the wq monitoring config is modified to start with 0 init blocks, and let the scaling system bring up more blocks. At the time of writing this comment, that doesn't happen: wq launches init blocks only (once, at start) and scaling doesn't touch WQ (because the poll interval is set via the superclass to -1).

wq's default behaviour is one worker per node, but I want to run it in a mode where it figures out that it can run more things on a node at once.

"worker" has a different meaning here wrt htex: htex workers each run one task, so 100 workers = 100 tasks at once. wq is more dynamic and always runs one wq worker per node, but then that worker can run more parsl tasks.

so be careful with terminology here.
…ic ones) to capture final cumulative usage

although maybe that info isn't available any more if, for example, executables have completed? How does `time` do it?
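A hedged sketch of one way to get final cumulative usage of child processes after they have exited - roughly what `time` relies on: the kernel keeps accumulating the usage of reaped children, readable via getrusage().

```python
import resource

def final_child_usage() -> dict:
    ru = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "user_time_s": ru.ru_utime,
        "system_time_s": ru.ru_stime,
        "max_rss_kb": ru.ru_maxrss,  # kilobytes on Linux
    }
```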
…logs will now go to the parsl main log ***

*** so maybe I shouldn't do this patch after all ***

the reason was that I couldn't find where @wrap_with_logs errors were going in the interchange, but perhaps
they're going to the main log?

-----

make the interchange set up logs like the rest of parsl

this removes some code that isn't interchange specific, in favour of other similar but different code

it also means that logs made to parsl.* will go to the interchange log
for tomg, before release:
 i) disable (or option to disable) all resource monitoring

 ii) later, per-app enable/disable of resource monitoring, which will also maybe work around the joinapp vs threads problem
…be 0 tries.

this is probably coming about because there is always a try number (starting at 0)
even if no tries happen.

TODO: make a monitoring test to check this

TODO: fix this bug
…not sure which one causes which) is that

there is thread-related connection sharing, which is not permitted (Erlang-style thread-local connections rather than an
object-shared connection might help push on that, by removing the opportunity to share connections across
threads);
and the prepared state comes from a failed, not-rolled-back transaction earlier on.

Perhaps: an error happened. Then we tried to do another operation, hence the second error.
Then that tried to call rollback, but it was in a different thread (because it was in the sqlalchemy code,
which I guess cannot call rollbacks successfully on broken connections, because we don't know which thread
finalization will happen in).

2020-11-25 12:04:00 parsl.executors.high_throughput.executor:649 [INFO]  Attempting HighThroughputExecutor shutdown
2020-11-25 12:04:00 parsl.executors.high_throughput.executor:651 [INFO]  Finished HighThroughputExecutor shutdown attempt
2020-11-25 12:04:00 parsl.monitoring.monitoring:271 [INFO]  Terminating Monitoring Hub
2020-11-25 12:04:00 parsl.monitoring.monitoring:287 [INFO]  Waiting for Hub to receive all messages and terminate
2020-11-25 12:04:03 parsl.process_loggers:17 [INFO]  exception wrapper: normal ending for thread MainThread in process UNLOGGED
Exception during reset or similar
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 476, in _update
    self.db.update(table=table, columns=columns, messages=messages)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 76, in update
    self.session.bulk_update_mappings(mapper, mappings)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2860, in bulk_update_mappings
    mapper, mappings, True, False, False, False, False
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2876, in _bulk_save_mappings
    transaction = self.begin(subtransactions=True)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 938, in begin
    self.transaction = self.transaction._begin(nested=nested)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 316, in _begin
    self._assert_active()
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 282, in _assert_active
    "This session is in 'prepared' state; no further "
sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 530, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 46912496695104 and this is thread id 46912630470400.
Exception closing connection <sqlite3.Connection object at 0x2aaab2acc1f0>
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 476, in _update
    self.db.update(table=table, columns=columns, messages=messages)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 76, in update
    self.session.bulk_update_mappings(mapper, mappings)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2860, in bulk_update_mappings
    mapper, mappings, True, False, False, False, False
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2876, in _bulk_save_mappings
    transaction = self.begin(subtransactions=True)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 938, in begin
    self.transaction = self.transaction._begin(nested=nested)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 316, in _begin
    self._assert_active()
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 282, in _assert_active
    "This session is in 'prepared' state; no further "
sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 530, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 46912496695104 and this is thread id 46912630470400.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 536, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 46912496695104 and this is thread id 46912630470400.
2020-11-25 12:04:03 parsl.process_loggers:17 [INFO]  exception wrapper: normal ending for thread MainThread in process UNLOGGED
2020-11-25 12:04:03 parsl.dataflow.dflow:1055 [INFO]  DFK cleanup complete
This was disabled as part of the large set of tests disabled in the
PR #652 testpocalypse. But on review it *should* work.
…edly (on each poll?) in my

tutorial example.

What's going wrong?

2020-10-19 11:29:24.016 parsl.dataflow.flow_control:114 [ERROR]  Flow control callback threw an exception - logging and proceeding anyway
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/flow_control.py", line 112, in make_callback
    self.callback(tasks=self._event_buffer, kind=kind)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/task_status_poller.py", line 64, in poll
    self._update_state()
  File "/home/benc/parsl/src/parsl/parsl/dataflow/task_status_poller.py", line 71, in _update_state
    item.poll(now)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/task_status_poller.py", line 28, in poll
    self._status = self._executor.status()
  File "/home/benc/parsl/src/parsl/parsl/executors/status_handling.py", line 73, in status
    status = self._make_status_dict(job_ids, self._provider.status(job_ids))
  File "/home/benc/parsl/src/parsl/parsl/providers/local/local.py", line 86, in status
    str_ec = self._read_job_file(script_path, '.ec').strip()
  File "/home/benc/parsl/src/parsl/parsl/providers/local/local.py", line 143, in _read_job_file
    with open(path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/benc/parsl/src/parsl/runinfo/000/submit_scripts/parsl.localprovider.1603103227.0265722.sh.ec
… that

f(3) run multiple times has the same hashsum
and
f(4) has a different hashsum to f(3)

in addition to the existing way of checking that runs are not repeated

because now we have access to the task record in tests, via future.task_def
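A hedged sketch of the checks described above, assuming a config with caching enabled has already been loaded and that the task record exposed via future.task_def carries a 'hashsum' entry (the field name is assumed here for illustration).

```python
from parsl import python_app

@python_app(cache=True)
def f(x):
    return x * 2

def test_hashsum_stability():
    a, b, c = f(3), f(3), f(4)
    for fut in (a, b, c):
        fut.result()
    assert a.task_def['hashsum'] == b.task_def['hashsum']   # same args, same hashsum
    assert a.task_def['hashsum'] != c.task_def['hashsum']   # different args, different hashsum
```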
…n the monitoring DB. I

vaguely remember this happening and fixing it a while ago, but maybe only in this branch, not
lsst-dm-202005?

TODO: write a regression test for this using sql queries and looking at the task record hashsum 
(and checking two different invocations get two different hashsums)
#1848)

> These changes look good to me. I am seeing two separate states in the plots, but it is still showing up as green for both exec_done and memo_done on the workflow page for me. Might be worth checking that.

so I should dig into that visual style
ipp isn't used any more

so remove the hack

and see if the problem appears in other places
…cal hostname

eg workers on htex could set environment variables that give the executor name, manager ID, worker number, and block id

feature req from tomg - he specifically asked for block id
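A hedged sketch of what that could look like from inside an app; the environment variable names are assumptions, not an existing interface.

```python
import os

def where_am_i() -> dict:
    return {
        "executor": os.environ.get("PARSL_EXECUTOR_LABEL"),
        "block_id": os.environ.get("PARSL_BLOCK_ID"),
        "manager_id": os.environ.get("PARSL_MANAGER_ID"),
        "worker_number": os.environ.get("PARSL_WORKER_RANK"),
    }
```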
…repr__ed to the monitoring database (see the recent PR I made that writes out using __repr__ not __str__ - they lack context)

Look at the task failures in a monitoring.db coming from running the test suite to see examples of some which do not look right - i.e. ones that look like a string, not Classname(FooDetail)
this is not intended to be exact, in the sense that a job with a lower priority might run
before a job with a higher priority - but the "bulk" of the work (in the LSST sense)
should be prioritised this way.

priorities can be anything comparable to each other (and to the default priority, which is integer 0)

I'm not going to address the macsafequeue in this prototype
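A hedged sketch of the ordering rule: priorities only need to be comparable with each other and with the default priority 0. The 'priority' key name, and whether a smaller value means more or less urgent, are illustrative choices here.

```python
import heapq

DEFAULT_PRIORITY = 0

def in_priority_order(tasks):
    """Yield tasks smallest priority value first; ties keep submission order."""
    heap = [(t.get("priority", DEFAULT_PRIORITY), i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    while heap:
        _, _, task = heapq.heappop(heap)
        yield task
```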
…received the ctrl-C and is working towards ending.

This would be fine as an ERROR level log message, I think, because it is an abort of the workflow.

I'm not sure what the ctrl-C reporting of this looks like right now... so I should review that before changing anything else
@benclifford (Collaborator, Author) commented Apr 20, 2021

superseded by #2012
