[not for merge] DESC/LSST combined feature branch #1993
Closed
Conversation
There was one at the parsl top level, and one in providers, but these were not used uniformly:
* the monitoring system used the provider version, even though it is nothing to do with the provider system.
* the top level error was not listed in the API documentation.
and then [B] lots of "tried to update NNNN rows, only did MMMM" errors.

Hypothesis:
i) we need to retry on "database is locked"? [yes we do] At the sqlite3 command line that needs to happen, at least, so maybe also in code? This would only really show up if you're looking in the db logs, and probably only if you're also running queries against the DB (or some other access) rather than letting the db manager be the only user.
ii) if those locked errors are causing rows not to be INSERTed (eg for tasks), then that row cannot be UPDATEd later, which might then be causing symptom B. That symptom probably means that all updates in that batch are discarded. This could happen because of symptom A, or because other malformed data could not be inserted into the database.

Concerns:
i) cross-DB portability (probably a lost cause without further abstraction, because retrying on locks isn't really a thing SQL does)
ii) what happens if I'm looping on some more genuine error that is always present and that causes some internal structures to overflow?

For the purposes of lsst-parsl, this patch is allowed to be a bit of a messy hack, to try to understand how this problem can be fixed.

Problems with this patch: I'm seeing database lock errors when running CREATE TABLE fairly often (but awkward enough to reproduce that I'm not confident in reproducing at the moment). This looks like sqlalchemy doing those ops somewhere that doesn't have the retry logic I added for inserts and updates. So this would manifest, I guess, if you were poking at the database before it had been properly initialised by the parsl monitoring code. I think that is probably a situation I care about.
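As a rough illustration of the retry idea discussed above (not the actual patch), a wrapper like the following could retry an operation when sqlite reports "database is locked"; the function name, parameters, and back-off values are made up for this sketch:

```python
import time
import sqlalchemy.exc

def retry_on_locked(op, max_tries=5, delay=1.0):
    """Call op(); retry a few times if sqlite reports 'database is locked'."""
    for attempt in range(max_tries):
        try:
            return op()
        except sqlalchemy.exc.OperationalError as e:
            if "database is locked" in str(e) and attempt < max_tries - 1:
                time.sleep(delay)  # back off and let the competing writer finish
            else:
                raise

# usage (illustrative): retry_on_locked(lambda: session.execute(insert_stmt))
```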
…y to do per-site testing.
__repr__ should be quasi-machine-readable, and __str__ human readable. See PR #1966, commit a423955, for further context.

Before:
>>> str(e)
"(['mymod'], 'this test needs demonstrating')"
>>> repr(e)
"The functionality requested requires a missing optional module:['mymod'], Reason:this test needs demonstrating"

After:
>>> str(e)
"The functionality requested requires missing optional modules ['mymod'], because: this test needs demonstrating"
>>> repr(e)
"OptionalModuleMissing(['mymod'], 'this test needs demonstrating')"
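For illustration, a minimal sketch of that convention (not necessarily the exact implementation in the PR) looks like this:

```python
class OptionalModuleMissing(Exception):
    def __init__(self, module_names, reason):
        self.module_names = module_names
        self.reason = reason

    def __repr__(self):
        # quasi-machine-readable: looks like the constructor call
        return f"OptionalModuleMissing({self.module_names!r}, {self.reason!r})"

    def __str__(self):
        # human readable
        return (f"The functionality requested requires missing optional "
                f"modules {self.module_names}, because: {self.reason}")
```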
This PR adds type annotations, asserts and comments to better describe the existing formats used for monitoring. This is intended to make future simplification easier. This PR should not change any behaviour except error handling when a format is violated.
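As a hedged illustration of the kind of annotation and assert this describes (the names, enum members, and message shape here are assumptions for the sketch, not parsl's actual definitions):

```python
from enum import Enum
from typing import Any, Dict, Tuple

class MessageType(Enum):
    TASK_INFO = 0
    RESOURCE_INFO = 1
    NODE_INFO = 2

# a monitoring message is assumed here to be a (type-tag, payload-dict) pair
MonitoringMessage = Tuple[MessageType, Dict[str, Any]]

def check_message(msg: MonitoringMessage) -> None:
    # fail loudly when a message violates the expected format
    tag, payload = msg
    assert isinstance(tag, MessageType), f"bad message tag: {tag!r}"
    assert isinstance(payload, dict), f"bad message payload: {payload!r}"
```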
I was seeing delays of sometimes 30 seconds when running fresh_config. I suspect this was auto address config, so I've removed that and use localhost, which is fine for this test config. This moves test run time to a consistent 11s against master on my laptop.
…when running --config local tests. exex isn't needed for LSST, so disable testing of it.
…it isn't high enough priority for me to fix at the moment
This makes RESOURCE_INFO messages be sent around in the same format as other messages, which makes the upcoming rearrangement of monitoring message delivery easier.
This is groundwork for sending monitoring messages in over any queue, and so via different protocols.
…atus delivery

TODO: needs to be switchable? At the moment, it is hard coded into the definition of an executor - which maybe makes sense when using HTEX, but not so much when a parsl user wants to choose between mechanisms (eg UDP vs FS). That switching also needs to be per executor.

For LSST purposes this needs to scale up to 1000 nodes running many workers (tens of) on each node, so eg 50000 workers as a target? 50000 zmq TCP connections is A Lot.

I'd like something that is not tied into the htex codebase (which monitoring is a bit, at the moment, and putting a messaging concentrator in there would make it more tied in). But a messaging concentrator per-node is what I need. Perhaps I can re-use the existing htex queue, even, using message types... "here's a message I want you to forward to the messaging system in the interchange"? That would at least be a bit orthogonal. Or perhaps the messaging engine (eg the UDPRadio bit) can be replaced by an htex-specific one, so you get UDPRadio in UDP circumstances, or HTEXRadio in the htex case, if you so configure it? That would be an OK plugin thing to try out, which would at least allow switching between UDP and HTEX while allowing very-HTEX-specific stuff to be implemented. (And specifically, HTEX provides this kind of hierarchical feature that parsl-core does not.) A rough sketch of that plugin shape follows this comment.

So additional communication this needs:
i) monitoring worker wrapper: ability to communicate with the htex worker enough to send monitoring messages - needs some "examine my context" that is probably provided by globals: if you're using "htex" mode monitoring, inspect a global that you expect to exist - and the htex worker should set that up. Perhaps that also provides some scope for *inside apps* reporting state changes, which is something I was interested in.
ii) interchange <-> manager <-> htex worker: add in the ability to send monitoring messages (perhaps as a broader "new tag to deliver")
iii) interchange -> monitoring: deliver those messages to monitoring
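A rough sketch of the pluggable "radio" idea described above: a common send() interface, with a UDP implementation and an HTEX-specific implementation selected by configuration. Class names, constructor parameters, and the message tag here are assumptions for illustration, not parsl's API.

```python
import pickle
import socket

class MonitoringRadio:
    def send(self, message) -> None:
        raise NotImplementedError

class UDPRadio(MonitoringRadio):
    def __init__(self, host: str, port: int):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, message) -> None:
        # fire-and-forget datagram straight to the monitoring hub
        self.sock.sendto(pickle.dumps(message), self.addr)

class HTEXRadio(MonitoringRadio):
    def __init__(self, result_queue):
        # reuse the executor's existing worker -> interchange queue, tagging
        # the payload so the interchange can forward it to monitoring
        self.result_queue = result_queue

    def send(self, message) -> None:
        self.result_queue.put(("MONITORING", pickle.dumps(message)))
```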
…sages

The file creation technique is similar to Maildir: create a file in tmp/ without any need to write atomically, then move it to new/ atomically. This means that the only messages appearing in new/ will be complete, and they can be deleted by the receiver without conflicting with the sender.

I expect this will present interesting file system loads because of all the directory operations, especially if everything is happening inside a single directory. The load will perhaps be comparable to each job creating a few stdout/stderr files inside one directory - unless resource monitoring is on, in which case the load is much higher. So maybe it will be slower but more reliable than UDP.

I expect there will be more out-of-order message delivery than with UDP - and I expect monitoring will deal badly with that (especially the notion of the "first" RESOURCE_INFO message).

The intention is to use someone else's infrastructure to deliver messages rather than implementing a new htex-style comms network just for monitoring.
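A minimal sketch of that Maildir-like delivery, assuming a base directory with tmp/ and new/ subdirectories (paths and naming are illustrative): write the message into tmp/ with no atomicity requirement, then os.rename() it into new/, which is atomic on the same filesystem, so a receiver only ever sees complete files in new/.

```python
import os
import pickle
import uuid

def send_message(base_dir: str, message) -> None:
    os.makedirs(os.path.join(base_dir, "tmp"), exist_ok=True)
    os.makedirs(os.path.join(base_dir, "new"), exist_ok=True)
    unique = uuid.uuid4().hex
    tmp_path = os.path.join(base_dir, "tmp", unique)
    new_path = os.path.join(base_dir, "new", unique)
    with open(tmp_path, "wb") as f:
        f.write(pickle.dumps(message))  # partial writes only ever visible in tmp/
    os.rename(tmp_path, new_path)       # atomic publish to the receiver
```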
…e WQ

This is to check, for me, that WQ + filesystem monitoring is OK. I'm not clear how it should be integrated into the CI test suite.
The ExtremeScale case will never fire in the current ExtremeScale implementation, because an extreme scale executor is also a high throughput executor, and so the earlier htex case will fire. It is possible that extreme scale scaling was broken because of this case. This patch should not make it either better or worse, because it only eliminates dead code.

When an executor is not an htex instance, no cases match, but no error is raised here, and so tasks_per_node is never assigned. Later on (line 206) use of tasks_per_node is an error.

This entire case is removed, and executor.workers_per_node is always used.
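A toy illustration of why that branch is dead code (the classes here are stand-ins, not the real executors): because the extreme scale executor subclasses the high throughput one, the earlier isinstance check always matches first.

```python
class HighThroughputExecutor: ...
class ExtremeScaleExecutor(HighThroughputExecutor): ...

def tasks_per_node(executor):
    if isinstance(executor, HighThroughputExecutor):
        return "htex branch"           # always taken, even for ExtremeScale instances
    elif isinstance(executor, ExtremeScaleExecutor):
        return "extreme scale branch"  # unreachable
    # for any other executor nothing is assigned/returned, which fails later on
```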
…nforced/documented in the executor base classes.

This patch makes it obligatory for StatusHandlingExecutors to have that, on the assumption that StatusHandlingExecutor will become generally a scaling-capable base class.
Broadly, this involves adding in enough helpers that the scaling strategy knows what's going on.

I think the status handling class probably needs to also include descriptions (ideally machine-enforceable) of the methods that I've had to add here. Or the simple strategy should be made more resilient to their absence rather than crashing. PERHAPS OPEN AN ISSUE: "different kinds of executors, from a scaling perspective"

The wq monitoring config is modified to start with 0 init blocks, and let the scaling system bring up more blocks. At the time of writing this comment, that doesn't happen: wq launches init blocks only (once, at start) and scaling doesn't touch WQ (because the poll interval is set via the superclass to -1).

WQ default behaviour is one worker per node, but I want to run it in a mode where it figures out that it can run more things on a node at once. "Worker" has a different meaning here wrt htex: htex workers each run one task, so 100 workers = 100 tasks at once. WQ is more dynamic and always runs one wq worker per node, but then that worker can run more parsl tasks. So be careful with terminology here.
…ic ones) to capture final cumulative usage - although maybe that info isn't available any more if, for example, executables have completed? How does `time` do it?
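For reference on that last question: `time` gets cumulative child usage from the kernel via wait4()/getrusage(RUSAGE_CHILDREN), which remains available after the children have exited. A quick Python sketch of the same idea (the child command is arbitrary; ru_maxrss units are platform-dependent, kilobytes on Linux):

```python
import resource
import subprocess

subprocess.run(["sleep", "1"])
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
print("children user CPU:", usage.ru_utime, "s; max RSS:", usage.ru_maxrss)
```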
…logs will now go to the parsl main log

*** so maybe I shouldn't do this patch after all ***

The reason was because I couldn't find where @wrap_with_logs errors were going with the interchange, but perhaps they're going to the main log?

-----

Make the interchange set up logs like the rest of parsl. This removes some code that isn't interchange-specific, in favour of other similar but different code. It also means that logs made to parsl.* will go to the interchange log.
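A sketch of what "set up logs like the rest of parsl" means here: attach a file handler to the top-level "parsl" logger inside the interchange process, so anything logged under parsl.* (including @wrap_with_logs errors) lands in the interchange log. The handler and format choices below are illustrative, not the exact code.

```python
import logging

def start_file_logger(filename: str, level=logging.DEBUG) -> logging.Logger:
    logger = logging.getLogger("parsl")   # parent of all parsl.* loggers
    logger.setLevel(level)
    handler = logging.FileHandler(filename)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s:%(lineno)d [%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    return logger
```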
…be 0 tries.

This is probably coming about because there is always a try number (starting at 0) even if no tries happen.

TODO: make a monitoring test to check this
TODO: fix this bug
…not sure which one causes which) is that there is thread-related connection sharing, which is not permitted. (Erlang-style local connections, rather than object-shared connections, might help push on that, by removing the opportunity to share connections across threads.)

The 'prepared' state comes from a failed, not-rolled-back transaction earlier on. Perhaps: an error happened; then we tried to do another operation, hence this error; then that tried to call rollback, but it was in a different thread (because it was in the sqlalchemy code which cannot call rollbacks successfully, I guess, on broken connections, because we don't know which thread finalization will happen in).

2020-11-25 12:04:00 parsl.executors.high_throughput.executor:649 [INFO] Attempting HighThroughputExecutor shutdown
2020-11-25 12:04:00 parsl.executors.high_throughput.executor:651 [INFO] Finished HighThroughputExecutor shutdown attempt
2020-11-25 12:04:00 parsl.monitoring.monitoring:271 [INFO] Terminating Monitoring Hub
2020-11-25 12:04:00 parsl.monitoring.monitoring:287 [INFO] Waiting for Hub to receive all messages and terminate
2020-11-25 12:04:03 parsl.process_loggers:17 [INFO] exception wrapper: normal ending for thread MainThread in process UNLOGGED
Exception during reset or similar
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 476, in _update
    self.db.update(table=table, columns=columns, messages=messages)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 76, in update
    self.session.bulk_update_mappings(mapper, mappings)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2860, in bulk_update_mappings
    mapper, mappings, True, False, False, False, False
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2876, in _bulk_save_mappings
    transaction = self.begin(subtransactions=True)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 938, in begin
    self.transaction = self.transaction._begin(nested=nested)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 316, in _begin
    self._assert_active()
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 282, in _assert_active
    "This session is in 'prepared' state; no further "
sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 530, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 46912496695104 and this is thread id 46912630470400.
Exception closing connection <sqlite3.Connection object at 0x2aaab2acc1f0>
Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 476, in _update
    self.db.update(table=table, columns=columns, messages=messages)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/parsl/monitoring/db_manager.py", line 76, in update
    self.session.bulk_update_mappings(mapper, mappings)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2860, in bulk_update_mappings
    mapper, mappings, True, False, False, False, False
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2876, in _bulk_save_mappings
    transaction = self.begin(subtransactions=True)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 938, in begin
    self.transaction = self.transaction._begin(nested=nested)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 316, in _begin
    self._assert_active()
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 282, in _assert_active
    "This session is in 'prepared' state; no further "
sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 680, in _finalize_fairy
    fairy._reset(pool)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 867, in _reset
    pool._dialect.do_rollback(self)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 530, in do_rollback
    dbapi_connection.rollback()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 46912496695104 and this is thread id 46912630470400.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 270, in _close_connection
    self._dialect.do_close(connection)
  File "/global/homes/d/descdm/.conda/envs/parsl-lsst-dm/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 536, in do_close
    dbapi_connection.close()
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 46912496695104 and this is thread id 46912630470400.

2020-11-25 12:04:03 parsl.process_loggers:17 [INFO] exception wrapper: normal ending for thread MainThread in process UNLOGGED
2020-11-25 12:04:03 parsl.dataflow.dflow:1055 [INFO] DFK cleanup complete
This was disabled as part of the large set of tests disabled in the PR #652 testpocalypse. But on review it *should* work.
…edly (on each poll?) in my tutorial example. What's going wrong?

2020-10-19 11:29:24.016 parsl.dataflow.flow_control:114 [ERROR] Flow control callback threw an exception - logging and proceeding anyway
Traceback (most recent call last):
  File "/home/benc/parsl/src/parsl/parsl/dataflow/flow_control.py", line 112, in make_callback
    self.callback(tasks=self._event_buffer, kind=kind)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/task_status_poller.py", line 64, in poll
    self._update_state()
  File "/home/benc/parsl/src/parsl/parsl/dataflow/task_status_poller.py", line 71, in _update_state
    item.poll(now)
  File "/home/benc/parsl/src/parsl/parsl/dataflow/task_status_poller.py", line 28, in poll
    self._status = self._executor.status()
  File "/home/benc/parsl/src/parsl/parsl/executors/status_handling.py", line 73, in status
    status = self._make_status_dict(job_ids, self._provider.status(job_ids))
  File "/home/benc/parsl/src/parsl/parsl/providers/local/local.py", line 86, in status
    str_ec = self._read_job_file(script_path, '.ec').strip()
  File "/home/benc/parsl/src/parsl/parsl/providers/local/local.py", line 143, in _read_job_file
    with open(path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/benc/parsl/src/parsl/runinfo/000/submit_scripts/parsl.localprovider.1603103227.0265722.sh.ec'
… that f(3) run multiple times has the same hashsum, and f(4) has a different hashsum to f(3), in addition to the existing way of checking that runs are not repeated, because now we have access to the task record in tests via future.task_def
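A sketch of the kind of check this describes, assuming a cached python_app f, the local_threads test config, and that a future exposes its task record as future.task_def with a 'hashsum' field (those last two names are taken from the text above; treat them as assumptions):

```python
import parsl
from parsl import python_app
from parsl.tests.configs.local_threads import config  # assumed test config

parsl.load(config)

@python_app(cache=True)
def f(x):
    return x * 2

a, b, c = f(3), f(3), f(4)
[fut.result() for fut in (a, b, c)]

assert a.task_def['hashsum'] == b.task_def['hashsum']   # same args => same memo key
assert a.task_def['hashsum'] != c.task_def['hashsum']   # different args => different key
```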
…n the monitoring DB.

I vaguely remember this happening and fixing it a while ago, but maybe only in this branch, not lsst-dm-202005?

TODO: write a regression test for this using SQL queries and looking at the task record hashsum (and checking that two different invocations get two different hashsums)
#1848)

> These changes look good to me. I am seeing two separate states in the plots, but it is still showing up as green for both exec_done and memo_done on the workflow page for me. Might be worth checking that.

so I should dig into that visual style
ipp isn't used any more so remove the hack and see if the problem appears in other places
…cal hostname

eg workers on htex could set an environment variable that gives executor name, manager ID, worker number, block ID, for example

feature request from tomq - he specifically asked for block ID
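A sketch of that feature request: the worker exports its identity into the environment before running tasks, so an app (or anything it launches) can read it. The variable names below are made up for illustration, not an agreed interface.

```python
import os

def export_worker_identity(executor_label: str, block_id, manager_id, worker_id) -> None:
    # set before task execution so apps inherit these values
    os.environ["PARSL_EXECUTOR"] = executor_label
    os.environ["PARSL_BLOCK_ID"] = str(block_id)
    os.environ["PARSL_MANAGER_ID"] = str(manager_id)
    os.environ["PARSL_WORKER_ID"] = str(worker_id)
```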
…repr__ed to the monitoring database (see the recent PR I made that writes out using __repr__ not __str__ - they lack context)

Look at the task failures in a monitoring.db coming from running the test suite to see examples of some which do not look right - i.e. ones that look like a string, not Classname(FooDetail)
This is not intended to be exact, in the sense that a job with a lower priority might run before a job with a higher priority - but the "bulk" of the work (in the LSST sense) should be prioritised this way.

Priorities can be anything comparable to each other (and to the default priority, which is integer 0).

I'm not going to address the macsafequeue in this prototype.
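A rough sketch of how comparable priorities with a default of 0 might feed a heap-based queue. The ordering direction (smaller value dequeued first) is an assumption here for illustration, not necessarily what the prototype does.

```python
import heapq
import itertools

class PriorityTaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-break so tasks never compare directly

    def put(self, task, priority=0):
        # priority only has to be comparable with other priorities and with 0
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def get(self):
        priority, _, task = heapq.heappop(self._heap)
        return task
```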
…received the ctrl-C and is working towards ending.

This would be fine as an ERROR level log message, I think, because it is an abort of the workflow.

I'm not sure what the ctrl-C reporting of this looks like right now... so I should review that before changing anything else.
superseded by #2012
This branch is ongoing work with @TomGlanzman for using parsl with the LSST DESC DR2 workflow. This PR is to get it tested in parsl CI.
This branch is made from an stgit patch stack with regular merges to master. Don't push changes to this branch - rather, get them put on master and poke @benclifford for an update. I'm migrating patches from this stack onto master over the years and adding new stuff on the end - hopefully eventually the patch stack, and this branch with it, should evaporate entirely.