
Reloading default HighThroughputExecutor after parsl.clear() results in KeyError: 'block_id' exception #2871

Open
d33bs opened this issue Aug 22, 2023 · 5 comments

@d33bs

d33bs commented Aug 22, 2023

Describe the bug

Thanks for the great work on Parsl - it's an awesome tool! When attempting to reload a HighThroughputExecutor configuration after calling parsl.clear(), an exception occurs: KeyError: 'block_id'. This doesn't seem to happen with ThreadPoolExecutor. I'm unsure whether this is expected behavior, so I'm raising a bug in case it helps diagnose or address the issue.

To Reproduce
Run the following Python snippet:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

local_htex = Config(
    executors=[HighThroughputExecutor()],
)
parsl.load(local_htex)
parsl.clear()
parsl.load(local_htex)

Expected behavior
I'd expect behavior like that of the ThreadPoolExecutor in the following example, where the reload succeeds. If that isn't possible or is unreasonable to expect, I'd expect documentation or an upfront log message describing workarounds.

import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

local_threaded = Config(
    executors=[ThreadPoolExecutor()],
)
parsl.load(local_threaded)
parsl.clear()
parsl.load(local_threaded)

Environment

  • OS: macOS
  • Python: 3.9
  • Parsl: 2023.08.21

Distributed Environment

  • Where are you running the Parsl script from ? Laptop/Workstation
  • Where do you need the workers to run ? Same as Parsl script
@benclifford

One thing in your example that sticks out to me is that you can't use the same Config object twice.

In the case of the thread pool executor this probably doesn't matter, but it's likely to be more of a problem with htex.

There's a style for re-loading "the same" config, used in (for example) the test suite and some applications I work on, that looks something like this:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

def fresh_config():
    # construct a brand-new Config (and executor) instance on every call
    return Config(executors=[HighThroughputExecutor()])

parsl.load(fresh_config())
parsl.clear()
parsl.load(fresh_config())

Probably we should enforce this more strictly - single use wasn't part of the initial design of how Config was meant to work, but it has since turned out to be a property of Config objects (and the sub-objects contained inside a Config).

Can you try something like the above and see how things go?

@benclifford

benclifford commented Aug 23, 2023

I see in your comment on the CytoTable issue cytomining/CytoTable#90 (comment) that you have that sort of behaviour as a recommended change - for a user to generate a new config in each for-loop iteration.

Either that, or (I'm not sure how your code is organised for this) make it so that Parsl/the DFK can be initialised outside the loop and not reloaded on each iteration.

I've found it a bit easier to understand why things work that way by thinking of a Config not as "here is our configuration-file-style specification", but as "here is a bundle of the actual instantiated components that will be started and terminated by parsl.load()".

@d33bs

d33bs commented Aug 24, 2023

Thank you @benclifford for the recommendations and clarity here! Is there a way to compare Parsl Config objects so that one could avoid "accidentally" reloading a configuration? I recognize a deep object comparison might accomplish this, but I wondered whether each configuration is assigned something like a GUID that could provide this capability.

@benclifford

So if you want to say "we have a DFK loaded right now, and someone just gave me a config object ... is it the same config object? and if not, do a Parsl shutdown/restart with the new config", then I would say you won't be able to compare any better than simple Python object identity, using the is operator.
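For illustration, a minimal sketch of that identity check - _active_config and maybe_reload are hypothetical names for this sketch, not part of Parsl's API:

import parsl
from parsl.config import Config

# hypothetical module-level record of the most recently loaded config
_active_config = None

def maybe_reload(new_config: Config) -> None:
    # reload Parsl only when the caller supplies a different Config object
    global _active_config
    if new_config is _active_config:
        return  # same object identity: this config's DFK is already running
    if _active_config is not None:
        parsl.dfk().cleanup()  # shut down the currently loaded DFK
        parsl.clear()
    parsl.load(new_config)
    _active_config = new_config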

For the example in CytoTable issue 90:

for plate_folder in sqlite_dir.iterdir():
    output_path = pathlib.Path(f"{output_dir}/{plate_folder.stem}_converted.parquet")
    # merge single cells and output as parquet file
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        metadata=["image"],
        compartments=["nuclei"],
        identifying_columns=["ImageNumber"],
        joins=preset_join,
        parsl_config=Config(
            executors=[HighThroughputExecutor()],
        ),
        chunk_size=10000,
    )

Probably, for that to work, the user should create one Config object outside the for loop, store it in a variable, and pass that variable as parsl_config=my_single_config, if they want to use one single config for all of the Parsl work.
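A minimal sketch of that shape, assuming the same surrounding variables and convert signature as the excerpt above (my_single_config is an assumed variable name):

# one Config object, created once outside the loop and reused on every call
my_single_config = Config(
    executors=[HighThroughputExecutor()],
)

for plate_folder in sqlite_dir.iterdir():
    output_path = pathlib.Path(f"{output_dir}/{plate_folder.stem}_converted.parquet")
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        metadata=["image"],
        compartments=["nuclei"],
        identifying_columns=["ImageNumber"],
        joins=preset_join,
        parsl_config=my_single_config,  # same object every iteration
        chunk_size=10000,
    )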

Then the CytoTable code can compare the parsl_config just passed in with the config object of the currently active Parsl instance, and switch configs only if the user has supplied a new one on a subsequent call.

If you avoid shutting down and restarting Parsl each time, I think you'll also get some speedup from not paying Parsl's startup and shutdown costs on each loop iteration - depending on circumstances, that might or might not be significant.

However, this style feels a bit weird to me, and I'd encourage you to think about whether you can also initialise Parsl outside such a loop - I guess it mostly depends on how much you expect your users to change configs as a loop progresses.
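As a rough sketch of that load-once shape using plain Parsl - process_plate is a hypothetical stand-in for the per-iteration work, and sqlite_dir is as in the excerpt above:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

@python_app
def process_plate(path: str) -> str:
    # hypothetical per-plate work, executed on an htex worker
    return f"processed {path}"

# load Parsl once, before the loop
parsl.load(Config(executors=[HighThroughputExecutor()]))

# submit all iterations against the single loaded DFK, then collect results
futures = [process_plate(str(p)) for p in sqlite_dir.iterdir()]
results = [f.result() for f in futures]

# shut down once, after all work completes
parsl.dfk().cleanup()
parsl.clear()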

@d33bs

d33bs commented Aug 24, 2023

Many thanks @benclifford for the recommendations here! I'll move forward with using Parsl as you described.

benclifford pushed a commit that referenced this issue Oct 24, 2023
Clarifies problems users found in the documentation:

  • error in parallel documentation #2915: Decorators are not needed for ParslPoolExecutor.map
  • Reloading default HighThroughputExecutor after parsl.clear() results in KeyError: 'block_id' exception #2871: Configs may not be re-used
  • Fixes Docs: Further clarification on how serialization impacts where imports need to be #2918: Explain when functions need imports
  • Fixes Allowing for inputs and outputs to take immutable types #2925: Explain mutable types are undesirable

Thanks, @Andrew-S-Rosen