
Reloading default HighThroughputExecutor after parsl.clear() results in KeyError: 'block_id' exception #2871

Open
d33bs opened this issue Aug 22, 2023 · 5 comments

@d33bs

d33bs commented Aug 22, 2023

Describe the bug

Thanks for the great work on Parsl - it's an awesome tool! When attempting to reload a HighThroughputExecutor configuration after calling parsl.clear(), an exception occurs: KeyError: 'block_id'. This doesn't seem to happen with ThreadPoolExecutor. I'm unsure whether this is expected behavior, so I'm raising a bug in case it helps diagnose or address the issue.

To Reproduce
Run the following Python snippet:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

local_htex = Config(
    executors=[HighThroughputExecutor()],
)
parsl.load(local_htex)
parsl.clear()
parsl.load(local_htex)

Expected behavior
I'd expect behavior like that of the ThreadPoolExecutor in the following example, where the reload succeeds. If that isn't possible or is unreasonable to expect, I'd expect documentation or an upfront log message describing workarounds.

import parsl
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

local_threaded = Config(
    executors=[ThreadPoolExecutor()],
)
parsl.load(local_threaded)
parsl.clear()
parsl.load(local_threaded)

Environment

  • OS: macOS
  • Python: 3.9
  • Parsl: 2023.08.21

Distributed Environment

  • Where are you running the Parsl script from ? Laptop/Workstation
  • Where do you need the workers to run ? Same as Parsl script
@benclifford

One thing in your example that sticks out to me is that you can't use the same Config object twice.

In the case of the thread pool executor this probably doesn't matter, but it's likely to be more of a problem with htex.

There's a style for re-loading "the same" config, used in (for example) the test suite and some applications I work on, that looks something like this:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

def fresh_config():
    # construct a brand-new Config (and executor) instance on every call
    return Config(executors=[HighThroughputExecutor()])

parsl.load(fresh_config())
parsl.clear()
parsl.load(fresh_config())

Probably we should enforce this more strictly - single use wasn't part of the initial design of how Config was meant to work, but it has since turned out to be a property of Config objects (and the sub-objects contained inside a Config).

Can you try something like the above and see how things go?

@benclifford

benclifford commented Aug 23, 2023

I see in your comment on the CytoTable issue cytomining/CytoTable#90 (comment) that you have that sort of behaviour as a recommended change - for a user to generate a new config in each for-loop iteration.

Either that, or (I'm not sure how your code is organised for this) make it so that Parsl/the DFK can be initialised outside the loop and not reloaded on each iteration.

I've found it a bit easier to understand why things work that way by thinking of a Config not as "here is our configuration-file-style specification", but as "here is a bundle of the actual instantiated components that will be started and terminated by parsl.load()".

@d33bs

d33bs commented Aug 24, 2023

Thank you @benclifford for the recommendations and clarity here! Is there a way to compare Parsl Config objects so that one could avoid "accidentally" reloading a configuration? I recognize a deep object comparison might accomplish this, but I wondered whether each configuration is assigned something like a GUID that could provide this capability.

@benclifford

So if you want to say "we have a DFK loaded right now, and someone just gave me a config object ... is it the same config object? and if not, do a Parsl shutdown/restart with the new config", then I would say you won't be able to compare any better than simple Python object identity, using the is operator.
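For illustration, a minimal sketch of that identity check - _active_config and maybe_reload are hypothetical names for this sketch, not part of Parsl's API:

import parsl
from parsl.config import Config

# hypothetical module-level record of the most recently loaded config
_active_config = None

def maybe_reload(new_config: Config) -> None:
    # reload Parsl only when the caller supplies a different Config object
    global _active_config
    if new_config is _active_config:
        return  # same object identity: this config's DFK is already running
    if _active_config is not None:
        parsl.dfk().cleanup()  # shut down the currently loaded DFK
        parsl.clear()
    parsl.load(new_config)
    _active_config = new_config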

For the example in CytoTable issue 90:

for plate_folder in sqlite_dir.iterdir():
    output_path = pathlib.Path(f"{output_dir}/{plate_folder.stem}_converted.parquet")
    # merge single cells and output as parquet file
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        metadata=["image"],
        compartments=["nuclei"],
        identifying_columns=["ImageNumber"],
        joins=preset_join,
        parsl_config=Config(
            executors=[HighThroughputExecutor()],
        ),
        chunk_size=10000,
    )

Probably, for that to work, the user should create one Config object outside the for loop, store it in a variable, and pass that variable as parsl_config=my_single_config, if they want to use one single config for all of the Parsl work.
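A minimal sketch of that shape, assuming the same surrounding variables and convert signature as the excerpt above (my_single_config is an assumed variable name):

# one Config object, created once outside the loop and reused on every call
my_single_config = Config(
    executors=[HighThroughputExecutor()],
)

for plate_folder in sqlite_dir.iterdir():
    output_path = pathlib.Path(f"{output_dir}/{plate_folder.stem}_converted.parquet")
    convert(
        source_path=str(plate_folder),
        dest_path=str(output_path),
        dest_datatype=dest_datatype,
        metadata=["image"],
        compartments=["nuclei"],
        identifying_columns=["ImageNumber"],
        joins=preset_join,
        parsl_config=my_single_config,  # same object every iteration
        chunk_size=10000,
    )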

Then the CytoTable code can compare the parsl_config just passed in with the config object of the currently active Parsl instance, and switch configs only if the user has supplied a new one on a subsequent call.

If you avoid shutting down and restarting Parsl each time, I think you'll also get some speedup from not paying Parsl's startup and shutdown costs on each loop iteration - depending on circumstances, that might or might not be significant.

However, this style feels a bit weird to me, and I'd encourage you to think about whether you can also initialise Parsl outside such a loop - I guess it mostly depends on how much you expect your users to change configs as a loop progresses.
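As a rough sketch of that load-once shape using plain Parsl - process_plate is a hypothetical stand-in for the per-iteration work, and sqlite_dir is as in the excerpt above:

import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

@python_app
def process_plate(path: str) -> str:
    # hypothetical per-plate work, executed on an htex worker
    return f"processed {path}"

# load Parsl once, before the loop
parsl.load(Config(executors=[HighThroughputExecutor()]))

# submit all iterations against the single loaded DFK, then collect results
futures = [process_plate(str(p)) for p in sqlite_dir.iterdir()]
results = [f.result() for f in futures]

# shut down once, after all work completes
parsl.dfk().cleanup()
parsl.clear()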

@d33bs

d33bs commented Aug 24, 2023

Many thanks @benclifford for the recommendations here! I'll move forward with using Parsl as you described.

benclifford pushed a commit that referenced this issue Oct 24, 2023
Clarifies problems users found in the documentation:

  • error in parallel documentation #2915: Decorators are not needed for ParslPoolExecutor.map
  • Reloading default HighThroughputExecutor after parsl.clear() results in KeyError: 'block_id' exception #2871: Configs may not be re-used
  • Fixes Docs: Further clarification on how serialization impacts where imports need to be #2918: Explain when functions need imports
  • Fixes Allowing for inputs and outputs to take immutable types #2925: Explain mutable types are undesirable

Thanks, @Andrew-S-Rosen