Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix timeout in command_line.enumlib_caller.EnumlibAdaptor #4276

Merged

Conversation

DanielYang59
Copy link
Contributor

@DanielYang59 DanielYang59 commented Jan 29, 2025

Summary

@DanielYang59 DanielYang59 force-pushed the fix-4185-enumlib-caller-timeout branch from b980d67 to 6947ffc Compare January 29, 2025 20:19
with pytest.raises(TimeoutError, match="Enumeration took too long"):
adaptor._run_multienum()
Copy link
Contributor Author

@DanielYang59 DanielYang59 Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original test command is incorrect, a bare _run_multienum doesn't generate the necessary input files, and _gen_input_file is needed (therefore Fortran runtime error: Cannot open file 'struct_enum.in': No such file or directory), so perhaps just use run instead.

After fixing this, the run time on my local Apple M4 machine is around 0.04 s

@DanielYang59 DanielYang59 marked this pull request as ready for review January 31, 2025 13:12
@DanielYang59 DanielYang59 marked this pull request as draft February 1, 2025 09:10
@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 1, 2025

Hi @fraricci sorry for the late response, can you help me test if this patch works? I don't have a "slow enough" enumlib job to test with (the unit test job last only 0.04 s and I'm not sure if it's really validating the behaviour), thanks a lot!

@DanielYang59 DanielYang59 marked this pull request as ready for review February 1, 2025 21:07
@fraricci
Copy link
Contributor

Hey @DanielYang59, I just saw your fix here. Sure I can test it on my structure if still needed.

@DanielYang59
Copy link
Contributor Author

Sure, please have a try and let me know whether the fix works for you

@fraricci
Copy link
Contributor

fraricci commented Feb 12, 2025

I tested and still it does not work.
I tried to insert some print to understand and found that the line
output = process.communicate(timeout=timeout)[0].decode("utf-8")
is executed instantly without waiting for the timeout and it is not generating any error.

Also, a version that uses sleep 60 reveals that this logic does catch the timeout but
it still waits the end of the command to return the raised error

ENUM_CMD = ["/usr/bin/sleep","30"]
timeout = 6
with subprocess.Popen(ENUM_CMD, stdout=subprocess.PIPE, stdin=subprocess.PIPE, close_fds=True) as process:
    print(timeout)
    try:
        print("running")
        output = process.communicate(timeout=timeout)[0].decode("utf-8")
        print("running ended")
        
    except subprocess.TimeoutExpired as exc:
        print("timeout")
        raise TimeoutError("Enumeration took too long")

I think we need to kill the process inside the except.

@DanielYang59 DanielYang59 marked this pull request as draft February 12, 2025 15:17
@DanielYang59
Copy link
Contributor Author

Hi thanks for testing. I cannot really say for sure because I haven't really used enumlib (maybe it doesn't respond to SIGTERM or something?).

In this case is it possible to share the structure and code for me to recreate this?

@fraricci
Copy link
Contributor

fraricci commented Feb 12, 2025

Yeah, I think enum.x is not responding as other commands do.
Here is a structure which enumlib should take some time for computing all the configurations.
POSCAR_mp-1218641.txt

The code is yours with some print lines as in the example above.

Let me know what you find, thanks!

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 14, 2025

The code is yours with some print lines as in the example above.

I guess I might be missing something (as I didn't share any code before). I tried the following script and seem to get another error (looks like the SpacegroupAnalyzer cannot find symmetrized_structure, even after I used a less strict enum_precision_parameter):

from pymatgen.core import Structure
from pymatgen.command_line.enumlib_caller import EnumlibAdaptor


struct = Structure.from_file("./POSCAR_mp-1218641.txt")

adaptor = EnumlibAdaptor(struct, enum_precision_parameter=0.01, timeout=5)

# Expect TimeoutError
adaptor.run()
Traceback (most recent call last):
  File "/Users/yang/developer/pymatgen/debug/test_enumlib_timeout.py", line 10, in <module>
    adaptor.run()
    ~~~~~~~~~~~^^
  File "/Users/yang/developer/pymatgen/src/pymatgen/command_line/enumlib_caller.py", line 136, in run
    self._gen_input_file()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/yang/developer/pymatgen/src/pymatgen/command_line/enumlib_caller.py", line 205, in _gen_input_file
    sg_num = get_sg_number(curr_sites)
  File "/Users/yang/developer/pymatgen/src/pymatgen/command_line/enumlib_caller.py", line 199, in get_sg_number
    finder = SpacegroupAnalyzer(Structure.from_sites(ss), self.symm_prec)
                                ~~~~~~~~~~~~~~~~~~~~^^^^
  File "/Users/yang/developer/pymatgen/src/pymatgen/core/structure.py", line 1254, in from_sites
    raise ValueError(f"You need at least 1 site to construct a {cls.__name__}")
ValueError: You need at least 1 site to construct a Structure

Can you share a code snippet for me to recreate this? Thanks!

@fraricci
Copy link
Contributor

fraricci commented Feb 14, 2025

I see, calling the EnumlibAdapter on that structure does not work.
Here my code:

from pymatgen.analysis.magnetism.analyzer import MagneticStructureEnumerator
from monty.serialization import loadfn

st = Structure.from_file("./POSCAR_mp-1218641.txt")
enum_mag = MagneticStructureEnumerator(st,transformation_kwargs={"timeout":1})

Note: you need the other fix to get the timout arg passed correctly to the EnumlibAdapter.

Let me know how that goes.

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 15, 2025

Great! Really appreciate the code snippet!

I just had a look but I don't think that code snippet is enough to recreate the issue. In my case, the total runtime of the entire script (at around 150 seconds) is indeed longer than the timeout (note the timeout is in minutes so it's 60 seconds). However the run time of _run_multienum is around 0.01 seconds so we're far from hitting that timeout.

I guess we'd be much better off to have a code snnipet that could isolate this issue (the above code is a super high level wrapper and enum_caller is deeply nested).


I also did a quick line profiling of the script:

I didn't go further than this as it's already unrelated to timeout on the enumlib_caller side, just mention this in case you want to optimize this part.

@fraricci
Copy link
Contributor

Thanks for pointing out that the bottleneck might be not the enumeration but the "duplicate_checker" step.
Here a custom structure that can be used directly by EnumlibAdaptor
struct_for_testing_enumlib.json
and here the code I used for testing.

from pymatgen.core import Structure
from pymatgen.command_line.enumlib_caller import EnumlibAdaptor
import time

struct = loadfn("./struct_for_testing_enumlib.json")

adaptor = EnumlibAdaptor(struct_for_enumlib,
                         max_cell_size=10, #increasing this while take longer and hit the timeout
                         timeout=0.1)

# Expect TimeoutError
adaptor.run()

This should take enough time and reveals that the current implementation is actually working, but still missing the process.terminate()/kill() in the exception (as mentioned above).
Another process that is taking some time is the step "getting_structures" in the enumlibAdaptor.run() which follows the enumeration, and that fooled me in the beginning in finding the culprit.

Let me know what you find and we can close this I think.

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 17, 2025

Beautiful! I could now recreate this issue (it's super helpful)!

Looks like enum.x indeed doesn't correctly respond to SIGTERM (subprocess.TimeoutExpired is caught successfully but the enum.x process wouldn't terminate), I switched to send SIGKILL instead.

I tested and seem to work on my side, can you have a try?

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 17, 2025

@fraricci Can I have your permission to include the [struct_for_testing_enumlib.json](https://github.com/user-attachments/files/18825409/struct_for_testing_enumlib.json) file into the pymatgen test suite (as a test file)? Previous test doesn't seem to do what it should, we need to update it.

@fraricci
Copy link
Contributor

Good news, glad it helped out testing correctly this issue.
I had already tested myself adding process.terminate() in the exception and it works fine.
Sure, go ahead and include that structure into the test suit.

If you can also change the error message as I mentioned above, that could be great.

Thanks!

@DanielYang59
Copy link
Contributor Author

I had already tested myself adding process.terminate() in the exception and it works fine.

Great to know. You could directly checkout my branch (no need to manually apply the change)

If you can also change the error message as I mentioned above, that could be great.

Happy to change, but I'm not seeing anything related to this?

@DanielYang59 DanielYang59 force-pushed the fix-4185-enumlib-caller-timeout branch from a6e7c4f to a70c8bf Compare February 17, 2025 18:28
@fraricci
Copy link
Contributor

It was just easier to add one line than go through updating branch ;-)

I mentioned this in a message above:

Could you change the message into:
f"Enumeration took more than the provided timeout {self.timeout} minutes"

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 17, 2025

Done! Super weird, I still couldn't find that message for some reason

@DanielYang59 DanielYang59 marked this pull request as ready for review February 17, 2025 19:30
@fraricci
Copy link
Contributor

It's a review message and I see that it is "pending". Maybe that's why you can't see it.

Anyway, thanks. It's good to go.

@DanielYang59
Copy link
Contributor Author

DanielYang59 commented Feb 18, 2025

It's a review message and I see that it is "pending". Maybe that's why you can't see it.

That explains, you have to submit the review message otherwise people cannot see it: https://github.com/orgs/community/discussions/10369

Anyway, thanks

No problem at all. Can never be done without your help :)

@shyuep shyuep merged commit 7cc19c3 into materialsproject:master Feb 18, 2025
43 checks passed
@shyuep
Copy link
Member

shyuep commented Feb 18, 2025

Thanks all for the great bug fix!

@DanielYang59
Copy link
Contributor Author

No problem at all!

@DanielYang59 DanielYang59 deleted the fix-4185-enumlib-caller-timeout branch February 18, 2025 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timeout in MagneticStructureEnumerator (and EnumlibAdaptor) does not work
3 participants