
Files are not written for remote jobs #1471

Closed
Leimeroth opened this issue Jun 29, 2024 · 20 comments
Labels
bug Something isn't working

Comments

@Leimeroth
Member

Leimeroth commented Jun 29, 2024

When trying to submit Lammps jobs to a remote cluster, only a .h5 file is created, but no input files or working directory. I guess the necessary call to write_input went missing somewhere during the restructuring of the run functions.

EDIT:
For VASP it works, so the issue seems to lie in the Lammps class.

@Leimeroth Leimeroth added the bug Something isn't working label Jun 29, 2024
@jan-janssen jan-janssen transferred this issue from pyiron/pyiron_base Jun 29, 2024
@jan-janssen
Member

Can you try to call job.validate_ready_to_run() before submitting the job and check if that solves the issue?

@Leimeroth
Member Author

job.validate_ready_to_run() does not seem to change the behavior.
Manually doing

import os

os.makedirs(job.working_directory)
job.write_input()

seems to do the job.

@Leimeroth
Member Author

Leimeroth commented Jul 1, 2024

For potentials that are manually defined via a dataframe, the write_input_files_from_input_dict functionality breaks the remote setup because the file path of the potential does not exist on the remote cluster.

Traceback (most recent call last):
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/__main__.py", line 3, in <module>
    main()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/control.py", line 61, in main
    args.cli(args)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/cli/wrapper.py", line 37, in main
    job_wrapper_function(
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 186, in job_wrapper_function
    job.run()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/wrapper.py", line 131, in run
    self.job.run_static()
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/generic.py", line 917, in run_static
    execute_job_with_calculate_function(job=self)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 720, in wrapper
    output = func(job)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 978, in execute_job_with_calculate_function
    ) = job.get_calculate_function()(**job.calculate_kwargs)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 135, in __call__
    self.write_input_funct(
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/site-packages/pyiron_base/jobs/job/runfunction.py", line 80, in write_input_files_from_input_dict
    shutil.copy(source, os.path.join(working_directory, file_name))
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 417, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/home/hk-project-silicaat/id_uym1602/miniforge3/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp'

Edit: I am using https://github.com/pyiron/pyiron_atomistics/tree/workaround-file-copying as a workaround on the remote HPC right now.
As far as I understand, the idea of the new workflow is to copy only the HDF5 file and write all necessary files on the remote machine, is this correct? If yes, I guess it is necessary to somehow make an exception for potentials that are not part of the default data repository. I am also somewhat afraid of issues arising from different pyiron versions/branches when the files are only written on the remote machine.
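One way to sketch the exception discussed above is to stage the user-supplied potential file into the job's working directory before submission, so the input file only needs a relative name that also exists on the remote side. `stage_potential` is a hypothetical helper written for illustration, not part of the pyiron API:

```python
import os
import shutil
import tempfile


def stage_potential(potential_path, working_directory):
    """Copy a user-supplied potential file into the job's working
    directory and return its basename, so the Lammps input can refer
    to it by a relative path that also exists on the remote cluster.
    Hypothetical helper, not part of the pyiron API."""
    os.makedirs(working_directory, exist_ok=True)
    dst = os.path.join(working_directory, os.path.basename(potential_path))
    shutil.copy(potential_path, dst)
    return os.path.basename(dst)


# Example with a throwaway file standing in for the .mtp potential:
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "output.14.mtp")
    with open(src, "w") as f:
        f.write("mtp parameters")
    workdir = os.path.join(tmp, "job_working_directory")
    print(stage_potential(src, workdir))  # output.14.mtp
```

The working directory then travels with the job (or is re-created remotely), and no absolute local path such as the one in the traceback above needs to resolve on the cluster.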

@Leimeroth
Member Author

bump

@jan-janssen
Member

Can you be a bit more specific about where the potential file /nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp is located? Is this on the cluster or on the local workstation?

@Leimeroth
Member Author

This is the full local path.

@Leimeroth
Member Author

Regarding file writing, I guess the problem is

    def _check_if_input_should_be_written(self):
        if self._job_with_calculate_function:
            return False
        else:
            return not (
                self.server.run_mode.interactive
                or self.server.run_mode.interactive_non_modal
            )

always returning False for Lammps, so that

    def save(self):
        """
        Save the object, by writing the content to the HDF5 file and storing an entry in the database.

        Returns:
            (int): Job ID stored in the database
        """
        self.to_hdf()
        if not state.database.database_is_disabled:
            job_id = self.project.db.add_item_dict(self.db_entry())
            self._job_id = job_id
            _write_hdf(
                hdf_filehandle=self.project_hdf5.file_name,
                data=job_id,
                h5_path=self.job_name + "/job_id",
                overwrite="update",
            )
            self.refresh_job_status()
        else:
            job_id = self.job_name
        if self._check_if_input_should_be_written():
            self.project_hdf5.create_working_directory()
            self.write_input()
        self.status.created = True
        print(
            "The job "
            + self.job_name
            + " was saved and received the ID: "
            + str(job_id)
        )
        return job_id

never calls write_input.
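The effect can be reproduced with a minimal toy model of the quoted logic (a sketch, not pyiron code: the classes are made up, only the attribute names mirror the thread):

```python
class RunMode:
    """Stand-in for pyiron's run-mode flags (assumed shape)."""
    def __init__(self, interactive=False, interactive_non_modal=False):
        self.interactive = interactive
        self.interactive_non_modal = interactive_non_modal


class Server:
    def __init__(self):
        self.run_mode = RunMode()


class ToyJob:
    """Toy job carrying only the two attributes the check reads."""
    def __init__(self, job_with_calculate_function):
        self._job_with_calculate_function = job_with_calculate_function
        self.server = Server()

    def _check_if_input_should_be_written(self):
        if self._job_with_calculate_function:
            return False
        return not (
            self.server.run_mode.interactive
            or self.server.run_mode.interactive_non_modal
        )


# Lammps-style job with the calculate-function path active:
print(ToyJob(True)._check_if_input_should_be_written())   # False
# With the flag flipped, save() would write the input again:
print(ToyJob(False)._check_if_input_should_be_written())  # True
```

This is why save() skips both create_working_directory() and write_input() whenever _job_with_calculate_function is set, regardless of the run mode.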

@jan-janssen
Member

Just as a workaround, can you check if it works by setting:

job._job_with_calculate_function = False

@Leimeroth
Member Author

With job._job_with_calculate_function = False the input and an additional WARNING_pyiron_modified_content file are written.

@jan-janssen
Member

With job._job_with_calculate_function = False the input and an additional WARNING_pyiron_modified_content file are written.

Does the remote submission work when job._job_with_calculate_function = False is set?

@Leimeroth
Member Author

Leimeroth commented Jul 3, 2024

Yes, the job is submitted and runs.

EDIT:
The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its status to initialized instead of finished locally.

@Leimeroth
Member Author

As the issue is not part of the Lammps class itself, I am confused why it works with VASP.

@jan-janssen
Member

Yes, the job is submitted and runs.

EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its status to initialized instead of finished locally.

Ok, an alternative suggestion would be to add the write_input() call before the remote submission. I tried it in pyiron/pyiron_base#1511, but I have not tested it so far.

@jan-janssen
Member

As the issue is not part of the Lammps class itself, I am confused why it works with VASP.

I do not know yet. We had another bug with how restart files are read (pyiron/pyiron_base#1509), but that is still a work in progress.

@Leimeroth
Member Author

Yes, the job is submitted and runs.
EDIT: The job runs and finishes on the cluster. However, retrieving it with pr.update_from_remote() changes its status to initialized instead of finished locally.

Ok, an alternative suggestion would be to add the write_input() call before the remote submission. I tried it in pyiron/pyiron_base#1511 but have not tested it so far.

This works with the addition of job.project_hdf5.create_working_directory(). In that case the warning file is not created.

@jan-janssen
Member

This works with the addition of job.project_hdf5.create_working_directory(). In that case the warning file is not created.

Great, I think that is the best solution until we have https://github.com/pyiron/pympipool ready to handle the remote submission.

@Leimeroth
Member Author

Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?

@jan-janssen
Member

Do you have an idea how to fix the issue of potentials that are not part of the resources dataframe?

I would modify the potential dataframe, and maybe just attach the potential as a restart file.
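As a sketch of that suggestion: assuming the custom-potential dataframe carries the file paths in a Filename column (the column layout below is an assumption modelled on custom Lammps potential dataframes in pyiron_atomistics), one could rewrite absolute local paths to basenames once the files have been staged into the working directory:

```python
import os

import pandas as pd

# Hypothetical custom-potential dataframe; the exact columns are an
# assumption, not a confirmed pyiron schema.
potential = pd.DataFrame({
    "Name": ["AlCuZr_mtp"],
    "Filename": [["/nfshome/leimeroth/MTP/AlCuZr/Fractions2//10/14//output.14.mtp"]],
    "Model": ["Custom"],
    "Species": [["Al", "Cu", "Zr"]],
})

# Rewrite absolute paths to basenames so the remote job, which receives
# staged copies of the files in its working directory, can resolve them.
potential["Filename"] = potential["Filename"].apply(
    lambda paths: [os.path.basename(p) for p in paths]
)
print(potential["Filename"][0])  # ['output.14.mtp']
```

The actual copying of each file into the working directory would still have to happen before submission, e.g. alongside write_input().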

@jan-janssen
Member

@niklassiemer I am closing this issue, feel free to reopen it if it comes up again.

@niklassiemer
Member

Probably a wrong ping, @Leimeroth
