
[BUG] DPA2 Lammps on nopbc systems causes torchscript error #4167

Closed
iProzd opened this issue Sep 27, 2024 · 4 comments · Fixed by #4209 or #4237
Comments

iProzd (Collaborator) commented Sep 27, 2024

Bug summary

When using a trained and frozen DPA2 model to run LAMMPS on nopbc systems, the program immediately raises a TorchScript error. Notably, this issue does not occur with DPA1 and se_a models in PyTorch, and the DPA2 model functions correctly on pbc systems, even with one-dimensional pbc.

DeePMD-kit Version

3.0.0b3

Backend and its version

PyTorch v2.1.2

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

  1. Train and freeze a DPA2 model in examples/water/dpa2.
  2. Change the boundary setting from p p p to f f f in the LAMMPS input in.lammps and link the frozen model in examples/water/lmp.
  3. Run lmp -i in.lammps.
Setting up Verlet run ...
  Unit style    : metal
  Current step  : 0
  Time step     : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 156, in forward_lower
    vvi = split_vv1[_44]
    svvi = split_svv1[_44]
    _45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, create_graph, )
          ~~~ <--- HERE
    ffi, aviri, = _45
    ffi0 = torch.unsqueeze(ffi, -2)
  File "code/__torch__/deepmd/pt/model/model/transform_output.py", line 191, in task_deriv_one
  faked_grad = torch.ones_like(energy)
  lst = annotate(List[Optional[Tensor]], [faked_grad])
  _52 = torch.autograd.grad([energy], [extended_coord], lst, True, create_graph)
        ~~~~~~~~~~~~~~~~~~~ <--- HERE
  extended_force = _52[0]
  if torch.__isnot__(extended_force, None):

Traceback of TorchScript, original code (most recent call last):
  File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 138, in forward_lower
    for vvi, svvi in zip(split_vv1, split_svv1):
        # nf x nloc x 3, nf x nloc x 9
        ffi, aviri = task_deriv_one(
                     ~~~~~~~~~~~~~~ <--- HERE
            vvi,
            svvi,
  File "/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/deepmd/pt/model/model/transform_output.py", line 80, in task_deriv_one
    faked_grad = torch.ones_like(energy)
    lst = torch.jit.annotate(List[Optional[torch.Tensor]], [faked_grad])
    extended_force = torch.autograd.grad(
                     ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [energy],
        [extended_coord],
RuntimeError: max(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
 (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run             1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
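For context, the traceback above points at the force evaluation in `task_deriv_one`, where the energy is differentiated with respect to the extended (local plus ghost) coordinates via `torch.autograd.grad`. Below is a minimal standalone sketch of that autograd pattern, not the DeePMD-kit implementation; the sizes and the toy energy are made up for illustration:

```python
import torch
from typing import List, Optional

# Toy stand-in for the pattern in transform_output.task_deriv_one:
# differentiate an energy with respect to the extended coordinates
# (local + ghost atoms) to obtain per-atom forces.
nframes, nall = 1, 192                    # made-up sizes for illustration
extended_coord = torch.randn(nframes, nall, 3, requires_grad=True)
energy = (extended_coord ** 2).sum(dim=(1, 2), keepdim=True)  # placeholder energy

faked_grad = torch.ones_like(energy)
lst: List[Optional[torch.Tensor]] = [faked_grad]
(extended_force,) = torch.autograd.grad(
    [energy], [extended_coord], lst, retain_graph=True
)
print(extended_force.shape)  # torch.Size([1, 192, 3])
```

On a working pbc run this kind of call succeeds; in the nopbc run the same step aborts inside the TorchScript interpreter with the empty-tensor `max()` error shown above, which suggests some tensor in the backward graph ends up with zero elements.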

Steps to Reproduce

See above.

Further Information, Files, and Links

No response

iProzd added the bug label Sep 27, 2024
iProzd (Collaborator, Author) commented Sep 27, 2024

This does not look easy to resolve so far.

  1. LAMMPS with the PyTorch DPA1 and se_a models works for nopbc systems.
  2. dp test always works for nopbc systems.
  3. LAMMPS with DPA2 still crashes, even with zero repformer layers.

Maybe it's a bug with border_op in TorchScript on nopbc systems?
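If the border_op/nopbc hypothesis holds, one plausible trigger is a dim-less reduction over an empty ghost-atom slice: with no periodic images there are no ghost atoms, so any ghost-only tensor has zero elements. A purely illustrative sketch (the names and sizes here are hypothetical) that reproduces the exact RuntimeError from the log:

```python
import torch

# Hypothetical illustration: under nopbc there are no ghost atoms, so a
# ghost-only slice of the extended coordinates is empty; calling max()
# without a dim on it raises the same RuntimeError seen in the LAMMPS log.
nloc, nghost = 192, 0                      # made-up local count, zero ghosts
extended_coord = torch.randn(nloc + nghost, 3)
ghost_coord = extended_coord[nloc:]        # shape (0, 3) when nghost == 0

try:
    ghost_coord.max()                      # dim-less reduction over an empty tensor
except RuntimeError as err:
    print(err)  # max(): Expected reduction dim to be specified for input.numel() == 0. ...
```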

njzjz (Member) commented Sep 27, 2024

xref: #4092

njzjz (Member) commented Oct 15, 2024

#4220 indicates that a segfault is still thrown with MPI.

github-merge-queue bot pushed a commit that referenced this issue Oct 17, 2024
Adding tests to see whether #4167 is resolved. The answer is no.
Segfaults are thrown with MPI.

## Summary by CodeRabbit

- **New Features**
  - Introduced a new command-line argument `--nopbc` to modify boundary conditions in LAMMPS simulations.
- **Tests**
  - Added a comprehensive suite of unit tests for the DeepMD potential in LAMMPS, covering various configurations and scenarios to ensure accuracy and reliability.

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
njzjz (Member) commented Oct 23, 2024

Fixed by #4237.
