Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

atom_site_label in CIF file are not unique #3761

Closed
fxcoudert opened this issue Apr 16, 2024 · 10 comments · Fixed by #3767
Closed

atom_site_label in CIF file are not unique #3761

fxcoudert opened this issue Apr 16, 2024 · 10 comments · Fixed by #3767
Labels
bug cif Crystallographic information file io Input/output functionality

Comments

@fxcoudert
Copy link

Python version

Python 3.11.6

Pymatgen version

2023.10.4

Operating system version

macOS 14.4.1

Current behavior

Read a structure with symmetry, and write it to a CIF file. My example is:

$ cat MOF-71.cif  
data_MOF-71
_audit_creation_method         'generated by CrystalMaker 8.6.5'
_publ_section_comment
;
From DOI: 10.1021/ja045123o

http://www.chemistry.ucla.edu/crmr/yaghi/pdfPublications/Rod-PkgsMtlOrgFrJAmChemSoc2005.
pdf

;
_cell_length_a                   8.8481(0)
_cell_length_b                   7.2386(0)
_cell_length_c                  19.2301(0)
_cell_angle_alpha               90.0000(0)
_cell_angle_beta                90.0000(0)
_cell_angle_gamma               90.0000(0)

_symmetry_space_group_name_H-M     'P 2/m'
_symmetry_Int_Tables_number         10
_symmetry_cell_setting             monoclinic
loop_
_symmetry_equiv_pos_as_xyz
'+x,+y,+z'
'-x,+y,-z'
'-x,-y,-z'
'+x,-y,+z'

loop_
_atom_site_label
_atom_site_type_symbol
_atom_site_occupancy
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
     C1   C  1.0000      0.0134    0.0000    0.2761
     C2   C  1.0000      0.5463    0.0000    0.4307
     C3   C  1.0000      0.0463    0.5000    0.9307
     C4   C  1.0000      0.5134    0.5000    0.7761
     C5   C  1.0000      0.5463    0.0000    0.0693
     C6   C  1.0000      0.0238    0.3362    0.5341
     C7   C  1.0000      0.0238    0.6638    0.9659
     C8   C  1.0000      0.1469    0.0000    0.1670
     C9   C  1.0000      0.6469    0.5000    0.6670
     C10  C  1.0000      0.2871    0.0000    0.2815
     C11  C  1.0000      0.7871    0.5000    0.7815
     C12  C  1.0000      0.5238    0.1638    0.4659
     C13  C  1.0000      0.5238    0.8362    0.0341
     C14  C  1.0000      0.0463    0.5000    0.5693
     C15  C  1.0000      0.5974    0.0000    0.1442
     C16  C  1.0000      0.0974    0.5000    0.6442
     C17  C  1.0000      0.5974    0.0000    0.3558
     C18  C  1.0000      0.0974    0.5000    0.8558
    Co    Co 1.0000      0.7500    0.7500    0.2500
     H1   H  1.0000      0.7353    0.0000    0.8506
     H2   H  1.0000      0.5310    0.5000    0.6469
     H3   H  1.0000      0.5409    0.2755    0.4433
     H4   H  1.0000      0.7975    0.3678    0.8154
     H5   H  1.0000      0.0134    0.0000    0.3341
     H6   H  1.0000      0.7259    0.0000    0.6618
     H7   H  1.0000      0.6492    0.1322    0.7363
     H8    H  1.0000      0.7100    0.3678    0.6488
     H9    H  1.0000      0.8742    0.5000    0.7410
     H10    H  1.0000      0.9143    0.8678    0.8525
     H11    H  1.0000      0.5409    0.7245    0.0567
     H12    H  1.0000      0.0409    0.2245    0.5567
     H13    H  1.0000      0.0409    0.7755    0.9433
     H14    H  1.0000      0.5134    0.5000    0.8291
     N1   N  1.0000      0.6427    0.5000    0.7500
     N2   N  1.0000      0.1427    0.0000    0.2500
     O1    O  1.0000      0.3875    0.5000    0.7500
     O2    O  1.0000      0.6148    0.8462    0.1722
     O3    O  1.0000      0.1148    0.3461    0.6722
     O4    O  1.0000      0.6148    0.1538    0.3278
     O5    O  1.0000      0.1148    0.6539    0.8278
     O6    O  1.0000      0.8875    0.0000    0.2500

Then read it and write it back to CIF file:

structure = Structure.from_file("MOF-71.cif")
cif_writer = CifWriter(bonded_struct.structure)
cif_writer.write_file("output.cif")

The new CIF file will have atoms with duplicate atom_site_label:

loop_
 _atom_site_type_symbol
 _atom_site_label
 _atom_site_symmetry_multiplicity
 _atom_site_fract_x
 _atom_site_fract_y
 _atom_site_fract_z
 _atom_site_occupancy
  Co  Co0  1  0.75000000  0.75000000  0.25000000  1.0
  Co  Co1  1  0.25000000  0.75000000  0.75000000  1.0
  Co  Co2  1  0.25000000  0.25000000  0.75000000  1.0
  Co  Co3  1  0.75000000  0.25000000  0.25000000  1.0
  H  H1  1  0.73530000  0.00000000  0.85060000  1.0
  H  H1  1  0.26470000  0.00000000  0.14940000  1.0
  H  H2  1  0.53100000  0.50000000  0.64690000  1.0
  H  H2  1  0.46900000  0.50000000  0.35310000  1.0
  H  H3  1  0.54090000  0.27550000  0.44330000  1.0
  H  H3  1  0.45910000  0.27550000  0.55670000  1.0
  H  H3  1  0.45910000  0.72450000  0.55670000  1.0
  H  H3  1  0.54090000  0.72450000  0.44330000  1.0
[…]

CIF format specifies (https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_label.html):

The _atom_site_label is a unique identifier for a particular site in the crystal.

So this shouldn't happen. I believe this incorrect behavior was introduced by #3183 (and follow-ups: #3423 and #3527)

Expected Behavior

All sites in the CIF file should be given a unique atom_site_label. This should be enforced by the CifWriter class.

Minimal example

No response

Relevant files to reproduce this bug

No response

@fxcoudert fxcoudert added the bug label Apr 16, 2024
@fxcoudert
Copy link
Author

One of the reasons labels should be unique: they are used to indicate bonding (and angles, torsions, etc). If they are not unique, this information is useless. (And I am working on a PR to add that capability to pymatgen.)

@Andrew-S-Rosen
Copy link
Member

I can confirm that this is likely a bug (in terms of having agreement with the CIF spec).

@stefsmeets
Copy link
Contributor

The bug may not necessarily be in the CifWriter itself. Pymatgen expands the structure to P1 when reading a cif file (symmetry information gets lost). This includes duplicating the labels.

@fxcoudert
Copy link
Author

Does pymatgen require labels to be unique? I do not know. If so, the bug is in the symmetry expansion. But what is sure is that labels in a CIF file should be unique (whether pymatgen accepts duplicate labels or not).

@stefsmeets
Copy link
Contributor

No, pymatgen does not require unique labels. The duplication of labels happens when the symmetry is expanded in the CifParser:

pymatgen/pymatgen/io/cif.py

Lines 1003 to 1011 in 0e57abf

if not match:
coord_to_species[coord] = comp
coord_to_magmoms[coord] = magmom
labels[coord] = label
else:
coord_to_species[match] += comp
# disordered magnetic not currently supported
coord_to_magmoms[match] = None
labels[match] = label

One way to solve would be to ensure labels are suffixed 'abcde...' when the symmetry gets expanded there. Although this can also be a check on the CifWriter side.

@JaGeo
Copy link
Member

JaGeo commented Apr 17, 2024

@stefsmeets I just wanted to mention that this might be problematic for very large structures. There are only 26 letters in the alphabet. With exception to this drawback, such a solution would be nice.

@stefsmeets
Copy link
Contributor

I'd be happy to have a go at this

@JaGeo
Copy link
Member

JaGeo commented Apr 17, 2024

@stefsmeets sound great. (cc @janosh)

@stefsmeets
Copy link
Contributor

stefsmeets commented Apr 18, 2024

Some options:

  1. CifWriter raises ValueError if Structure contains duplicate label -> user is forced to correct it
  2. CifWriter raises UserWarning if Structure contains duplicate label -> user can choose to correct it
  3. CifWriter overwrites labels to fix the issue -> less hassle for the users, but can lead to unexpected labels in CIF file
  4. CifParser.parse_structures() ensures unique labels when reading CIF -> if the user modifies the sites, they may still end up with duplicate labels

I want to avoid agressively/automatically relabelling, because I know that this can lead to unexpected results which can be frustrating. So I'm leaning towards 1 or 2.

To help with 1 or 2, I want to add a method: Structure.relabel_sites() to give users some room to quickly relabel a large number of sites and ensure uniqueness. I think this can be useful to add regardless.

Let me know what you think.

@fxcoudert
Copy link
Author

  • Option 1: sadly, there are CIF files (mostly from computational tools) that have this problem, so I think that is not viable.
  • Option 2: sounds nice, but does not account for situations where pymatgen itself is creating label duplicates (when lowering symmetry). How about an option to CifWriter to make labels unique.

@janosh janosh changed the title atom_site_label in CIF file are not unique atom_site_label in CIF file are not unique Apr 25, 2024
@janosh janosh added io Input/output functionality cif Crystallographic information file labels Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug cif Crystallographic information file io Input/output functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants