
Precompute cdvae #84

Open · bmuaz wants to merge 13 commits into main
Conversation

@bmuaz commented Jan 9, 2024

  1. Dataset selection is added so the user can choose among different datasets for CDVAE precomputing.
  2. Graph training data can be generated from four datasets: MP, Carolina, NOMAD, and OQMD.
  3. The OQMD devset data is regenerated so that the previously missing "site" information is now included.
  4. oqmd_api.py is also modified.

@laserkelvin added the needs triage (Issue needs decision making), ux (User experience, quality of life changes), and data (Issues related to data loading, pipelining, etc.) labels Jan 11, 2024
@laserkelvin (Collaborator) left a comment

I left a few comments. examples/model_demos/cdvae/cdvae_precompute.py is not really how I would have liked it, as there is a lot of repetitive code, which makes it hard to maintain.

As a first pass I don't mind letting it through too much, but I want to see if @melo-gonzo has any opinions. Also, since it's replacing the original code, it would be good if @migalkin could provide comments/feedback on this.

matsciml/datasets/oqmd/devset/data.lmdb (outdated, resolved)
matsciml/datasets/oqmd/oqmd_api.py (outdated, resolved)
matsciml/datasets/oqmd/oqmd_api.py (outdated, resolved)
@laserkelvin marked this pull request as ready for review January 16, 2024 21:13
@laserkelvin (Collaborator) commented Jan 18, 2024

@bmuaz you'll also need to resolve conflicts in the files, since they've been updated since you created the branch. We can't merge until you've done this.

The CONTRIBUTORS file added Jonathan's name, and the other two files were modified in #90, I believe. You obviously don't have to worry about precompute_mp_cdvae.py, but please make sure you use the right version of oqmd_api.py.

@laserkelvin removed the needs triage (Issue needs decision making) label Jan 19, 2024
Comment on lines 29 to 33
for element in elements:
    atomic_num = atomic_number_map()[element]
    if atomic_num is not None:
        atomic_num_dict.append(atomic_num)
return atomic_num_dict
Collaborator:

This loop can be replaced with a single list comprehension.

Author:

Yes, I changed it to [Atomic_num_map_global[element] for element in elements].

def get_atomic_num(elements):
    atomic_num_dict = []
    for element in elements:
        atomic_num = atomic_number_map()[element]
Collaborator:

Here, atomic_number_map() is called anew for each element in the array; there is no need for that, since it can be initialized once at the beginning.

Author:

Yes, I am now initializing Atomic_num_map_global = atomic_number_map() once before calling the function.

atomic_num_dict = []
for element in elements:
    atomic_num = atomic_number_map()[element]
    if atomic_num is not None:
@migalkin (Collaborator) commented Jan 19, 2024

atomic_number_map() returns a dict, and a dict lookup never returns None; if the key element doesn't exist, a KeyError exception is raised instead.

Author:

Yes, done
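
For reference, the three threads above combine into something like the following; a minimal sketch, assuming atomic_number_map() returns a plain dict of element symbols to atomic numbers as described:

Atomic_num_map_global = atomic_number_map()  # initialized once, not per element

def get_atomic_num(elements):
    # Direct indexing: a missing element raises KeyError rather than
    # silently yielding None, per the review comment above.
    return [Atomic_num_map_global[element] for element in elements]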

Comment on lines 40 to 44
for atomic_num in atomic_numbers:
    element = map_reversed[atomic_num]
    if element is not None:
        elements.append(element)
return elements
Collaborator:

The same comments from get_atomic_num() apply here

Author:

Yes, done.
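
The reverse lookup follows the same pattern; a sketch, with the function name here a stand-in since the original isn't quoted:

map_reversed = {num: symbol for symbol, num in Atomic_num_map_global.items()}  # built once

def get_element_symbols(atomic_numbers):
    # A missing atomic number raises KeyError, consistent with the forward map.
    return [map_reversed[atomic_num] for atomic_num in atomic_numbers]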

Comment on lines 49 to 56
for i in range(num_sites):
    for j in range(i, num_sites):
        delta = numpy.subtract(coords[i], coords[j])
        # Apply minimum image convention for periodic boundary conditions
        # delta -= numpy.round(delta)
        distance = numpy.linalg.norm(numpy.dot(delta, lattice_vectors))
        distance_matrix[i, j] = distance
        distance_matrix[j, i] = distance
Collaborator:

This O(N^2) loop is very inefficient and can be replaced with a single vectorized numpy operation.

Author:

Done.
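
For reference, a vectorized replacement along the suggested lines; this assumes coords is an (N, 3) numpy array of fractional coordinates and lattice_vectors is (3, 3), as in the loop above:

# Pairwise differences via broadcasting: shape (N, N, 3)
delta = coords[:, None, :] - coords[None, :, :]
# Map each difference through the lattice and take the norm over the last axis: shape (N, N)
distance_matrix = numpy.linalg.norm(delta @ lattice_vectors, axis=-1)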

@migalkin (Collaborator):

There is a lot of code repetition in:

  • parse_structure_MP, parse_structure_NOMAD, parse_structure_OQMD, parse_structure_Carolina: essentially, they can all be wrapped in one function and, depending on the particular dataset_name, the final dict can be enriched with dataset-specific fields
  • the main() function: all four if blocks have exactly the same code, differing by one line - why not move the if inside if key.decode("utf-8").isdigit()?

@bmuaz (Author) commented Jan 19, 2024

Sorry, my oversight on the four if blocks. It is now cleaned up by using pyg_data = dataset_functions.get(dataset_name)(crystal_data). Thanks for all the great comments.
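
For reference, that dispatch reads roughly as follows; a sketch, with crystal_data standing in for one decoded LMDB record and the surrounding loop omitted:

dataset_functions = {
    "MP": parse_structure_MP,
    "Carolina": parse_structure_Carolina,
    "NOMAD": parse_structure_NOMAD,
    "OQMD": parse_structure_OQMD,
}

if key.decode("utf-8").isdigit():
    pyg_data = dataset_functions.get(dataset_name)(crystal_data)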

@bmuaz (Author) commented Feb 1, 2024

Can we get this approved if there are no more questions? Thanks.

Comment on lines 218 to 244
def data_to_cdvae_MP(item):
    num_atoms = len(item["structure"].atomic_numbers)
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_MP(item)
        return pyg_data

def data_to_cdvae_NOMAD(item):
    num_atoms = len(item["properties"]["structures"]["structure_conventional"]["species_at_sites"])
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_NOMAD(item)
        return pyg_data

def data_to_cdvae_OQMD(item):
    num_atoms = item["natoms"]
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_OQMD(item)
        return pyg_data

def data_to_cdvae_Carolina(item):
    num_atoms = len(item["atomic_numbers"])
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_Carolina(item)
        return pyg_data
Collaborator:

There is some significant code repetition in those 4 functions, which
(1) accept the same argument (item),
(2) return the same object (pyg_data), and
(3) execute the same sequence of actions (check_num_atoms() and the if block),
which makes them perfect candidates for shrinking into one function, with the appropriate parse_structure_ function passed in as an argument.

Author:

Yes, condensed them to one function data_to_cdvae().
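
A sketch of what the condensed function might look like; the exact signature in the PR may differ, and num_atoms_fn is a hypothetical helper for the dataset-specific atom count:

def data_to_cdvae(item, num_atoms_fn, parse_fn):
    num_atoms = num_atoms_fn(item)
    # Assumes check_num_atoms returns a boolean; note that the quoted
    # code tested the function object itself, which is always truthy.
    if check_num_atoms(num_atoms):
        return parse_fn(item)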

Comment on lines 97 to 100
if structure is None:
    raise ValueError(
        "Structure not found in data - workflow needs a structure to use!"
    )
Collaborator:

Many blocks in the parse_structure_XXX functions are repeated, like this check and some of the dictionary value assignments; those can also be simplified.

Author:

Yes, removed the redundant sections. Done.

@migalkin (Collaborator) commented Feb 1, 2024

The PR looks much better now! It can still be polished further, though, to remove unnecessary code repetition in the other functions (highlighted above).

Comment on lines 117 to 119
distance_matrix = get_distance_matrix(cartesian_coords, numpy.array(structure["lattice_vectors"]) * 1E10)
return_dict["distance_matrix"] = torch.from_numpy(distance_matrix)
y = (item["energies"]["total"]["value"] * 6.241509074461E+18) / num_particles #formation_energy_per_atom, eV
Collaborator:

What is the reason for multiplying by such large numbers? When training a subsequent energy regression model, large values could lead to numerical instability.

Collaborator:

@migalkin this is a necessary conversion from joules to eV and from meters to angstroms.

Collaborator:

Can @bmuaz print some indicative values stored in the data before and after the conversion?

Author:

Sure, as I mentioned yesterday, NOMAD uses meters and joules for its data; you can check their original lattice and total-energy entries. For example:

"lattice_vectors": [[2.91807e-10, 0, 0], [0, 2.91807e-10, 0], [0, 0, 2.91807e-10]],
"cartesian_site_positions": [[0, 0, 0], [1.45904e-10, 1.45904e-10, 1.45904e-10]],
"total": {"value": -9.17469e-15}
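
Running those numbers through the conversion factors in the snippet under review gives a quick sanity check (this record has two sites):

J_TO_EV = 6.241509074461e18  # eV per joule
M_TO_ANG = 1e10              # angstrom per meter

2.91807e-10 * M_TO_ANG       # lattice constant -> 2.91807 angstrom
-9.17469e-15 * J_TO_EV       # total energy -> about -5.7264e4 eV
-9.17469e-15 * J_TO_EV / 2   # -> about -2.8632e4 eV per atom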

Collaborator:

I see, do we need to keep the exponent then? Let's check the absolute energy values in the four datasets: if we are to train a single model on MP and NOMAD, and MP has, for example, energy values as floats in the range [-10, 10] while NOMAD's are in the range [-1e5, 1e5], then the standard regressor scaler will get confused and treat the MP values as almost zeros.

Collaborator:

Also important to remember is that we have normalize_kwargs to help with these scaling issues. Here is an example. I often forget about this, but it helps tremendously in stabilizing training.
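
As a loose illustration only (the key-naming convention below is an assumption, not taken from this PR or the linked example), the idea is to hand the task the statistics of its training targets:

normalize_kwargs = {
    "energy_mean": -5.0,  # mean of the training-split energies (made-up value)
    "energy_std": 2.5,    # std of the training-split energies (made-up value)
}
# passed to the task on construction, e.g.
# ScalarRegressionTask(..., normalize_kwargs=normalize_kwargs)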

Collaborator:

The normalize_kwargs are per-task, so you can re-scale based on each task (and, by extension, each dataset).

I think the design question is: do we apply the unit conversion in the data_from_keys step, or do we save the values already converted (as is done here)? Personally, I prefer the former, documenting what is being done, as opposed to having no metadata associated with the precomputed sets.

Labels: data (Issues related to data loading, pipelining, etc.), ux (User experience, quality of life changes)
4 participants