
Precompute cdvae #84

Open · bmuaz wants to merge 13 commits into main
Conversation

@bmuaz commented Jan 9, 2024

  1. Dataset selection is added so the user can choose among different datasets for CDVAE precomputing.
  2. Graph training data can be generated from four datasets: MP, Carolina, NOMAD, and OQMD.
  3. The OQMD devset data is regenerated so that the previously missing "site" information is now included.
  4. oqmd_api.py is also modified.

@laserkelvin added the needs triage (Issue needs decision making), ux (User experience, quality of life changes), and data (Issues related to data loading, pipelining, etc.) labels Jan 11, 2024
@laserkelvin (Collaborator) left a comment

I left a few comments. examples/model_demos/cdvae/cdvae_precompute.py is not really how I would have liked it, as there is a lot of repetitive code, which makes it hard to maintain.

As a first pass I don't mind letting it through too much, but I want to see if @melo-gonzo has any opinions. Also, since it's replacing the original code, it would be good if @migalkin could provide comments/feedback on this.

matsciml/datasets/oqmd/devset/data.lmdb (outdated, resolved)
matsciml/datasets/oqmd/oqmd_api.py (outdated, resolved)
matsciml/datasets/oqmd/oqmd_api.py (outdated, resolved)
@laserkelvin marked this pull request as ready for review January 16, 2024 21:13
@laserkelvin (Collaborator) commented Jan 18, 2024

@bmuaz you'll also need to resolve conflicts in the files, since they've been updated since you created the branch. We can't merge until you've done this.

The CONTRIBUTORS file added Jonathan's name, and the other two files were modified in #90, I believe. You obviously don't have to worry about precompute_mp_cdvae.py, but please make sure you use the right version of oqmd_api.py.

@laserkelvin removed the needs triage (Issue needs decision making) label Jan 19, 2024
Comment on lines 29 to 33
for element in elements:
    atomic_num = atomic_number_map()[element]
    if atomic_num is not None:
        atomic_num_dict.append(atomic_num)
return atomic_num_dict
Collaborator:

This loop can be replaced with a single list comprehension.

Author:

Yes, I changed it to [Atomic_num_map_global[element] for element in elements].

def get_atomic_num(elements):
    atomic_num_dict = []
    for element in elements:
        atomic_num = atomic_number_map()[element]
Collaborator:

Here, atomic_number_map() is called anew for each element in the array; there is no need for that, since it can be initialized once at the beginning.

Author:

Yes, I am now initializing Atomic_num_map_global = atomic_number_map() once before calling the function.

atomic_num_dict = []
for element in elements:
    atomic_num = atomic_number_map()[element]
    if atomic_num is not None:
@migalkin (Collaborator) commented Jan 19, 2024

atomic_number_map() returns a dict, and a dict lookup never returns None; if the key element doesn't exist, a KeyError exception is raised instead.

Author:

Yes, done
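
For reference, the three threads above combine into something like the following; a minimal sketch, assuming atomic_number_map() returns a plain dict of element symbols to atomic numbers as described:

Atomic_num_map_global = atomic_number_map()  # initialized once, not per element

def get_atomic_num(elements):
    # Direct indexing: a missing element raises KeyError rather than
    # silently yielding None, per the review comment above.
    return [Atomic_num_map_global[element] for element in elements]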

Comment on lines 40 to 44
for atomic_num in atomic_numbers:
    element = map_reversed[atomic_num]
    if element is not None:
        elements.append(element)
return elements
Collaborator:

The same comments from get_atomic_num() apply here

Author:

Yes, done.
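
The reverse lookup follows the same pattern; a sketch, with the function name here a stand-in since the original isn't quoted:

map_reversed = {num: symbol for symbol, num in Atomic_num_map_global.items()}  # built once

def get_element_symbols(atomic_numbers):
    # A missing atomic number raises KeyError, consistent with the forward map.
    return [map_reversed[atomic_num] for atomic_num in atomic_numbers]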

Comment on lines 49 to 56
for i in range(num_sites):
    for j in range(i, num_sites):
        delta = numpy.subtract(coords[i], coords[j])
        # Apply minimum image convention for periodic boundary conditions
        # delta -= numpy.round(delta)
        distance = numpy.linalg.norm(numpy.dot(delta, lattice_vectors))
        distance_matrix[i, j] = distance
        distance_matrix[j, i] = distance
Collaborator:

This O(N^2) loop is very inefficient and can be replaced with a single vectorized numpy operation.

Author:

Done.
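
For reference, a vectorized replacement along the suggested lines; this assumes coords is an (N, 3) numpy array of fractional coordinates and lattice_vectors is (3, 3), as in the loop above:

# Pairwise differences via broadcasting: shape (N, N, 3)
delta = coords[:, None, :] - coords[None, :, :]
# Map each difference through the lattice and take the norm over the last axis: shape (N, N)
distance_matrix = numpy.linalg.norm(delta @ lattice_vectors, axis=-1)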

@migalkin (Collaborator):

There is a lot of code repetition in:

  • parse_structure_MP, parse_structure_NOMAD, parse_structure_OQMD, parse_structure_Carolina: essentially, they can all be wrapped in one function and, depending on the particular dataset_name, the final dict can be enriched with dataset-specific fields
  • the main() function: all four if blocks have exactly the same code, differing by one line - why not move the if inside if key.decode("utf-8").isdigit()?

@bmuaz (Author) commented Jan 19, 2024

Sorry, my oversight on the four if blocks. It is now cleaned up by using pyg_data = dataset_functions.get(dataset_name)(crystal_data). Thanks for all the great comments.
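
For reference, that dispatch reads roughly as follows; a sketch, with crystal_data standing in for one decoded LMDB record and the surrounding loop omitted:

dataset_functions = {
    "MP": parse_structure_MP,
    "Carolina": parse_structure_Carolina,
    "NOMAD": parse_structure_NOMAD,
    "OQMD": parse_structure_OQMD,
}

if key.decode("utf-8").isdigit():
    pyg_data = dataset_functions.get(dataset_name)(crystal_data)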

@bmuaz (Author) commented Feb 1, 2024

Can we get this approved if there are no more questions? Thanks.

Comment on lines 218 to 244
def data_to_cdvae_MP(item):
    num_atoms = len(item["structure"].atomic_numbers)
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_MP(item)
        return pyg_data

def data_to_cdvae_NOMAD(item):
    num_atoms = len(item["properties"]["structures"]["structure_conventional"]["species_at_sites"])
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_NOMAD(item)
        return pyg_data

def data_to_cdvae_OQMD(item):
    num_atoms = item["natoms"]
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_OQMD(item)
        return pyg_data

def data_to_cdvae_Carolina(item):
    num_atoms = len(item["atomic_numbers"])
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_Carolina(item)
        return pyg_data
Collaborator:

There is some significant code repetition in those 4 functions, which
(1) accept the same argument (item),
(2) return the same object (pyg_data), and
(3) execute the same sequence of actions (check_num_atoms() and the if block),
which makes them perfect candidates for shrinking into one function, with the appropriate parse_structure_ function passed in as an argument.

Author:

Yes, condensed them to one function data_to_cdvae().
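
A sketch of what the condensed function might look like; the exact signature in the PR may differ, and num_atoms_fn is a hypothetical helper for the dataset-specific atom count:

def data_to_cdvae(item, num_atoms_fn, parse_fn):
    num_atoms = num_atoms_fn(item)
    # Assumes check_num_atoms returns a boolean; note that the quoted
    # code tested the function object itself, which is always truthy.
    if check_num_atoms(num_atoms):
        return parse_fn(item)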

Comment on lines 97 to 100
if structure is None:
    raise ValueError(
        "Structure not found in data - workflow needs a structure to use!"
    )
Collaborator:

Many blocks in the parse_structure_XXX functions are repeated, like this check and some of the dictionary value assignments; those can also be simplified.

Author:

Yes, removed the redundant sections. Done.

@migalkin (Collaborator) commented Feb 1, 2024

The PR looks much better now! It can still be polished further, though, to remove unnecessary code repetition in the other functions (highlighted above).

Comment on lines 117 to 119
distance_matrix = get_distance_matrix(cartesian_coords, numpy.array(structure["lattice_vectors"]) * 1E10)
return_dict["distance_matrix"] = torch.from_numpy(distance_matrix)
y = (item["energies"]["total"]["value"] * 6.241509074461E+18) / num_particles #formation_energy_per_atom, eV
Collaborator:

What is the reason for multiplying by such large numbers? When training a subsequent energy regression model, large values could lead to numerical instability.

Collaborator:

@migalkin this is a necessary conversion from joules to eV and from meters to angstroms.

Collaborator:

Can @bmuaz print some indicative values stored in the data before and after the conversion?

Author:

Sure, as I mentioned yesterday, NOMAD uses meters and joules for its data; you can check their original lattice and total-energy entries. For example:

"lattice_vectors": [[2.91807e-10, 0, 0], [0, 2.91807e-10, 0], [0, 0, 2.91807e-10]],
"cartesian_site_positions": [[0, 0, 0], [1.45904e-10, 1.45904e-10, 1.45904e-10]],
"total": {"value": -9.17469e-15}
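
Running those numbers through the conversion factors in the snippet under review gives a quick sanity check (this record has two sites):

J_TO_EV = 6.241509074461e18  # eV per joule
M_TO_ANG = 1e10              # angstrom per meter

2.91807e-10 * M_TO_ANG       # lattice constant -> 2.91807 angstrom
-9.17469e-15 * J_TO_EV       # total energy -> about -5.7264e4 eV
-9.17469e-15 * J_TO_EV / 2   # -> about -2.8632e4 eV per atom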

Collaborator:

I see, do we need to keep the exponent then? Let's check the absolute energy values in the four datasets: if we are to train a single model on MP and NOMAD, and MP has, for example, energy values as floats in the range [-10, 10] while NOMAD's are in the range [-1e5, 1e5], then the standard regressor scaler will get confused and treat the MP values as almost zeros.

Collaborator:

Also important to remember is that we have normalize_kwargs to help with these scaling issues. Here is an example. I often forget about this, but it helps tremendously in stabilizing training.
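
As a loose illustration only (the key-naming convention below is an assumption, not taken from this PR or the linked example), the idea is to hand the task the statistics of its training targets:

normalize_kwargs = {
    "energy_mean": -5.0,  # mean of the training-split energies (made-up value)
    "energy_std": 2.5,    # std of the training-split energies (made-up value)
}
# passed to the task on construction, e.g.
# ScalarRegressionTask(..., normalize_kwargs=normalize_kwargs)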

Collaborator:

The normalize_kwargs are per-task, so you can re-scale based on each task (and, by extension, each dataset).

I think the design question is: do we apply the unit conversion in the data_from_keys step, or do we save the values already converted (as is done here)? Personally, I prefer the former, documenting what is being done, as opposed to having no metadata associated with the precomputed sets.

Labels: data (Issues related to data loading, pipelining, etc.), ux (User experience, quality of life changes)
4 participants