Precompute cdvae #84
Conversation
bmuaz commented on Jan 9, 2024 (edited):
- A dataset selection option is added so users can select different datasets for CDVAE precomputing.
- Graph training data can be generated from four datasets: MP, Carolina, NOMAD, and OQMD.
- The OQMD devset data is regenerated so that the previously missing "site" information is now included.
- oqmd_api.py is also modified.
I left a few comments. examples/model_demos/cdvae/cdvae_precompute.py is not really structured how I would have liked: there is a lot of repetitive code, which makes it hard to maintain. As a first pass I don't mind too much about letting it through, but I want to see if @melo-gonzo has any opinions. Also, since it's replacing the original code, it would be good if @migalkin could provide comments/feedback on this.

@bmuaz you'll also need to resolve conflicts in the files, since they've been updated since you created the branch. We can't merge until you've done this. The CONTRIBUTORS file added Jonathan's name, and the other two files were modified in #90, I believe. You obviously don't have to worry about …
```python
for element in elements:
    atomic_num = atomic_number_map()[element]
    if atomic_num is not None:
        atomic_num_dict.append(atomic_num)
return atomic_num_dict
```
This loop can be replaced with one list comprehension
Yes, I changed it to `[Atomic_num_map_global[element] for element in elements]`.
```python
def get_atomic_num(elements):
    atomic_num_dict = []
    for element in elements:
        atomic_num = atomic_number_map()[element]
```
Here, `atomic_number_map()` is called for each element in the array; there is no need for that, as it can be initialized once at the beginning.
Yes, I am now using `Atomic_num_map_global = atomic_number_map()` before calling the function.
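Putting the two review comments together, the rewritten helper presumably looks something like this (a sketch reconstructed from the replies above; the exact code in the PR isn't shown):

```python
# Sketch combining both fixes: the map is built once, and the loop
# becomes a list comprehension.
Atomic_num_map_global = atomic_number_map()

def get_atomic_num(elements):
    return [Atomic_num_map_global[element] for element in elements]
```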
```python
atomic_num_dict = []
for element in elements:
    atomic_num = atomic_number_map()[element]
    if atomic_num is not None:
```
`atomic_number_map()` returns a dict, and the lookup never returns `None`; if the key `element` doesn't exist, a `KeyError` exception is raised instead.
Yes, done
```python
for atomic_num in atomic_numbers:
    element = map_reversed[atomic_num]
    if element is not None:
        elements.append(element)
return elements
```
The same comments from `get_atomic_num()` apply here.
Yes, done.
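Applying the same two fixes here would give something along these lines (a sketch; the function name `get_elements` is a hypothetical placeholder, not taken from the PR):

```python
# Build the reversed map once instead of per call.
map_reversed = {num: elem for elem, num in atomic_number_map().items()}

def get_elements(atomic_numbers):
    return [map_reversed[num] for num in atomic_numbers]
```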
```python
for i in range(num_sites):
    for j in range(i, num_sites):
        delta = numpy.subtract(coords[i], coords[j])
        # Apply minimum image convention for periodic boundary conditions
        # delta -= numpy.round(delta)
        distance = numpy.linalg.norm(numpy.dot(delta, lattice_vectors))
        distance_matrix[i, j] = distance
        distance_matrix[j, i] = distance
```
This O(N^2) loop is very inefficient and can be replaced with a one-line vectorized numpy operation.
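For reference, the vectorized replacement might look like this (a sketch; the PR's actual implementation isn't shown in the thread):

```python
import numpy

def get_distance_matrix(coords, lattice_vectors):
    """Pairwise distances via broadcasting instead of a double loop."""
    coords = numpy.asarray(coords)
    # Shape (N, N, 3): difference between every pair of coordinates
    delta = coords[:, None, :] - coords[None, :, :]
    # Map each displacement through the lattice vectors, as in the loop above
    cartesian = delta @ lattice_vectors
    # Norm over the last axis yields the symmetric (N, N) distance matrix
    return numpy.linalg.norm(cartesian, axis=-1)
```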
Done.
There is a lot of code repetition in: …

Sorry, my oversight on the four …

Can we get this approved if there are no more questions? Thanks.
```python
def data_to_cdvae_MP(item):
    num_atoms = len(item["structure"].atomic_numbers)
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_MP(item)
        return pyg_data


def data_to_cdvae_NOMAD(item):
    num_atoms = len(item["properties"]["structures"]["structure_conventional"]["species_at_sites"])
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_NOMAD(item)
        return pyg_data


def data_to_cdvae_OQMD(item):
    num_atoms = item["natoms"]
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_OQMD(item)
        return pyg_data


def data_to_cdvae_Carolina(item):
    num_atoms = len(item["atomic_numbers"])
    check_num_atoms(num_atoms)
    if check_num_atoms:
        pyg_data = parse_structure_Carolina(item)
        return pyg_data
```
There is some significant code repetition in those 4 functions, which (1) accept the same argument (`item`), (2) return the same object (`pyg_data`), and (3) execute the same sequence of actions (`check_num_atoms()` and the `if` block). That makes them perfect candidates for condensing into a single function, with the dataset-specific `parse_structure_` function passed in as an argument.
Yes, I condensed them into one function, `data_to_cdvae()`.
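The condensed version presumably looks something like this (a sketch; the actual signature in the PR isn't shown, and `num_atoms_fn`/`parse_fn` are hypothetical parameter names):

```python
# Hypothetical condensed helper: dataset-specific behavior is injected
# via the two callables rather than duplicated across four functions.
def data_to_cdvae(item, num_atoms_fn, parse_fn):
    num_atoms = num_atoms_fn(item)
    check_num_atoms(num_atoms)
    return parse_fn(item)

# Example usage for the Materials Project dataset:
# pyg_data = data_to_cdvae(item, lambda i: len(i["structure"].atomic_numbers), parse_structure_MP)
```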
```python
if structure is None:
    raise ValueError(
        "Structure not found in data - workflow needs a structure to use!"
    )
```
Many blocks in the `parse_structure_XXX` functions are repeated, like this check and some of the dictionary value assignments; those can also be simplified.
Yes, removed the redundant sections. Done.
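For illustration, the repeated check could be factored into a small shared helper like this (a sketch; the helper name is hypothetical and not from the PR):

```python
# Hypothetical shared helper replacing the repeated None check in the
# parse_structure_XXX functions.
def require_structure(structure):
    if structure is None:
        raise ValueError(
            "Structure not found in data - workflow needs a structure to use!"
        )
    return structure
```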
The PR looks much better now! It can still be polished further, though, to remove unnecessary code repetition in other functions (highlighted above).
```python
distance_matrix = get_distance_matrix(cartesian_coords, numpy.array(structure["lattice_vectors"]) * 1E10)
return_dict["distance_matrix"] = torch.from_numpy(distance_matrix)
y = (item["energies"]["total"]["value"] * 6.241509074461E+18) / num_particles  # formation_energy_per_atom, eV
```
What is the reason for multiplying by such large numbers? When training a subsequent energy regression model, large values could lead to numerical instability.
@migalkin this is necessary: it's a conversion from joules to eV and from meters to angstroms.
Can @bmuaz print some indicative values stored in the data before and after the conversion?
Sure, as I mentioned yesterday, NOMAD uses meters and joules for its data; you can check the original lattice and total-energy fields. For example:

```json
"lattice_vectors": [
    [2.91807e-10, 0, 0],
    [0, 2.91807e-10, 0],
    [0, 0, 2.91807e-10]
],
"cartesian_site_positions": [
    [0, 0, 0],
    [1.45904e-10, 1.45904e-10, 1.45904e-10]
],
"total": { "value": -9.17469e-15 }
```
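For the sample above, the conversion gives roughly the following (a back-of-the-envelope check using the constants from the snippet in this thread, not output from the PR):

```python
# Quick check of the unit conversion on the NOMAD sample above.
J_TO_EV = 6.241509074461e18  # 1 joule in electronvolts
M_TO_ANGSTROM = 1e10         # 1 meter in angstroms

print(2.91807e-10 * M_TO_ANGSTROM)   # 2.91807 (lattice constant, angstrom)
total_ev = -9.17469e-15 * J_TO_EV
print(total_ev)                      # ~ -5.7264e4 (total energy, eV)
print(total_ev / 2)                  # ~ -2.8632e4 (eV per site; 2 sites here)
```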
I see, do we need to keep the exponent then? Let's check the absolute energy values in the four datasets: if we are to train a single model on MP and NOMAD, and MP has, for example, energy values as floats in the range [-10, 10] while NOMAD's are in the range [-1e5, 1e5], then a standard regression scaler will get confused and treat the MP values as almost zero.
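As a toy illustration of that concern (hypothetical numbers, not taken from the datasets):

```python
# With one shared mean/std over mixed-scale targets, the eV-scale values
# collapse to nearly identical scaled values.
import numpy

mp_like = numpy.array([-8.0, -3.0, 5.0])            # small, eV-scale values
nomad_like = numpy.array([-9.2e4, -5.7e4, -1.1e4])  # much larger scale

combined = numpy.concatenate([mp_like, nomad_like])
scaled = (combined - combined.mean()) / combined.std()
print(scaled[:3])  # the three MP-like values are almost indistinguishable
```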
It's also important to remember that we have `normalize_kwargs` to help with these scaling issues. Here is an example. I often forget about this, but it helps tremendously in stabilizing training.
The `normalize_kwargs` are per-task, so you can re-scale based on each task (and by extension, each dataset).

I think the design question is: do we apply the unit conversion in the `data_from_keys` step, or do we save the data already converted (as is done here)? Personally I prefer the former, with documentation of what is being done, as opposed to having no metadata associated with the precomputed sets.