BalancedBatchSampler QoL Updates #566

nimashoghi · 2023-08-24T18:39:28Z

Right now, BalancedBatchSampler has some rough edges:

- It requires a very specific npz format for metadata storage, which every dataset that wants to be balanced must follow. The new ASE dataset does not currently support this format.
- The force_balancing and throw_on_error parameters are confusing.

For this PR, I updated BalancedBatchSampler to rely on a protocol that expects datasets to implement a data_sizes method, which returns the "size" of each requested dataset sample. I have updated LmdbDataset and OC22LmdbDataset to implement this in a way that is backward-compatible with the old implementation.
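As a rough illustration of what such a protocol could look like, here is a minimal sketch using `typing.Protocol`. The names `HasDataSizes`, `ToyDataset`, `NoSizesDataset`, and `supports_balancing` are hypothetical and are not taken from this PR; only the `data_sizes` method name and signature come from the description above.

```python
from typing import List, Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class HasDataSizes(Protocol):
    """Hypothetical protocol name: anything exposing data_sizes."""

    def data_sizes(self, batch_idx: List[int]) -> np.ndarray:
        ...


class ToyDataset:
    """Stand-in dataset; the atom counts below are made up."""

    def __init__(self) -> None:
        self._natoms = np.array([12, 48, 7, 30])

    def data_sizes(self, batch_idx: List[int]) -> np.ndarray:
        # Return the size (here: number of atoms) of each requested sample.
        return self._natoms[batch_idx]


class NoSizesDataset:
    """A dataset that does not implement the protocol."""


def supports_balancing(dataset: object) -> bool:
    # isinstance on a runtime_checkable Protocol only checks that the
    # method exists; it does not validate the signature or return type.
    return isinstance(dataset, HasDataSizes)
```

With a check like this, a sampler could decide up front whether a dataset can be balanced at all, independent of how the dataset stores its metadata.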

I have also completely removed the neighbors balancing support, as our graph generation happens on GPU anyway, and the number of neighbors changes depending on max_neighbors/cutoff. In many cases, the values stored in the metadata would end up not being accurate.

As a TL;DR, here's essentially the main change (as far as the datasets are concerned).
Previously:

class MyDataset(Dataset):
    def __init__(self, ...):
        ...
        self.metadata_path = ...

Now:

class MyDataset(Dataset):
    def data_sizes(self, batch_idx: List[int]) -> np.ndarray:
        # Use the loaded metadata to load the natoms for samples in batch_idx
        return self.metadata["natoms"][batch_idx]

    def __init__(self, ...):
        ...
        self.metadata = np.load(...) # Load all metadata in the init method
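
To show how a sampler might consume `data_sizes`, here is a self-contained sketch that splits one batch across replicas so that the per-replica atom counts are roughly equal. The greedy largest-first heuristic, the `InMemoryDataset` class, and the `partition_by_size` function are illustrative assumptions; this is not necessarily the algorithm BalancedBatchSampler uses.

```python
import heapq
from typing import List

import numpy as np


class InMemoryDataset:
    """Hypothetical dataset that keeps natoms in memory instead of an npz file."""

    def __init__(self, natoms: List[int]) -> None:
        self.metadata = {"natoms": np.asarray(natoms)}

    def data_sizes(self, batch_idx: List[int]) -> np.ndarray:
        return self.metadata["natoms"][batch_idx]


def partition_by_size(
    dataset: InMemoryDataset, batch_idx: List[int], num_parts: int
) -> List[List[int]]:
    """Greedily assign samples, largest first, to the least-loaded part."""
    sizes = dataset.data_sizes(batch_idx)
    order = np.argsort(-sizes, kind="stable")  # positions, largest size first
    heap = [(0, part) for part in range(num_parts)]  # (current load, part id)
    heapq.heapify(heap)
    parts: List[List[int]] = [[] for _ in range(num_parts)]
    for pos in order:
        load, part = heapq.heappop(heap)
        parts[part].append(batch_idx[int(pos)])
        heapq.heappush(heap, (load + int(sizes[pos]), part))
    return parts
```

Because the function only calls `data_sizes`, it works with any dataset that implements that method, which is the point of moving away from a fixed metadata file format.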

Tasks:

Update BalancedBatchSampler to use datasets' data_sizes method
Replace BalancedBatchSampler's force_balancing and throw_on_error parameters with on_error
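
One way a single `on_error` parameter could replace two boolean flags is sketched below. The mode names (`"raise"`, `"warn_and_no_balance"`) and the `resolve_sizes` helper are assumptions made for illustration; the description above does not list the values that `on_error` accepts.

```python
import logging
from typing import Callable, List, Literal, Optional

logger = logging.getLogger(__name__)

# Hypothetical mode names; the PR description does not spell these out.
OnError = Literal["raise", "warn_and_no_balance"]


def resolve_sizes(
    load_sizes: Callable[[], List[int]], on_error: OnError = "raise"
) -> Optional[List[int]]:
    """Return sample sizes, or None if loading fails and balancing is skipped."""
    try:
        return load_sizes()
    except Exception as err:
        if on_error == "raise":
            raise
        # Fall back to unbalanced batches instead of aborting training.
        logger.warning("Could not load sizes (%s); disabling balancing.", err)
        return None
```

A single enumerated option makes the failure behavior explicit at the call site, whereas two independent booleans allow combinations whose meaning is unclear.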

codecov · 2023-08-30T02:55:04Z

Codecov Report

Attention: Patch coverage is 55.63910% with 59 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
ocpmodels/common/balanced_batch_sampler.py	58.82%	42 Missing ⚠️
ocpmodels/trainers/base_trainer.py	22.22%	7 Missing ⚠️
ocpmodels/datasets/lmdb_dataset.py	50.00%	5 Missing ⚠️
ocpmodels/datasets/oc22_lmdb_dataset.py	54.54%	5 Missing ⚠️

Files with missing lines	Coverage Δ
ocpmodels/common/data_parallel.py	`21.05% <100.00%> (-27.38%)`	⬇️
ocpmodels/datasets/lmdb_dataset.py	`38.58% <50.00%> (+0.44%)`	⬆️
ocpmodels/datasets/oc22_lmdb_dataset.py	`16.31% <54.54%> (+3.23%)`	⬆️
ocpmodels/trainers/base_trainer.py	`16.93% <22.22%> (+0.08%)`	⬆️
ocpmodels/common/balanced_batch_sampler.py	`58.82% <58.82%> (ø)`

github-actions · 2023-09-30T00:33:32Z

This PR has been marked as stale because it has been open for 30 days with no activity.

wood-b · 2024-10-24T00:44:50Z

Closing this PR, as it was incorporated into PR #753.

nimashoghi and others added 5 commits August 24, 2023 18:27

Update BalancedBatchSampler to use datasets' data_sizes method

ae4add3

Replace BalancedBatchSampler's `force_balancing` and `throw_on_error` parameters with `on_error`

Remove python 3.10 syntax

01fe2b4

Documentation

2bf8213

Added set_epoch method

7ba5b8a

Format

a367d1e

Changed "resolved dataset" message to be a debug log to reduce log spam

46e3c57

github-actions bot added the stale label Sep 30, 2023

abhshkdz added dont-close and removed stale labels Oct 2, 2023

abhshkdz self-assigned this Oct 2, 2023

mshuaibii unassigned abhshkdz Apr 8, 2024

mshuaibii marked this pull request as draft April 8, 2024 19:53

mshuaibii mentioned this pull request Apr 8, 2024

Balanced Sampler QOL #644

Closed

mshuaibii requested a review from wood-b April 8, 2024 21:12

mshuaibii added the enhancement New feature or request label Apr 9, 2024

wood-b removed the dont-close label Oct 24, 2024

wood-b closed this Oct 24, 2024
