Replace ImageNet ensemble baseline with Robustness Metrics.

There are two ways to implement ensembles.

1. Load all SavedModels into a single model.
   + Pro: Simple to compute results.
   + Con: All models must fit in memory, and compute can't parallelize across models.
2. Evaluate each model in parallel, saving predictions. Then load the predictions and compute metrics. (This is the approach in Uncertainty Baselines.)
   + Pro: Scales with compute and memory.
   + Con: Requires two stages (the first uses accelerators, the second is CPU-only). We're already running the first stage to report non-ensemble results, so two stages is not that inconvenient.

This CL does #2.

Fixes google/uncertainty-baselines#63, google/uncertainty-baselines#71.

Note: I added 'ece' back to the imagenet_variants report.

TODOs in later PRs:
+ Loading predictions is slow. Each file is at most 200MB (50K predictions of 1000 float32 values, i.e. 50,000 x 1,000 x 4 bytes), and np.load gets, say, read speeds of 200 MB/s (https://stackoverflow.com/a/30332316), so read_predictions should take on the order of a second per file rather than this long. It may be because we're loading with batch_size=1?
+ Replace het_ensemble.py and sngp_ensemble.py.

PiperOrigin-RevId: 370938990
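To make approach #2 concrete, here is a minimal sketch of the two stages using plain NumPy files. It is an illustration under assumptions, not the code in this CL: the commit stores predictions through the Robustness Metrics serialization module, whereas this sketch uses .npy files, and the helper names (save_member_predictions, load_and_average, accuracy) and the toy probabilities are hypothetical.

```python
import os
import tempfile

import numpy as np


def save_member_predictions(directory, name, probs):
    """Stage 1 (hypothetical helper): write one member's probabilities to disk."""
    path = os.path.join(directory, f"{name}.npy")
    np.save(path, np.asarray(probs, dtype=np.float32))
    return path


def load_and_average(paths):
    """Stage 2: load every member's predictions and average the probabilities."""
    stacked = np.stack([np.load(path) for path in paths], axis=0)
    return stacked.mean(axis=0)


def accuracy(probs, labels):
    """Fraction of examples whose argmax class matches the label."""
    return float(np.mean(np.argmax(probs, axis=-1) == np.asarray(labels)))


# Toy data (assumed for illustration): 3 examples, 2 classes, 2 ensemble members.
labels = [0, 0, 1]
member_a = [[0.6, 0.4], [0.4, 0.6], [0.2, 0.8]]
member_b = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # In practice stage 1 would run per model on accelerators, in parallel.
    paths = [
        save_member_predictions(tmp_dir, "member_a", member_a),
        save_member_predictions(tmp_dir, "member_b", member_b),
    ]
    # Stage 2 only needs the saved files, so it can run on CPU.
    ensemble_probs = load_and_average(paths)

ensemble_accuracy = accuracy(ensemble_probs, labels)
```

In this toy data each member misclassifies at least one example, while the averaged probabilities classify all three correctly. Only the saved predictions, not the models, need to be in memory during stage 2.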
google-research · May 18, 2021 · fa6cafc
1 parent 4929fd6
commit fa6cafc

Showing 2 changed files with 18 additions and 5 deletions.
diff --git a/robustness_metrics/metrics/serialization.py b/robustness_metrics/metrics/serialization.py
@@ -46,6 +46,8 @@ def add_predictions(self,
         tf.convert_to_tensor(model_predictions.predictions, dtype=tf.float32))
     serialized_metadata = {}
     for key, value in metadata.items():
+      if isinstance(value, tf.Tensor):
+        value = value.numpy()
       if hasattr(value, "dtype") and value.dtype == np.int:
         if isinstance(value, np.ndarray):
           value = [int(x) for x in value.tolist()]
@@ -56,6 +58,9 @@ def add_predictions(self,
           value = [float(x) for x in value.tolist()]
         else:
           value = float(value)
+      # Convert bytes (e.g., ImageNetVidRobust's video_frame_id).
+      if isinstance(value, bytes):
+        value = value.decode("utf-8")
       serialized_metadata[key] = value
     serialized_metadata = json.dumps(serialized_metadata).encode()
     tf_example = tf.train.Example(features=tf.train.Features(feature={
@@ -98,4 +103,11 @@ def parse(features_serialized):
       prediction = types.ModelPredictions(
           predictions=example["predictions"].numpy())
       metadata = json.loads(example["metadata"].numpy())
+      # Apply a special case to lists of size 1. We need to adjust for the fact
+      # that int-casting a Tensor with shape [1] works (this may be the original
+      # element), but int-casting a list of size 1 (this may be the saved
+      # element) doesn't work.
+      for key, value in metadata.items():
+        if isinstance(value, list) and len(value) == 1:
+          metadata[key] = value[0]
       yield prediction, metadata
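The two metadata fixes in serialization.py can be illustrated without TensorFlow. The sketch below is a simplified stand-in, not the library's code: to_json_safe and unwrap_singletons are hypothetical helpers, and the "label" key and the frame ID value are made up for the example. It shows why bytes values must be decoded before json.dumps, and why a list of size 1 read back from JSON needs unwrapping before it can be int-cast.

```python
import json


def to_json_safe(metadata):
    """Mimics the serialization side: decode bytes so json.dumps accepts them."""
    safe = {}
    for key, value in metadata.items():
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        safe[key] = value
    return safe


def unwrap_singletons(metadata):
    """Mimics the parsing side: replace lists of size 1 with their only element."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in metadata.items()
    }


# Hypothetical metadata: a size-1 label list and a bytes frame identifier.
original = {"label": [7], "video_frame_id": b"frame_0001"}
encoded = json.dumps(to_json_safe(original)).encode()
decoded = json.loads(encoded)

# int() on a plain list raises TypeError, unlike int() on a shape-[1] Tensor.
try:
    int(decoded["label"])
    int_cast_of_list_works = True
except TypeError:
    int_cast_of_list_works = False

restored = unwrap_singletons(decoded)
```

One consequence of this design is that metadata which was genuinely a list of size 1 also comes back as a scalar, so callers cannot tell the two cases apart after parsing.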
diff --git a/robustness_metrics/reports/imagenet_variants.py b/robustness_metrics/reports/imagenet_variants.py
@@ -62,11 +62,12 @@ class ImagenetVariantsReport(base.Report):
 
   This report contains the following ImageNet variants:
     * imagenet
-    * imagenet_a,
-    * imagenet_v2 (all variants)
+    * imagenet_a
+    * imagenet_v2/matched_frequency
     * imagenet_c (all variants)
-  For each dataset, we compute accuracy, expected calibration
-  error (ece), log-likelihood, Brier, timing, and adaptive ECE.
+
+  For each dataset, we compute accuracy, expected calibration error,
+  log-likelihood, Brier.
   """
 
   def __init__(self):
@@ -77,7 +78,7 @@ def __init__(self):
 
   def _yield_metrics_to_evaluate(self, use_dataset_labelset=None):
     """Yields metrics to be evaluated."""
-    metrics = ["accuracy", "nll", "brier"]
+    metrics = ["accuracy", "ece", "nll", "brier"]
     if use_dataset_labelset is not None:
       metrics = [f"{metric}(use_dataset_labelset={use_dataset_labelset})"
                  for metric in metrics]
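For readers unfamiliar with the 'ece' metric that this change adds back to the report, the following is a minimal sketch of expected calibration error with equal-width confidence bins. It is an assumption about the general technique, not necessarily how Robustness Metrics computes it; the function name, the num_bins default, and the toy inputs are hypothetical.

```python
import numpy as np


def expected_calibration_error(probs, labels, num_bins=10):
    """Weighted average over bins of |accuracy - mean confidence|."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    confidences = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(np.float64)
    # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for bin_id in range(num_bins):
        mask = bin_ids == bin_id
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            # mask.mean() is the fraction of examples that fall in this bin.
            ece += mask.mean() * gap
    return float(ece)


# Four predictions with confidence 0.75 for class 0 (toy data).
toy_probs = [[0.75, 0.25]] * 4
well_calibrated = expected_calibration_error(toy_probs, [0, 0, 0, 1])
overconfident = expected_calibration_error(toy_probs, [1, 1, 1, 1])
```

With three of four predictions correct, accuracy matches the 0.75 confidence and the error is zero; with none correct, the gap is the full 0.75.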