Commit

Merge branch 'release/v1.1.0' of https://github.com/voxel51/fiftyone into develop
voxel51-bot committed Nov 26, 2024
2 parents 5e32142 + 738b128 commit 835ccec
Showing 2 changed files with 148 additions and 0 deletions.
148 changes: 148 additions & 0 deletions docs/source/brain.rst
@@ -74,6 +74,13 @@ workflow:
examples to train on in your data and for visualizing common modes of the
data.

* :ref:`Leaky Splits <brain-image-leaky-splits>`:
  Often when sourcing data en masse, duplicates and near duplicates can slip
  through the cracks. The FiftyOne Brain offers a *leaky-splits analysis* that
  can be used to find potential leaks between dataset splits. Such leaks can
  make model evaluation misleading, giving an overly optimistic measure of
  training quality.

.. note::

    Check out the :ref:`tutorials page <tutorials>` for detailed examples
@@ -1759,6 +1766,147 @@ samples being less representative and closer samples being more representative.
    :alt: representativeness
    :align: center


.. _brain-image-leaky-splits:

Leaky Splits
____________

Despite our best efforts, duplicates and other forms of non-IID samples
show up in our data. When such samples end up in different splits, evaluation
is compromised: it becomes easy to overestimate model capability. The FiftyOne
Brain offers a way to identify such cases in dataset splits.

The leaks of a |Dataset| or |DatasetView| can be computed directly, without the
need for the predictions of a pre-trained model, via the
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
method. The splits of a dataset can be defined in three ways: through tags, by
tagging samples with their corresponding split; through a field, by giving each
split a unique value in that field; or through views, by providing a view
corresponding to each split.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.brain as fob

    dataset = fo.load_dataset(...)

    # splits via tags
    split_tags = ['train', 'test']
    index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

    # splits via field
    split_field = 'split'  # holds split values, e.g. 'train' or 'test'
    index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

    # splits via views
    split_views = {
        'train': some_view,
        'test': some_other_view,
    }
    index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)

Here is a sample snippet to run this on the `COCO <https://cocodataset.org/#home>`_ dataset.
Try it for yourself and see what you may find.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.zoo as foz
    import fiftyone.utils.random as four
    from fiftyone.brain import compute_leaky_splits

    # load COCO
    dataset = foz.load_zoo_dataset("coco-2017", split="test")

    # set up splits via tags
    dataset.untag_samples(dataset.distinct("tags"))
    four.random_split(dataset, {"train": 0.7, "test": 0.3})

    # compute leaks
    index, leaks = compute_leaky_splits(dataset, split_tags=['train', 'test'])

Once you have these leaks, it is wise to look through them; you may gain some
insight into their source.

.. code-block:: python
    :linenos:

    session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can easily be done with the built-in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

.. code-block:: python
    :linenos:

    # if you already have a test set view
    test_set = some_view

    # it can also be found via the `split_views` property of the index;
    # make sure to use the right key based on the field/tag/view dict
    # passed when building the index
    test_set = index.split_views['test']

    # returns a view with the leaks removed
    test_set_no_leaks = index.no_leaks_view(test_set)
    session.view = test_set_no_leaks

    # do evaluations on test_set_no_leaks rather than test_set

Performance on the clean test set can be closer to the performance of the
model in the wild. If you found some leaks in your dataset, consider comparing
performance on the base test set against the clean test set.
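
For example, here is a minimal sketch of such a comparison, assuming your
samples already contain `predictions` and `ground_truth` detection fields:

.. code-block:: python
    :linenos:

    # evaluate on the original test set
    full_results = test_set.evaluate_detections(
        "predictions", gt_field="ground_truth", eval_key="eval_full"
    )

    # evaluate on the test set with the leaks removed
    clean_results = test_set_no_leaks.evaluate_detections(
        "predictions", gt_field="ground_truth", eval_key="eval_clean"
    )

    full_results.print_report()
    clean_results.print_report()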

**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
of tags, a field, or views.

**Output**: An index that allows you to look through your leaks and provides
some useful actions once they are discovered, such as automatically cleaning
the dataset with
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
or tagging leaks for future reference with
:meth:`tag_leaks() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`.
In addition, a view containing all leaks is returned. Visualizing this view
can give you insight into the source of the leaks in your dataset.
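
For instance, here is a sketch of tagging the discovered leaks for later
review (calling the method with no arguments assumes it applies a default tag;
check its signature for the available options):

.. code-block:: python
    :linenos:

    # tag all discovered leaks so they can be revisited later
    # (the no-argument call assumes a default tag is applied)
    index.tag_leaks()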

**What to expect**: Leaky-splits analysis finds leaks by embedding samples with
a powerful model and flagging samples from different splits that are very close
in this embedding space. Large, powerful models that were *not* trained on a
dataset can provide insight into visual and semantic similarity between images,
without creating further leaks in the process.

**Similarity**: At its core, the leaky-splits module is a wrapper for the brain's
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
backend (see :ref:`similarity backends <brain-similarity-backends>`) that implements
the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
to compute leaky splits. You can either reuse an existing similarity index by passing
its brain key via the `similarity_brain_key` argument, or have the method create one
on the fly for you. If there is a specific `Similarity` configuration you would like
to use, pass it via the `similarity_config_dict` argument.
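
A minimal sketch of both options (the brain key and configuration values shown
here are hypothetical; substitute your own):

.. code-block:: python
    :linenos:

    # reuse an existing similarity index by its brain key
    # ("img_sim" is a hypothetical key from a previous brain run)
    index, leaks = fob.compute_leaky_splits(
        dataset,
        split_tags=['train', 'test'],
        similarity_brain_key="img_sim",
    )

    # or let the method build a new index with a custom configuration
    # (the configuration keys shown are assumptions)
    index, leaks = fob.compute_leaky_splits(
        dataset,
        split_tags=['train', 'test'],
        similarity_config_dict={"backend": "sklearn"},
    )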

**Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
for you, you can still bring your own model by passing it via the `model` argument.
Alternatively, compute embeddings yourself and pass the field in which they reside.
We will handle the rest.
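
For example, a sketch of both approaches (the zoo model shown is just one
choice, and the `embeddings_field` parameter name is an assumption; check the
method's signature for the exact name):

.. code-block:: python
    :linenos:

    import fiftyone.zoo as foz

    # bring your own model
    model = foz.load_zoo_model("clip-vit-base32-torch")
    index, leaks = fob.compute_leaky_splits(
        dataset, split_tags=['train', 'test'], model=model
    )

    # or point to precomputed embeddings stored on the samples
    # (the `embeddings_field` parameter name is an assumption)
    index, leaks = fob.compute_leaky_splits(
        dataset, split_tags=['train', 'test'], embeddings_field="embeddings"
    )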

**Thresholds**: The leaky-splits module uses a threshold to decide which samples
are 'too close' and should be marked as potential leaks. This threshold can be
changed either by passing a value to the `threshold` argument of the
`compute_leaky_splits()` method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.set_threshold>`
method. The best value for your use case may vary depending on your dataset, as
well as the embeddings used. A threshold that's too large will produce many
false positives; a threshold that's too small will produce many false negatives.
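
A minimal sketch (the threshold values shown are arbitrary; tune them for your
dataset and embeddings):

.. code-block:: python
    :linenos:

    # set the threshold up front
    index, leaks = fob.compute_leaky_splits(
        dataset, split_tags=['train', 'test'], threshold=0.1
    )

    # or adjust it afterwards on the index
    index.set_threshold(0.15)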

.. image:: /images/brain/brain-leaky-splits.png
    :alt: leaky-splits
    :align: center

.. _brain-managing-runs:

Managing brain runs
Binary file added docs/source/images/brain/brain-leaky-splits.png
