Commit

Merge branch 'release/v1.1.0' of https://github.com/voxel51/fiftyone into develop
voxel51-bot committed Nov 26, 2024
2 parents 5e32142 + 738b128 commit 835ccec
Showing 2 changed files with 148 additions and 0 deletions.
148 changes: 148 additions & 0 deletions docs/source/brain.rst
@@ -74,6 +74,13 @@ workflow:
examples to train on in your data and for visualizing common modes of the
data.

* :ref:`Leaky Splits <brain-image-leaky-splits>`:
  Often when sourcing data en masse, duplicates and near duplicates can slip
  through the cracks. The FiftyOne Brain offers a *leaky-splits analysis* that
  can be used to find potential leaks between dataset splits. Such leaks can
  make model evaluation misleading, giving an overly optimistic measure of
  training quality.

.. note::

    Check out the :ref:`tutorials page <tutorials>` for detailed examples
@@ -1759,6 +1766,147 @@ samples being less representative and closer samples being more representative.
    :alt: representativeness
    :align: center


.. _brain-image-leaky-splits:

Leaky Splits
____________

Despite our best efforts, duplicates and other forms of non-IID samples
show up in our data. When such samples end up in different splits, evaluation
is compromised: it becomes easy to overestimate model capability. The FiftyOne
Brain offers a way to identify such cases in dataset splits.

The leaks of a |Dataset| or |DatasetView| can be computed directly, without the
need for the predictions of a pre-trained model, via the
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
method. The splits of a dataset can be defined in three ways: through tags, by
tagging samples with their corresponding split; through a field, by giving each
split a unique value in that field; or through views, by providing a view
corresponding to each split.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.brain as fob

    dataset = fo.load_dataset(...)

    # splits via tags
    split_tags = ['train', 'test']
    index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

    # splits via field
    split_field = 'split'  # holds split values, e.g. 'train' or 'test'
    index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

    # splits via views
    split_views = {
        'train': some_view,
        'test': some_other_view,
    }
    index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)

Here is a sample snippet to run this on the `COCO <https://cocodataset.org/#home>`_ dataset.
Try it for yourself and see what you may find.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.zoo as foz
    import fiftyone.utils.random as four
    from fiftyone.brain import compute_leaky_splits

    # load COCO
    dataset = foz.load_zoo_dataset("coco-2017", split="test")

    # set up splits via tags
    dataset.untag_samples(dataset.distinct("tags"))
    four.random_split(dataset, {"train": 0.7, "test": 0.3})

    # compute leaks
    index, leaks = compute_leaky_splits(dataset, split_tags=['train', 'test'])

Once you have these leaks, it is wise to look through them; you may gain some
insight into their source.

.. code-block:: python
    :linenos:

    session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can easily be done with the built-in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

.. code-block:: python
    :linenos:

    # if you already have a test set view
    test_set = some_view

    # it can also be found via the `split_views` property of the index;
    # make sure to use the right key based on the field/tag/view dict
    # passed when building the index
    test_set = index.split_views['test']

    # returns a view with the leaks removed
    test_set_no_leaks = index.no_leaks_view(test_set)
    session.view = test_set_no_leaks

    # do evaluations on test_set_no_leaks rather than test_set

Performance on the clean test set can be closer to the performance of the
model in the wild. If you found some leaks in your dataset, consider comparing
performance on the base test set against the clean test set.
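
For example, here is a minimal sketch of such a comparison, assuming your
samples already contain `predictions` and `ground_truth` detection fields:

.. code-block:: python
    :linenos:

    # evaluate on the original test set
    full_results = test_set.evaluate_detections(
        "predictions", gt_field="ground_truth", eval_key="eval_full"
    )

    # evaluate on the test set with the leaks removed
    clean_results = test_set_no_leaks.evaluate_detections(
        "predictions", gt_field="ground_truth", eval_key="eval_clean"
    )

    full_results.print_report()
    clean_results.print_report()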

**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
of tags, a field, or views.

**Output**: An index that allows you to look through your leaks and provides
some useful actions once they are discovered, such as automatically cleaning
the dataset with
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
or tagging leaks for future reference with
:meth:`tag_leaks() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`.
In addition, a view containing all leaks is returned. Visualizing this view
can give you insight into the source of the leaks in your dataset.
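
For instance, here is a sketch of tagging the discovered leaks for later
review (calling the method with no arguments assumes it applies a default tag;
check its signature for the available options):

.. code-block:: python
    :linenos:

    # tag all discovered leaks so they can be revisited later
    # (the no-argument call assumes a default tag is applied)
    index.tag_leaks()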

**What to expect**: Leaky-splits analysis finds leaks by embedding samples with
a powerful model and flagging samples from different splits that are very close
in this embedding space. Large, powerful models that were *not* trained on a
dataset can provide insight into visual and semantic similarity between images,
without creating further leaks in the process.

**Similarity**: At its core, the leaky-splits module is a wrapper for the brain's
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
backend (see :ref:`similarity backends <brain-similarity-backends>`) that implements
the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
to compute leaky splits. You can either reuse an existing similarity index by passing
its brain key via the `similarity_brain_key` argument, or have the method create one
on the fly for you. If there is a specific `Similarity` configuration you would like
to use, pass it via the `similarity_config_dict` argument.
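
A minimal sketch of both options (the brain key and configuration values shown
here are hypothetical; substitute your own):

.. code-block:: python
    :linenos:

    # reuse an existing similarity index by its brain key
    # ("img_sim" is a hypothetical key from a previous brain run)
    index, leaks = fob.compute_leaky_splits(
        dataset,
        split_tags=['train', 'test'],
        similarity_brain_key="img_sim",
    )

    # or let the method build a new index with a custom configuration
    # (the configuration keys shown are assumptions)
    index, leaks = fob.compute_leaky_splits(
        dataset,
        split_tags=['train', 'test'],
        similarity_config_dict={"backend": "sklearn"},
    )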

**Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
for you, you can still bring your own model by passing it via the `model` argument.
Alternatively, compute embeddings yourself and pass the field in which they reside.
We will handle the rest.
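
For example, a sketch of both approaches (the zoo model shown is just one
choice, and the `embeddings_field` parameter name is an assumption; check the
method's signature for the exact name):

.. code-block:: python
    :linenos:

    import fiftyone.zoo as foz

    # bring your own model
    model = foz.load_zoo_model("clip-vit-base32-torch")
    index, leaks = fob.compute_leaky_splits(
        dataset, split_tags=['train', 'test'], model=model
    )

    # or point to precomputed embeddings stored on the samples
    # (the `embeddings_field` parameter name is an assumption)
    index, leaks = fob.compute_leaky_splits(
        dataset, split_tags=['train', 'test'], embeddings_field="embeddings"
    )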

**Thresholds**: The leaky-splits module uses a threshold to decide which samples
are 'too close' and should be marked as potential leaks. This threshold can be
changed either by passing a value to the `threshold` argument of the
`compute_leaky_splits()` method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.set_threshold>`
method. The best value for your use case may vary depending on your dataset, as
well as the embeddings used. A threshold that's too large will produce many
false positives; a threshold that's too small will produce many false negatives.
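
A minimal sketch (the threshold values shown are arbitrary; tune them for your
dataset and embeddings):

.. code-block:: python
    :linenos:

    # set the threshold up front
    index, leaks = fob.compute_leaky_splits(
        dataset, split_tags=['train', 'test'], threshold=0.1
    )

    # or adjust it afterwards on the index
    index.set_threshold(0.15)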

.. image:: /images/brain/brain-leaky-splits.png
    :alt: leaky-splits
    :align: center

.. _brain-managing-runs:

Managing brain runs
Binary file added docs/source/images/brain/brain-leaky-splits.png
