[Docs] Leaky splits #5189

Merged
merged 12 commits into from
Nov 26, 2024
90 changes: 90 additions & 0 deletions docs/source/brain.rst
@@ -74,6 +74,13 @@ workflow:
examples to train on in your data and for visualizing common modes of the
data.

* :ref:`Leaky Splits <brain-image-leaky-splits>`:
  When sourcing data en masse, duplicates and near-duplicates can slip
  through the cracks. The FiftyOne Brain offers a *leaky-splits check* that
  can be used to find potential leaks between dataset splits. Such leaks can
  be misleading when evaluating a model, giving an overly optimistic measure
  of the quality of the trained model.

.. note::

Check out the :ref:`tutorials page <tutorials>` for detailed examples
Contributor

Are you adding a tutorial on data leakage to the tutorials section?

Contributor Author

Should I? What would this look like?

@@ -1759,6 +1766,89 @@ samples being less representative and closer samples being more representative.
:alt: representativeness
:align: center


.. _brain-image-leaky-splits:

Leaky Splits
____________

Despite our best efforts, duplicates and other forms of non-IID samples
show up in our data. When these samples end up in different splits,
evaluation results can be skewed: a model that has effectively already seen
a test sample during training will score better on it, so it is easy to
overestimate model capability. The FiftyOne Brain offers a way to identify
such cases in dataset splits.
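
To make the effect concrete, here is a minimal, self-contained sketch. It does
not use FiftyOne: the 1-nearest-neighbor "model" and the toy values are
assumptions chosen purely for illustration. It compares accuracy on a clean
test set with accuracy on the same test set after copies of training samples
have leaked into it.

```python
def predict(train, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    nearest = min(train, key=lambda item: abs(item[0] - x))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(predict(train, x) == label for x, label in test)
    return correct / len(test)


# Toy 1-D data (hypothetical values): (feature, label) pairs
train = [(0.0, "cat"), (10.0, "dog")]

# Genuinely unseen test samples
clean_test = [(2.0, "cat"), (4.0, "dog")]

# The same test set plus two exact copies of training samples (the "leaks")
leaky_test = clean_test + [(0.0, "cat"), (10.0, "dog")]

print("clean test accuracy:", accuracy(train, clean_test))
print("leaky test accuracy:", accuracy(train, leaky_test))
```

On this toy data the clean accuracy is 0.5 while the leaky accuracy is 0.75,
even though the model itself is unchanged.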

Leaks between the splits of a |Dataset| can be computed directly, without
needing predictions from a pre-trained model, via the
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
method:

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.brain as fob

    dataset = fo.load_dataset(...)

    # Option 1: splits defined via sample tags
    split_tags = ["train", "test"]
    index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

    # Option 2: splits defined via a field
    split_field = "split"  # field holding split values, e.g. "train" or "test"
    index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

    # Option 3: splits defined via views
    # (`some_view` and `some_other_view` are placeholders for your own views)
    split_views = {
        "train": some_view,
        "test": some_other_view,
    }
    index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)

**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
of tags, a field, or views.

**Output**: An index that lets you look through your leaks and provides some
useful actions once they are discovered, such as automatically cleaning the
dataset with `view_without_leaks` or tagging the leaks for later review with
`tag_leaks`. In addition, a view containing all leaks is returned. Browsing
this view can give you insight into the source of the leaks in your dataset.

**What to expect**: The leaky-splits check finds leaks by embedding samples
with a powerful model and searching this embedding space for samples in
different splits that lie very close together. Large, powerful models that
were *not* trained on the dataset can provide insight into visual and semantic
similarity between images without creating further leaks in the process.
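
The idea can be sketched without FiftyOne as follows. This is an illustration
of the general technique, not the Brain's actual implementation: the helper
names, the use of cosine distance, the brute-force pairwise search, and the toy
embedding values are all assumptions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def find_cross_split_leaks(embeddings, splits, threshold):
    """Return (id_a, id_b, distance) for close pairs in different splits.

    embeddings: dict mapping sample id -> embedding vector
    splits: dict mapping sample id -> split name
    """
    ids = sorted(embeddings)
    leaks = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if splits[a] == splits[b]:
                continue  # same split: possibly a duplicate, but not a leak
            dist = cosine_distance(embeddings[a], embeddings[b])
            if dist <= threshold:
                leaks.append((a, b, dist))
    return leaks


# Toy embeddings (hypothetical values, for illustration only)
embeddings = {
    "img1": [1.0, 0.0, 0.0],
    "img2": [0.99, 0.05, 0.0],  # near-duplicate of img1
    "img3": [0.0, 1.0, 0.0],
    "img4": [0.0, 0.0, 1.0],
}
splits = {"img1": "train", "img2": "test", "img3": "train", "img4": "test"}

leaks = find_cross_split_leaks(embeddings, splits, threshold=0.1)
print([(a, b) for a, b, _ in leaks])
```

Only the pair `img1` (train) and `img2` (test) is flagged here: the two are
nearly identical in embedding space and sit in different splits. A real
similarity backend would typically replace the quadratic loop with a
nearest-neighbor index.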

**Similarity**: At its core, the leaky-splits module is a wrapper for the Brain's
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
backend (see :ref:`similarity backends <brain-similarity-backends>`) that implements
the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
to compute leaky splits. You can either reuse an existing similarity index by passing
its brain key to the `similarity_brain_key` argument, or have the method create one on
the fly for you. If there is a specific `Similarity` configuration you would like
to use, pass it via the `similarity_config_dict` argument.

**Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
for you, you can still bring your own model by passing it in the `model` argument.
Alternatively, compute embeddings yourself and pass the name of the field that they
reside in. We will handle the rest.

**Thresholds**: The leaky-splits module uses a threshold to decide which samples
are 'too close' and marks them as potential leaks. This threshold can be changed
either by passing a value to the `threshold` argument of the `compute_leaky_splits`
method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
method. The best value for your use case depends on your dataset and on the
embeddings used. A threshold that is too large will produce many false positives,
while a threshold that is too small will produce many false negatives.
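
The trade-off can be seen in a small threshold sweep. This is a sketch under
stated assumptions, not FiftyOne code: the sample ids, the distances, and the
`flagged_pairs` helper are hypothetical.

```python
# Hypothetical distances between cross-split pairs (smaller = more similar)
pair_distances = {
    ("train_01", "test_07"): 0.01,  # exact duplicate, re-encoded
    ("train_02", "test_03"): 0.08,  # same scene, slightly cropped
    ("train_05", "test_09"): 0.25,  # same object class, different image
    ("train_08", "test_02"): 0.60,  # unrelated images
}


def flagged_pairs(distances, threshold):
    """Return the pairs whose distance is at or below the threshold."""
    return sorted(pair for pair, dist in distances.items() if dist <= threshold)


for threshold in (0.005, 0.1, 0.4):
    flagged = flagged_pairs(pair_distances, threshold)
    print(f"threshold={threshold}: {len(flagged)} pair(s) flagged")
```

With these toy values, a threshold of 0.005 flags nothing and misses both true
duplicates (false negatives), 0.1 flags exactly the two duplicate pairs, and
0.4 additionally flags a pair that merely shares an object class (a false
positive).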

.. image:: /images/brain/brain-leaky-splits.png
   :alt: leaky-splits
   :align: center

Comment on lines +1906 to +1909
Contributor

💡 Codebase verification

⚠️ Potential issue

Image file size is smaller than reported but still needs optimization

The actual size of 'brain-leaky-splits.png' is 7.6MB, not 26MB as initially stated. While this is better than previously thought, it's still larger than optimal for version control. The recommendation to compress remains valid but with less urgency:

  • Current size: 7.6MB
  • Target size: <1MB
  • Use image optimization tools while maintaining visual quality
  • Consider WebP format if browser support allows
🔗 Analysis chain

Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression

The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:

  • Increase clone and fetch times
  • Consume excessive storage
  • Impact repository performance

Please compress this image while maintaining acceptable visual quality. Recommended actions:

  • Use image optimization tools (e.g., ImageOptim, TinyPNG)
  • Reduce dimensions if unnecessarily large
  • Consider converting to a more efficient format (e.g., WebP)
  • Aim for a compressed size under 1MB
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'

Length of output: 80

.. _brain-managing-runs:

Managing brain runs
Binary file added docs/source/images/brain/brain-leaky-splits.png