[Docs] Leaky splits #5189

Merged
merged 12 commits into from
Nov 26, 2024
143 changes: 143 additions & 0 deletions docs/source/brain.rst
@@ -74,6 +74,13 @@ workflow:
examples to train on in your data and for visualizing common modes of the
data.

* :ref:`Leaky Splits <brain-image-leaky-splits>`:
  When sourcing data en masse, duplicates and near duplicates can slip
  through the cracks. The FiftyOne Brain offers a *leaky-splits analysis* that
  can be used to find potential leaks between dataset splits. Such leaks can
  be misleading when evaluating a model, because they give an overly
  optimistic measure of the quality of training.

.. note::

Check out the :ref:`tutorials page <tutorials>` for detailed examples
Contributor

Are you adding a tutorial on data leakage to the tutorials section?

Contributor Author

Should I? What would this look like?

@@ -1759,6 +1766,142 @@ samples being less representative and closer samples being more representative.
:alt: representativeness
:align: center


.. _brain-image-leaky-splits:

Leaky Splits
____________

Despite our best efforts, duplicates and other forms of non-IID samples
show up in our data. When such samples end up in different splits, the
evaluation of a model is compromised: the model is effectively tested on data
it has already seen, so it is easy to overestimate its capability. The
FiftyOne Brain offers a way to identify such cases in dataset splits.

The leaks of a |Dataset| or |DatasetView| can be computed directly, without the
need for the predictions of a pre-trained model, via the
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
method:

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.brain as fob

    dataset = fo.load_dataset(...)

    # Splits defined via sample tags
    split_tags = ['train', 'test']
    index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

    # Splits defined via a field that holds split values, e.g. 'train' or 'test'
    split_field = 'split'
    index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

    # Splits defined via views
    split_views = {
        'train': some_view,
        'test': some_other_view,
    }
    index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)

Here is a sample snippet that runs this analysis on the
`COCO <https://cocodataset.org/#home>`_ dataset. Try it for yourself and see
what you find.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.zoo as foz
    import fiftyone.utils.random as four
    from fiftyone.brain import compute_leaky_splits

    # Load the COCO-2017 test split and clear its existing tags
    coco = foz.load_zoo_dataset("coco-2017", split="test")
    coco.untag_samples(coco.distinct("tags"))

    # Randomly re-split the samples via tags, then look for leaks
    four.random_split(coco, {"train": 0.7, "test": 0.3})
    index, leaks = compute_leaky_splits(coco, split_tags=['train', 'test'])
Contributor

Maybe print out the number of leaks? Right now it isn't 100% clear the complete value this provides. You're underselling it

Contributor Author

I think the best way to convince the reader is either through having them run it and seeing the visuals once they open the app, or having them read the blog post. I think just the number doesn't tell the reader much, because how can they trust that the algorithm doesn't have a lot of false positives?


    session = fo.launch_app(leaks)

Once you have these leaks, it is wise to look through them. You may gain some insight
into the source of the leaks.

.. code-block:: python
    :linenos:

    # Open the view of potential leaks in the App
    session = fo.launch_app(leaks)

Before evaluating your model on your test set, consider creating a version of
it with the leaks removed. This is easily done with the built-in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

Comment on lines +1841 to +1844
Contributor

⚠️ Potential issue

Fix method name inconsistency.

The documentation refers to no_leaks_view() but the actual method name is get_no_leaks_view().

-:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
+:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`

.. code-block:: python
    :linenos:

    # If you already have a view of your test set, use it directly
    test_set = some_view

    # The test set can also be retrieved from the index via `split_views`.
    # Use the key that matches the field value/tag/view dict key that was
    # passed when building the index
    test_set = index.split_views['test']

    # Returns a view of the test set with the leaks removed
    test_set_no_leaks = index.no_leaks_view(test_set)
    session = fo.launch_app(leaks)

    # Run evaluations on `test_set_no_leaks` rather than `test_set`

Performance on the clean test set is likely to be closer to the performance of
the model in the wild. If you found leaks in your dataset, consider comparing
performance on the base test set against the clean test set.
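
As a minimal sketch of that comparison, the snippet below computes accuracy on
a toy test set with and without the samples flagged as leaks. The sample IDs,
labels, predictions, and the `accuracy()` helper are all hypothetical and are
not part of FiftyOne; in practice the clean subset would come from the index.

```python
def accuracy(sample_ids, labels, predictions):
    """Fraction of the given samples whose prediction matches its label."""
    correct = sum(1 for sid in sample_ids if labels[sid] == predictions[sid])
    return correct / len(sample_ids)


# Hypothetical ground truth and model predictions for five test samples
labels = {"t1": "cat", "t2": "dog", "t3": "cat", "t4": "dog", "t5": "cat"}
predictions = {"t1": "cat", "t2": "dog", "t3": "dog", "t4": "cat", "t5": "cat"}

# Hypothetical IDs of test samples that were flagged as leaks
leaky_ids = {"t1", "t2"}

test_ids = sorted(labels)
clean_test_ids = [sid for sid in test_ids if sid not in leaky_ids]

base_acc = accuracy(test_ids, labels, predictions)
clean_acc = accuracy(clean_test_ids, labels, predictions)

print(f"base test set:  {base_acc:.2f}")
print(f"clean test set: {clean_acc:.2f}")
```

In this toy data the leaked samples are exactly the ones the model gets right,
so removing them lowers the score. A gap in that direction is the kind of
optimistic bias that leaks can introduce.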

**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
of tags, a field, or views.

**Output**: An index that lets you look through your leaks and provides
useful actions once they are discovered, such as automatically cleaning the
dataset with
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
or tagging the leaks for future reference with
:meth:`tag_leaks() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`.
In addition, a view containing all leaks is returned. Visualizing this view
can give you insight into the source of the leaks in your dataset.

**What to expect**: The leaky-splits analysis finds leaks by embedding samples
with a powerful model and then finding samples in different splits that lie
very close together in this embedding space. Large, powerful models that were
*not* trained on a dataset can provide insight into visual and semantic
similarity between images, without creating further leaks in the process.
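
The idea can be sketched in a few lines of plain Python. This is only an
illustration of the technique under simplifying assumptions, not the Brain's
implementation: the toy embeddings, the choice of cosine distance, the
`find_cross_split_leaks()` helper, and the threshold value are all
hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def find_cross_split_leaks(embeddings, splits, threshold):
    """Return (id_a, id_b, distance) for every pair of samples that are in
    different splits and closer than `threshold`."""
    ids = sorted(embeddings)
    leaks = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if splits[a] == splits[b]:
                continue  # only pairs that cross a split boundary can leak
            dist = cosine_distance(embeddings[a], embeddings[b])
            if dist < threshold:
                leaks.append((a, b, dist))
    return leaks


# Toy embeddings; img_2 is a near duplicate of img_1
embeddings = {
    "img_1": [1.0, 0.0, 0.0],
    "img_2": [0.99, 0.05, 0.0],
    "img_3": [0.0, 1.0, 0.0],
    "img_4": [0.0, 0.0, 1.0],
}
splits = {"img_1": "train", "img_2": "test", "img_3": "train", "img_4": "test"}

# The threshold here is an arbitrary value chosen for the toy data
leaks = find_cross_split_leaks(embeddings, splits, threshold=0.2)
for a, b, dist in leaks:
    print(f"{a} ({splits[a]}) <-> {b} ({splits[b]}): distance {dist:.4f}")
```

A brute-force pairwise comparison like this scales quadratically with the
number of samples, which is one reason to delegate the neighbor search to a
similarity index in practice.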

**Similarity**: At its core, the leaky-splits module is a wrapper for the Brain's
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
backend (see :ref:`similarity backends <brain-similarity-backends>`) that implements
the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
to compute leaky splits. You can either use an existing similarity index by passing
its brain key to the `similarity_brain_key` argument, or have the method create one on
the fly for you. If there is a specific configuration for `Similarity` that you would
like to use, pass it via the `similarity_config_dict` argument.

**Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
for you, you can still bring your own model by passing it via the `model` argument.
Alternatively, compute embeddings yourself and pass the field in which they are
stored. The method will handle the rest.

**Thresholds**: The leaky-splits module uses a threshold to decide which samples
are 'too close' and marks them as potential leaks. This threshold can be changed
either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
method. The best value for your use case depends on your dataset and on the
embeddings used. A threshold that is too large produces many false positives, while
a threshold that is too small produces many false negatives.
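
To make this trade-off concrete, here is a small sketch that counts both kinds
of error at several thresholds. The distances, the ground-truth flags, and the
`count_errors()` helper are invented for illustration; they are not
recommended values and do not come from a real dataset.

```python
# (distance, is_true_leak) for six hypothetical cross-split pairs
pairs = [
    (0.02, True),
    (0.05, True),
    (0.12, True),
    (0.18, False),
    (0.30, False),
    (0.45, False),
]


def count_errors(pairs, threshold):
    """Return (false_positives, false_negatives) when every pair closer
    than `threshold` is flagged as a leak."""
    false_pos = sum(1 for d, is_leak in pairs if d < threshold and not is_leak)
    false_neg = sum(1 for d, is_leak in pairs if d >= threshold and is_leak)
    return false_pos, false_neg


results = {t: count_errors(pairs, t) for t in (0.03, 0.15, 0.40)}
for t, (fp, fn) in results.items():
    print(f"threshold={t:.2f}: {fp} false positives, {fn} false negatives")
```

On real data the ground truth is unknown, so a practical approach is to try a
few thresholds and inspect the flagged samples visually.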

Comment on lines +1897 to +1905
Contributor

🛠️ Refactor suggestion

Add threshold guidance and examples.

The threshold explanation would benefit from:

  1. Recommended threshold ranges for common use cases
  2. Examples of how different thresholds affect results
  3. A method to automatically determine optimal threshold values

.. image:: /images/brain/brain-leaky-splits.png
   :alt: leaky-splits
   :align: center

Comment on lines +1906 to +1909
Contributor

💡 Codebase verification

⚠️ Potential issue

Image file size is smaller than reported but still needs optimization

The actual size of 'brain-leaky-splits.png' is 7.6MB, not 26MB as initially stated. While this is better than previously thought, it's still larger than optimal for version control. The recommendation to compress remains valid but with less urgency:

  • Current size: 7.6MB
  • Target size: <1MB
  • Use image optimization tools while maintaining visual quality
  • Consider WebP format if browser support allows
🔗 Analysis chain

Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression

The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:

  • Increase clone and fetch times
  • Consume excessive storage
  • Impact repository performance

Please compress this image while maintaining acceptable visual quality. Recommended actions:

  • Use image optimization tools (e.g., ImageOptim, TinyPNG)
  • Reduce dimensions if unnecessarily large
  • Consider converting to a more efficient format (e.g., WebP)
  • Aim for a compressed size under 1MB
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'

Length of output: 80

.. _brain-managing-runs:

Managing brain runs
Binary file added docs/source/images/brain/brain-leaky-splits.png