[Docs] Leaky splits #5189
Changes from 11 commits
5d835ef
46c53ba
e345c05
40ff57c
5a823a4
d67015a
02d6e70
259e14e
19ebdad
873287e
0900594
e948e4d
@@ -74,6 +74,13 @@ workflow:
  examples to train on in your data and for visualizing common modes of the
  data.

* :ref:`Leaky Splits <brain-image-leaky-splits>`:
  When sourcing data en masse, duplicates and near duplicates can slip
  through the cracks. The FiftyOne Brain offers a *leaky-splits analysis* that
  can be used to find potential leaks between dataset splits. Such leaks can
  be misleading when evaluating a model, giving an overly optimistic measure
  of the quality of training.

.. note::

    Check out the :ref:`tutorials page <tutorials>` for detailed examples

@@ -1759,6 +1766,142 @@ samples being less representative and closer samples being more representative.
   :alt: representativeness
   :align: center


.. _brain-image-leaky-splits:

Leaky Splits
____________

Despite our best efforts, duplicates and other forms of non-IID samples
show up in our data. When such samples end up in different splits, it becomes
easy to overestimate a model's capability during evaluation. The FiftyOne
Brain offers a way to identify these cases in dataset splits.

The leaks of a |Dataset| or |DatasetView| can be computed directly, without
the need for the predictions of a pre-trained model, via the
:meth:`compute_leaky_splits() <fiftyone.brain.compute_leaky_splits>`
method:

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.brain as fob

    dataset = fo.load_dataset(...)

    # Option 1: splits defined via sample tags
    split_tags = ["train", "test"]
    index, leaks = fob.compute_leaky_splits(dataset, split_tags=split_tags)

    # Option 2: splits defined via a field
    split_field = "split"  # field holding split values, e.g. "train" or "test"
    index, leaks = fob.compute_leaky_splits(dataset, split_field=split_field)

    # Option 3: splits defined via views
    split_views = {
        "train": some_view,
        "test": some_other_view,
    }
    index, leaks = fob.compute_leaky_splits(dataset, split_views=split_views)

Here is a sample snippet that runs this analysis on the
`COCO <https://cocodataset.org/#home>`_ dataset. Try it for yourself and see
what you find.

.. code-block:: python
    :linenos:

    import fiftyone as fo
    import fiftyone.zoo as foz
    import fiftyone.utils.random as four
    from fiftyone.brain import compute_leaky_splits

    # Load the COCO-2017 test split from the zoo
    coco = foz.load_zoo_dataset("coco-2017", split="test")

    # Clear existing tags so that only the new split tags remain
    coco.untag_samples(coco.distinct("tags"))

    # Randomly assign 70% of the samples to "train" and 30% to "test"
    four.random_split(coco, {"train": 0.7, "test": 0.3})
    index, leaks = compute_leaky_splits(coco, split_tags=["train", "test"])

Review comment: Maybe print out the number of leaks? Right now it isn't 100% clear the complete value this provides. You're underselling it.

Reply: I think the best way to convince the reader is either through having them run it and seeing the visuals once they open the app, or having them read the blog post. I think just the number doesn't tell the reader much, because how can they trust that the algorithm doesn't have a lot of false positives?

    session = fo.Session(leaks)

Once you have these leaks, it is wise to look through them, as doing so may
give you some insight into their source.

.. code-block:: python
    :linenos:

    session = fo.Session(leaks)

Before evaluating your model on your test set, consider getting a version of it
with the leaks removed. This can be easily done with the built-in method
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`.

Comment on lines +1841 to +1844

Fix method name inconsistency. The documentation refers to

Suggested change:

-:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
+:meth:`get_no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.get_no_leaks_view>`

.. code-block:: python
    :linenos:

    # If you already have a view of your test set, use it directly
    test_set = some_view

    # Otherwise, retrieve it from the index's `split_views` variable. Use the
    # key that matches the tag, field value, or view-dict key that was passed
    # when building the index
    test_set = index.split_views["test"]

    # Returns a view of the test set with the leaks removed
    test_set_no_leaks = index.no_leaks_view(test_set)
    session = fo.Session(leaks)

    # Run evaluations on test_set_no_leaks rather than test_set

Performance on the clean test set is likely to be closer to the performance of
the model in the wild. If you found leaks in your dataset, consider comparing
performance on the base test set against the clean test set.

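The comparison between the base and clean test sets can be sketched without
any FiftyOne calls. The snippet below is only an illustration under stated
assumptions: the per-sample records, the ``leaky_ids`` set, and the
``accuracy()`` helper are hypothetical stand-ins, not part of the FiftyOne API.

```python
def accuracy(records):
    """Return the fraction of records whose prediction matches the label."""
    correct = sum(1 for r in records if r["prediction"] == r["label"])
    return correct / len(records)


# Hypothetical per-sample results on the test split
test_results = [
    {"id": "t1", "label": "cat", "prediction": "cat"},
    {"id": "t2", "label": "dog", "prediction": "dog"},
    {"id": "t3", "label": "cat", "prediction": "dog"},
    {"id": "t4", "label": "dog", "prediction": "dog"},
    {"id": "t5", "label": "cat", "prediction": "dog"},
]

# Assumption: these test samples were flagged as leaks from the train split
leaky_ids = {"t1", "t2"}

# The "clean" test set is the base test set minus the flagged samples
clean_results = [r for r in test_results if r["id"] not in leaky_ids]

base_acc = accuracy(test_results)
clean_acc = accuracy(clean_results)
print(f"base: {base_acc:.2f}  clean: {clean_acc:.2f}")
```

In this toy data the leaked samples are exactly the ones the model gets right,
so the base score overstates performance relative to the clean score.
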
**Input**: A |Dataset| or |DatasetView|, and a definition of splits through one
of tags, a field, or views.

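To make the three ways of defining splits concrete, here is a minimal sketch
that models samples as plain dicts. Everything in it is hypothetical: the
sample records and the ``splits_from_*`` helpers only illustrate how tags, a
field, or explicit views can describe the same partition, and they say nothing
about how FiftyOne resolves splits internally.

```python
# Hypothetical samples, modeled as plain dicts for illustration only
samples = [
    {"id": "a", "tags": ["train"], "split": "train"},
    {"id": "b", "tags": ["test"], "split": "test"},
    {"id": "c", "tags": ["train", "reviewed"], "split": "train"},
]


def splits_from_tags(samples, split_tags):
    """Group sample IDs by which of the given tags each sample carries."""
    return {
        tag: {s["id"] for s in samples if tag in s["tags"]}
        for tag in split_tags
    }


def splits_from_field(samples, split_field):
    """Group sample IDs by the value stored in the given field."""
    groups = {}
    for s in samples:
        groups.setdefault(s[split_field], set()).add(s["id"])
    return groups


def splits_from_views(split_views):
    """Group sample IDs from explicit name -> list-of-samples mappings."""
    return {name: {s["id"] for s in view} for name, view in split_views.items()}


by_tags = splits_from_tags(samples, ["train", "test"])
by_field = splits_from_field(samples, "split")
by_views = splits_from_views({
    "train": [samples[0], samples[2]],
    "test": [samples[1]],
})
print(by_tags == by_field == by_views)
```
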
**Output**: An index that lets you look through your leaks and provides
useful actions once they are discovered, such as automatically cleaning the
dataset with
:meth:`no_leaks_view() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.no_leaks_view>`
or tagging the leaks for future reference with
:meth:`tag_leaks() <fiftyone.brain.internal.core.leaky_splits.LeakySplitsIndex.tag_leaks>`.
In addition, a view containing all leaks is returned. Visualizing this view
can give you insight into the source of the leaks in your dataset.

**What to expect**: The leaky-splits analysis finds leaks by embedding samples
with a powerful model and then finding samples in different splits that are
very close to each other in this embedding space. Large, powerful models that
were *not* trained on the dataset can provide insight into visual and semantic
similarity between images without creating further leaks in the process.

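The idea of finding very close samples in different splits can be sketched in
plain Python. This is a conceptual illustration only, not the FiftyOne Brain's
implementation: the toy 2D vectors stand in for real model embeddings, cosine
distance is an assumed metric, and ``find_cross_split_leaks()`` is a
hypothetical helper that compares every pair by brute force.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def find_cross_split_leaks(embeddings, splits, threshold):
    """Return (id_a, id_b, distance) for pairs of samples that are in
    different splits and at most `threshold` apart."""
    ids = sorted(embeddings)
    found = []
    for i, id_a in enumerate(ids):
        for id_b in ids[i + 1:]:
            if splits[id_a] == splits[id_b]:
                continue  # same split, so this pair cannot be a leak
            dist = cosine_distance(embeddings[id_a], embeddings[id_b])
            if dist <= threshold:
                found.append((id_a, id_b, dist))
    return found


# Toy 2D "embeddings"; real embeddings would come from a model
embeddings = {
    "img_1": [1.0, 0.0],
    "img_2": [0.999, 0.01],  # near duplicate of img_1
    "img_3": [0.0, 1.0],
    "img_4": [0.7, 0.7],
}
splits = {"img_1": "train", "img_2": "test", "img_3": "train", "img_4": "test"}

leak_pairs = find_cross_split_leaks(embeddings, splits, threshold=0.01)
print([(a, b) for a, b, _ in leak_pairs])
```

Only the ``img_1``/``img_2`` pair is flagged here: it is the only pair that is
both split across train and test and within the chosen distance.
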
**Similarity**: At its core, the leaky-splits module is a wrapper for the brain's
:class:`SimilarityIndex <fiftyone.brain.similarity.SimilarityIndex>`. Any similarity
backend (see :ref:`similarity backends <brain-similarity-backends>`) that implements
the :class:`DuplicatesMixin <fiftyone.brain.similarity.DuplicatesMixin>` can be used
to compute leaky splits. You can either use an existing similarity index by passing
its brain key to the argument `similarity_brain_key`, or have the method create one on
the fly for you. If there is a specific configuration for `Similarity` you would like
to use, pass it in the argument `similarity_config_dict`.

**Models and Embeddings**: If you opt for the method to create a `SimilarityIndex`
for you, you can still bring your own model by passing it in the `model` argument.
Alternatively, compute embeddings yourself and pass the name of the field that
they reside on. The method will handle the rest.

**Thresholds**: The leaky-splits module uses a threshold to decide which samples
are 'too close' and marks them as potential leaks. This threshold can be changed
either by passing a value to the `threshold` argument of the `compute_leaky_splits()`
method, or by using the
:meth:`set_threshold() <fiftyone.brain.internal.core.leaky_splits.SimilarityIndex.set_threshold>`
method. The best value for your use case may vary depending on your dataset, as well
as the embeddings used. A threshold that is too large will produce many false
positives, while a threshold that is too small will produce many false negatives.

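The trade-off between false positives and false negatives can be seen with a
small threshold sweep. The sketch below is illustrative only: the pair
distances and their hand-assigned "true leak" labels are made up, and
``count_errors()`` is a hypothetical helper rather than anything provided by
FiftyOne.

```python
# Hypothetical cross-split pair distances with hand-assigned labels:
# (distance between the two samples, whether the pair is truly a leak)
pairs = [
    (0.02, True),
    (0.05, True),
    (0.11, True),
    (0.15, False),
    (0.30, False),
    (0.45, False),
]


def count_errors(pairs, threshold):
    """Return (false_positives, false_negatives) when every pair with
    distance <= threshold is flagged as a leak."""
    false_pos = sum(1 for d, is_leak in pairs if d <= threshold and not is_leak)
    false_neg = sum(1 for d, is_leak in pairs if d > threshold and is_leak)
    return false_pos, false_neg


for threshold in (0.01, 0.12, 0.50):
    fp, fn = count_errors(pairs, threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

In practice there are no ground-truth labels, so inspecting flagged pairs
visually at a few candidate thresholds plays the role of the labels here.
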
Comment on lines +1897 to +1905

Refactor suggestion: Add threshold guidance and examples. The threshold explanation would benefit from:

.. image:: /images/brain/brain-leaky-splits.png
   :alt: leaky-splits
   :align: center

Comment on lines +1906 to +1909

Codebase verification: Image file size is smaller than reported but still needs optimization. The actual size of 'brain-leaky-splits.png' is 7.6MB, not 26MB as initially stated. While this is better than previously thought, it's still larger than optimal for version control. The recommendation to compress remains valid but with less urgency:

Analysis chain: Image file 'brain-leaky-splits.png' is 26MB - requires immediate compression. The image file size of 26MB is significantly large for version control. Large binary files in Git repositories:

Please compress this image while maintaining acceptable visual quality. Recommended actions:

Scripts executed. The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check the size of the image file
ls -lh docs/source/images/brain/brain-leaky-splits.png | awk '{print $5}'

Length of output: 80

.. _brain-managing-runs:

Managing brain runs

Review comment: Are you adding a tutorial on data leakage to the tutorials section?

Reply: Should I? What would this look like?