ENH: speed up `vak.csv.has_unlabeled` #243

yardencsGitHub · 2020-07-26T20:02:44Z

vak.csv.has_unlabeled goes over all files and looks for 'unlabeled'.
The function can run faster if it just returns True the first time it finds a file with unlabeled segments.
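A minimal sketch of the early-return idea, not vak's actual code. The names `any_file_has_unlabeled` and `fake_check` are hypothetical, and the per-file check is passed in as a callable because the real check is not shown in this thread. `any()` over a generator stops at the first True, so later files are never examined.

```python
from typing import Callable, Iterable, List


def any_file_has_unlabeled(
    paths: Iterable[str],
    file_has_unlabeled: Callable[[str], bool],
) -> bool:
    """Return True as soon as one file is found to have unlabeled segments.

    ``file_has_unlabeled`` is a hypothetical per-file check supplied by the caller.
    """
    # any() consumes the generator lazily and short-circuits on the first True
    return any(file_has_unlabeled(path) for path in paths)


# Stand-in check that records which files were actually examined
checked: List[str] = []


def fake_check(path: str) -> bool:
    checked.append(path)
    return path == "b.wav"  # pretend only this file has unlabeled segments


result = any_file_has_unlabeled(["a.wav", "b.wav", "c.wav"], fake_check)
print(result, checked)
```

Here the third file is never checked, which is where the speed-up would come from when an unlabeled file appears early in the dataset.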

NickleDave · 2022-07-30T00:45:56Z

Looking at this again.

I'm a bit paranoid about breaking someone's pipeline by assuming that we always want to "greedily" decide that has_unlabeled is True.
But still, reading through the function, I think there is a faster way to determine whether there are unlabeled segments.
We should instead load only the annotations, measure the inter-segment intervals by computing onsets[1:] - offsets[:-1], and then test whether any are greater than 0. In most cases this will be true, except if someone labeled the silent intervals between vocalization segments.
If we measure inter-segment intervals this way, we also need to test for unlabeled segments before and after all annotated segments.
But this would still be faster than what the function does now, even when looping over everything, since we skip a slow-ish list comprehension + function call that does several other computations, replacing those with a single numpy array subtraction.
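The inter-segment interval check described above can be sketched as follows. This is an illustration only: the function name `has_gaps_between_segments` is hypothetical, and onsets and offsets are assumed to be sorted, in the same units, and the same length.

```python
import numpy as np


def has_gaps_between_segments(onsets, offsets) -> bool:
    """Return True if any interval between consecutive segments is > 0.

    Assumes onsets/offsets are sorted and have the same length.
    """
    onsets = np.asarray(onsets, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    # gap i = onset of segment i+1 minus offset of segment i
    gaps = onsets[1:] - offsets[:-1]
    # with zero or one segment, gaps is empty and np.any returns False
    return bool(np.any(gaps > 0))


# Segments separated by silence
with_gaps = has_gaps_between_segments([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])
# Segments that abut each other exactly (e.g. silence was labeled too)
contiguous = has_gaps_between_segments([0.0, 0.5, 1.0], [0.5, 1.0, 1.5])
print(with_gaps, contiguous)
```

This only covers intervals between segments; it says nothing about the period before the first onset or after the last offset. Real annotations may also need a small tolerance instead of an exact comparison with 0.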

I should test to see if this really improves anything, but my guess is it will.

This is the only function in this csv module.
But it doesn't really have anything to do with a .csv, except that it needs a dataset .csv to get the annotation files. Would prefer to move it to an annotation module? And get rid of this poorly-named csv module.

NickleDave · 2022-07-30T10:53:40Z

It does look like we can cut the time roughly in half by simply using crowsetta.Annotation.seq alone; see the attached PDF of a notebook.

test-has-unlabeled.pdf

Timing is not affected by calling a "helper" function that runs logic on a single Annotation.

So:

- add has_unlabeled to vak.annotation, operating on a single crowsetta.Annotation
- add a note to the "raise crowsetta" issue that this will need to be more specific (a seq annot) when we raise the version
- rename vak.csv -> vak.dataset; that clashes a bit with vak.datasets, but writing vak.datasets.has_unlabeled is weird
- rewrite vak.dataset.has_unlabeled to use vak.annotation.has_unlabeled
- write tests for all the above

NickleDave · 2022-07-30T11:33:07Z

Thinking about this more ☹️

Edge cases:

- There are unlabeled intervals between segments, but the very first segment starts right at 0. So we can't just look at "is the first onset > 0" to determine whether there are any unlabeled segments.
- There are no unlabeled intervals between segments, but (for some weird reason) the period before the first onset and the period after the last offset are unlabeled. This would not be detected by just asking whether there are any unlabeled intervals between the labeled segments.
  - We can detect whether there's an unlabeled period before the first segment by just asking "is the first onset > 0". So this at least lets us detect the first half of the weird edge case with unlabeled periods before and after a continuous set of all segments labeled.
  - What we can't do is detect whether there's an unlabeled period after the last segment from the annotations alone, because these typically won't contain the total duration of the vocalization. We need to pass that in separately.

Make sure unit tests cover all these edge cases
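A sketch of a check that covers the edge cases above, assuming the total duration is passed in separately. This is not the version that was merged into vak; the signature is hypothetical, and treating an empty annotation as "entirely unlabeled" is an assumption made here for illustration.

```python
import numpy as np


def has_unlabeled(onsets, offsets, duration: float) -> bool:
    """Return True if any part of [0, duration] is not covered by a segment.

    Hypothetical sketch: assumes sorted onsets/offsets in the same units
    as ``duration``.
    """
    onsets = np.asarray(onsets, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    if onsets.size == 0:
        # assumption: no segments at all means everything is unlabeled
        return duration > 0
    if onsets[0] > 0:
        return True  # unlabeled period before the first segment
    if offsets[-1] < duration:
        return True  # unlabeled period after the last segment
    # unlabeled intervals between consecutive segments
    return bool(np.any(onsets[1:] - offsets[:-1] > 0))


# One case per edge case: (onsets, offsets, duration)
cases = {
    "gaps_first_onset_at_zero": ([0.0, 1.0], [0.5, 2.0], 2.0),
    "no_gaps_unlabeled_before_and_after": ([0.5, 1.0], [1.0, 1.5], 2.0),
    "no_gaps_unlabeled_after_only": ([0.0, 1.0], [1.0, 1.5], 2.0),
    "fully_labeled": ([0.0, 1.0], [1.0, 2.0], 2.0),
}
results = {name: has_unlabeled(*args) for name, args in cases.items()}
print(results)
```

Each entry in `cases` could become one parametrized unit test. With real data, a tolerance may be preferable to the exact comparisons used here.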

NickleDave · 2022-07-30T15:42:59Z

Rewrote a new version that uses the duration and tries to catch the edge cases in the comment above.
Now we only shave off about 30 ms, from ~84 ms to ~54 ms -- see attached PDF.
It starts to feel like it's not worth the effort, but maybe this adds up when we are working with a much larger dataset, like the canary song.
test-has-unlabeled-v2.pdf

Will stop obsessing and just make this minor change to close the issue

NickleDave · 2022-07-30T19:44:47Z

Still obsessing.
I don't like the csv module name but wasn't sure where else to put it.

After thinking more:

- We will have two types of datasets, basically, that map to the two types of annotation formats in crowsetta: sequence and bounding box.
- Therefore there should be two sub-packages in datasets: seq and bbox. The has_unlabeled (segments) function should live in seq. The seq can actually be a sub-sub-package? So that all the dataset classes can be inside that sub-sub-package in their own modules. I guess has_unlabeled should go in a separate module. Trying desperately not to use the name utils, to avoid that becoming a dumpster, so I'll call it validators 🤷 -- even though the function is not used to validate per se; it returns a boolean.

NickleDave changed the title ~~vak.csv.has_unlabeled takes a long time to run and can be made faster~~ make vak.csv.has_unlabeled return True the first time it finds a file with unlabeled Jul 31, 2020

NickleDave changed the title ~~make vak.csv.has_unlabeled return True the first time it finds a file with unlabeled~~ ENH: speed up vak.csv.has_unlabeled Jul 30, 2022

NickleDave self-assigned this Jul 30, 2022

NickleDave added the "ENH: enhancement" (enhancement; new feature or request) label Jul 30, 2022

NickleDave mentioned this issue Jul 30, 2022

ENH: add benchmark datasets (e.g. BFSongRepository) to datasets sub-package #446

Open

NickleDave added a commit that referenced this issue Jul 31, 2022

ENH: Add has_unlabeled function to vak.annotation

98740a3

as discussed in #243

NickleDave mentioned this issue Jul 31, 2022

Refactor "has unlabeled" #559

Merged

NickleDave closed this as completed in d0fd19c Aug 1, 2022

NickleDave added a commit that referenced this issue Aug 1, 2022

MRG: #559 from vocalpy/refactor-has-unlabeled

8bc3f51

Refactor "has unlabeled", fixes #243
