Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: speed up vak.csv.has_unlabeled #243

Closed
yardencsGitHub opened this issue Jul 26, 2020 · 5 comments
Closed

ENH: speed up vak.csv.has_unlabeled #243

yardencsGitHub opened this issue Jul 26, 2020 · 5 comments
Assignees
Labels
ENH: enhancement enhancement; new feature or request

Comments

@yardencsGitHub
Copy link
Collaborator

yardencsGitHub commented Jul 26, 2020

vak.csv.has_unlabeled goes over all files and looks for 'unlabeled'
The function can run faster if it just returns True the first time it finds a file with unlabeled

@NickleDave NickleDave changed the title vak.csv.has_unlabeled takes a long time to run and can be made faster make vak.csv.has_unlabeled return True the first time it finds a file with unlabeled Jul 31, 2020
@NickleDave NickleDave changed the title make vak.csv.has_unlabeled return True the first time it finds a file with unlabeled ENH: speed up vak.csv.has_unlabeled Jul 30, 2022
@NickleDave NickleDave self-assigned this Jul 30, 2022
@NickleDave NickleDave added the ENH: enhancement enhancement; new feature or request label Jul 30, 2022
@NickleDave
Copy link
Collaborator

Looking at this again.

  • I'm a bit paranoid about breaking something / someone's pipeline by assuming that we always want to "greedily" decide that has_unlabeled is True
  • but still, reading through the function, I think there is a faster way to determine if there are unlabeled segments
  • we should instead only load the annotations, measure the inter-segment intervals by subtracting onsets[1:] - offsets[:-1], and then testing whether any are greater than 0. In most cases this will be true, except if someone labeled silent intervals between vocalization segments.
  • If we measure inter-segment intervals this way we also need to test for unlabeled segments before and after all annotated segments.
  • But still this would be faster than what the function does now, even looping over everything, since we skip a slow-ish list comprehension + function call that does several other computations, replacing those with a single numpy array subtraction

I should test to see if this really improves anything but my guess is it will

  • This is the only function in this csv module
  • But it doesn't really have anything to do with a .csv, except that it needs a dataset .csv to get the annotation files. Would prefer to move to an annotation module? And get rid of this poorly-named csv module

@NickleDave
Copy link
Collaborator

It does look like we can cut the time about in half by simply using crowsetta.Annotation.seq alone, see attached pdf of a notebook

test-has-unlabeled.pdf

Timing is not affected by calling a "helper" function that runs logic on a single Annotation.

So:

  • add has_unlabeled to vak.annotation that operates on a single crowsetta.Annotation
  • add to "raise crowsetta" issue that this will need to be more specific, a seq annot, when we raise version
  • rename vak.csv -> vak.dataset, that clashes a bit with vak.datasets but writing vak.datasets.has_unlabeled is weird
  • rewrite vak.dataset.has_unlabeled to use vak.annotation.has_unlabeled
  • write tests for all the above

@NickleDave
Copy link
Collaborator

NickleDave commented Jul 30, 2022

Thinking about this more ☹️

Edge cases:

  • There are unlabeled intervals between segments, but the very first segment starts right at 0. So we can't just look at "is the first onset > 0" to determine whether there are any unlabeled segments.
  • There are no unlabeled intervals between segments, but (for some weird reason) the period before the first onset and the period after the last offset are unlabeled. This would not be detected by just asking whether there are any unlabeled intervals between the labeled segments
  • We can detect whether there's an unlabeled period before the first segment by just asking "is the first onset > 0". So this at least lets us detect the first half of the weird edge case with unlabeled periods before and after a continuous set of all segments labeled
  • What we can't do is detect where there's an unlabeled period after from the annotations alone, because these typically won't contain the total duration of the vocalization. We need to pass that in separately.
  • Make sure unit tests cover all these edge cases

@NickleDave
Copy link
Collaborator

Rewrote new version that we use duration and try to catch edge cases in comment above.
Now we only shave off about 30ms, from ~84 to ~54 -- see attached PDF.
Starts to feel like it's not worth the effort but maybe this adds up when we are working with a much larger data set, like the canary song.
test-has-unlabeled-v2.pdf

Will stop obsessing and just make this minor change to close the issue

@NickleDave
Copy link
Collaborator

Still obsessing.
I don't like the csv module name but wasn't sure where else to put it.

After thinking more:

  • we will have two types of datasets, basically, that map to the two types of annotation formats in crowsetta: sequence and bounding box
  • therefore there should be two sub-packages in datasets: seq and bbox. The has_unlabeled (segments) function should live in seq. The seq can actually be a sub-sub-package? So that all the dataset classes can be inside that sub-sub-package in their own modules. I guess has_unlabeled should go in a separate module. Trying desperately to not use the name utils to avoid that becoming a dumpster, so I'll call it validators 🤷 -- even though the function is not used to validate per se, it returns a boolean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ENH: enhancement enhancement; new feature or request
Projects
None yet
Development

No branches or pull requests

2 participants