
Make cache annotation optional #1332

Merged: 15 commits merged into master from feature/SG-742-make_cache_label_optional on Aug 1, 2023

Conversation

Louis-Dupont
Contributor

@Louis-Dupont commented Jul 30, 2023

Goal

Adding an option to run without caching.

Notes

Why is it more complex than it may seem at first?

  • Even when we don't want to cache, ignore_empty_annotations requires going over the whole dataset to find out which indexes are empty, in order to know the dataset length.
  • Caching cannot be done on the fly, because it doesn't work out of the box with multiprocessing (PyTorch workers).
  • Some transforms need to know the indexes of empty annotations, even when we want to run training on all samples. This can create a situation where we iterate over the whole dataset but sample randomly from a subset.
  • We used to access self.annotation in many different places in the code. If we add an option to work without a cache, this would create a lot of redundancy and be very inefficient, because loading a sample may involve many computations (see COCO's _load_annotation). The sketch below illustrates these constraints.
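A minimal sketch of the constraints above, assuming a hypothetical dataset class (the class name, the `_load_annotation` body, and the annotation format are illustrative, not the actual super-gradients implementation):

```python
# Hypothetical sketch, not the actual super-gradients code: it illustrates why
# ignore_empty_annotations forces a full pass over the dataset even when
# caching is disabled, and where a cache would slot in.
from torch.utils.data import Dataset


class SketchDetectionDataset(Dataset):
    def __init__(self, n_samples: int, cache_annotations: bool = True,
                 ignore_empty_annotations: bool = False):
        self._cache = None
        if cache_annotations:
            # One up-front pass; done in the main process because a lazily
            # built cache is not shared across PyTorch DataLoader workers.
            self._cache = [self._load_annotation(i) for i in range(n_samples)]
        if ignore_empty_annotations:
            # Even without caching we must visit every sample once, because
            # len(self) depends on knowing which indexes are empty.
            self._valid_ids = [i for i in range(n_samples)
                               if len(self._get_annotation(i)["boxes"]) > 0]
        else:
            self._valid_ids = list(range(n_samples))

    def _load_annotation(self, index: int) -> dict:
        # Placeholder for the expensive per-sample parsing (e.g. COCO JSON).
        return {"boxes": [], "labels": []}

    def _get_annotation(self, index: int) -> dict:
        if self._cache is not None:
            return self._cache[index]
        return self._load_annotation(index)  # recomputed on every access

    def __len__(self) -> int:
        return len(self._valid_ids)

    def __getitem__(self, index: int):
        return self._get_annotation(self._valid_ids[index])
```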

@BloodAxe
Contributor

So, let's first make it really, really clear what cache annotation does. I feel everyone has a different idea of what it does, so we should clarify EXACTLY what it does.

I believe it was initially intended to cache resized images on disk, but AFAIK this is never used nowadays, so that part of the code can be safely deleted as dead code (_get_cached_image).

Secondly, caching annotations is nothing but converting them from the format stored on disk (COCO JSON) into something more usable by the dataset itself. And this is where our problem comes from.

First issue: We never measured (benchmarked) how much time it takes to parse a COCO annotation file and to get a sample at a specified index. We should check whether increasing the dataset size from 1e3 to 1e4, 1e5, 1e6, 1e7 obeys O(n) for parsing and O(1) for accessing a sample. If it doesn't, that should be fixed.

Action point 1: Benchmark where we are now
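A minimal sketch of such a benchmark, assuming pycocotools and a COCO-format annotation file (the path and probe count are placeholders):

```python
# Sketch of "action point 1": time the annotation-file parse and the
# per-sample access separately, then repeat for growing dataset sizes to
# check O(n) parsing and O(1) access.
import time

from pycocotools.coco import COCO


def benchmark(annotation_file: str, n_probes: int = 1000) -> None:
    t0 = time.perf_counter()
    coco = COCO(annotation_file)  # parsing: should scale O(n)
    parse_s = time.perf_counter() - t0

    img_ids = coco.getImgIds()[:n_probes]
    t0 = time.perf_counter()
    for img_id in img_ids:
        coco.loadAnns(coco.getAnnIds(imgIds=img_id))  # access: should be O(1)
    access_us = (time.perf_counter() - t0) / len(img_ids) * 1e6

    print(f"parse: {parse_s:.2f}s, per-sample access: {access_us:.1f}µs")


# Run against files of increasing size (1e3 ... 1e7 images): parse time
# should grow linearly while per-sample access time stays flat.
benchmark("instances_train2017.json")  # placeholder path
```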

Second issue: I feel the whole concept of the base detection dataset class is plainly wrong and should be rewritten from scratch. It must receive numpy arrays of boxes/labels/sample ids.
That would make the base class much simpler and easier to understand.

Action point 2: I believe we should aim to make the base class as simple as possible:

  • drop _get_cached_image/_cache_images entirely as dead code.
  • remove ignore_empty_annotations from the base class. It is the derived classes that should respect this flag; the base class must act as storage for ready-to-use data and not apply any mapping/filtering under the hood.

That's my take on how it could be improved.
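A minimal sketch of what such a storage-only base class could look like, assuming the design described above (class and parameter names are illustrative, not an actual super-gradients API):

```python
# Hypothetical "storage only" base class: it receives ready-to-use numpy
# arrays and applies no caching, mapping, or filtering under the hood.
from typing import List, Tuple

import numpy as np


class SimpleDetectionDataset:
    def __init__(self, boxes: List[np.ndarray], labels: List[np.ndarray],
                 sample_ids: List[str]):
        if not (len(boxes) == len(labels) == len(sample_ids)):
            raise ValueError("boxes, labels and sample_ids must have equal length")
        self.boxes, self.labels, self.sample_ids = boxes, labels, sample_ids

    def __len__(self) -> int:
        return len(self.sample_ids)

    def __getitem__(self, index: int) -> Tuple[str, np.ndarray, np.ndarray]:
        # Derived classes decide how to drop empty samples, load images,
        # or apply transforms on top of this raw storage.
        return self.sample_ids[index], self.boxes[index], self.labels[index]
```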

@BloodAxe (Contributor) left a review comment

See my comment in the discussion

@Louis-Dupont
Contributor Author

> I believe it was initially intended to cache resized images on disk, but AFAIK this is never used nowadays, so that part of the code can be safely deleted as dead code (_get_cached_image).

I believe the original motivation for caching annotations was not only the image-caching case, BUT I completely agree about deleting image caching. And I also agree that the two are highly coupled, since the resizing of annotations is required by the image caching.

> First issue: We never measured (benchmarked) how much time it takes to parse a COCO annotation file and to get a sample at a specified index. We should check whether increasing the dataset size from 1e3 to 1e4, 1e5, 1e6, 1e7 obeys O(n) for parsing and O(1) for accessing a sample. If it doesn't, that should be fixed.

There are two things: 1. loading the COCO annotations, and 2. how we process these annotations.

  1. The first is currently done with pycocotools. I ran some tests to see if we can improve code efficiency.
    • Instantiation: the cost is almost exclusively due to opening a huge JSON file. We can improve it by using libraries such as orjson; for COCO on tzag8 it reduced loading from 15s to 9s. Not that huge, and I'm not sure it is worth adding a new library for this, but maybe for bigger datasets (see the sketch after this list).
    • Loading: this piece of code is definitely slower when using pycocotools compared to my own custom code (not pushed here). This most likely means it's not O(1), but I did not run a clean benchmark on different dataset sizes to make sure.
  2. For the second, I also don't have a proper benchmark. What I know is that, overall, it takes some time, especially when tight_box_rotation = True and with crowd annotations. I did not benchmark it, on the assumption that we want to keep the caching option anyway, but you are right that it can be questioned.
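A minimal sketch of the orjson idea mentioned above, assuming a COCO-format file (the path is a placeholder; feeding an externally parsed dict into an empty COCO object relies on pycocotools internals and should be verified against the installed version):

```python
# Sketch: parse the annotation JSON with orjson instead of the stdlib json
# used inside pycocotools, then hand the parsed dict to an empty COCO object.
import orjson
from pycocotools.coco import COCO

coco = COCO()  # empty instance; skips the built-in json.load
with open("instances_train2017.json", "rb") as f:  # placeholder path
    coco.dataset = orjson.loads(f.read())
coco.createIndex()  # build the img/ann lookup tables as COCO(path) would
```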

> Second issue: I feel the whole concept of the base detection dataset class is plainly wrong and should be rewritten from scratch. It must receive numpy arrays of boxes/labels/sample ids.
> That would make the base class much simpler and easier to understand.

I agree that it should be rewritten, but the main issue is backward compatibility.
More precisely, because of the wrong design, any class inheriting from DetectionDataset is highly coupled to that design, since it is required to implement _load_annotations. Changing this would be a huge breaking change for those classes, as it affects not only the input parameters but also, and mainly, the code implementation. The sketch below illustrates the coupling.
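A hypothetical illustration of that coupling (names are illustrative; the real override point is the _load_annotation hook on DetectionDataset):

```python
# Why redesigning the base class is a breaking change: user code overrides
# the loading hook, so the break is in method implementations, not just in
# constructor arguments.
class DetectionDatasetLike:
    def __init__(self, n_samples: int):
        # The base class drives loading through the subclass hook:
        self.annotations = [self._load_annotation(i) for i in range(n_samples)]

    def _load_annotation(self, index: int) -> dict:
        raise NotImplementedError  # the contract every subclass implements


class UserCustomDataset(DetectionDatasetLike):
    def _load_annotation(self, index: int) -> dict:
        # User code written against the old contract; a redesign where the
        # base class receives numpy arrays would invalidate this method.
        return {"boxes": [], "labels": []}
```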

@shaydeci
Contributor

shaydeci commented Jul 31, 2023

> So, let's first make it really, really clear what cache annotation does. I feel everyone has a different idea of what it does, so we should clarify EXACTLY what it does.
>
> I believe it was initially intended to cache resized images on disk, but AFAIK this is never used nowadays, so that part of the code can be safely deleted as dead code (_get_cached_image).
>
> Secondly, caching annotations is nothing but converting them from the format stored on disk (COCO JSON) into something more usable by the dataset itself. And this is where our problem comes from.
>
> First issue: We never measured (benchmarked) how much time it takes to parse a COCO annotation file and to get a sample at a specified index. We should check whether increasing the dataset size from 1e3 to 1e4, 1e5, 1e6, 1e7 obeys O(n) for parsing and O(1) for accessing a sample. If it doesn't, that should be fixed.
>
> Action point 1: Benchmark where we are now
>
> Second issue: I feel the whole concept of the base detection dataset class is plainly wrong and should be rewritten from scratch. It must receive numpy arrays of boxes/labels/sample ids. That would make the base class much simpler and easier to understand.
>
> Action point 2: I believe we should aim to make the base class as simple as possible:
>
> • drop _get_cached_image/_cache_images entirely as dead code.
> • remove ignore_empty_annotations from the base class. It is the derived classes that should respect this flag; the base class must act as storage for ready-to-use data and not apply any mapping/filtering under the hood.
>
> That's my take on how it could be improved.

I don't see your point about the first issue.

  1. "something more usable by the dataset": I witnessed myself that for small datasets there is value in caching them (i.e. when the RAM is not full), while for larger ones the swap memory gets full, and at that point there is no benefit.

  2. Why does simplifying the base class have to be in this PR?
    @BloodAxe

@BloodAxe (Contributor) left a review comment

LGTM

@Louis-Dupont merged commit 7afa83d into master on Aug 1, 2023
@Louis-Dupont deleted the feature/SG-742-make_cache_label_optional branch on August 1, 2023 at 12:08