feat(ml): composable ml #9973
Conversation
- simplify
- server fixes
- fix typing
- formatting and typing
- rename
- fix detection-only response
- no need for defaultdict
- update api
- linting
I'm going to approve this; I don't think any of us will really be able to review it properly. IMO, if you have tested this and are confident in the change, go ahead and merge it.
LGTM
Description
This PR addresses some limitations of the ML service design. Currently, detection and recognition models for facial recognition are bundled in the same class, and likewise for textual and visual CLIP models. As a result, they duplicate certain shared behaviors for their own set of models. Moreover, there is no good way to choose particular detection and recognition models. This is a big limitation for OCR, as it is common to mix and match unrelated detection and recognition models. Lastly, there is no way to query an individual detection or recognition model. CLIP models have a custom `mode` option to do this, but it is specific to that model task.

This PR redesigns the ML service such that each model session is its own class, and broader tasks like facial recognition are modeled as dependencies (recognition being dependent on detection, etc.). This lays a solid foundation for a composable set of models with separate settings for each.
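To illustrate the idea (a minimal sketch, not the actual implementation; the class and method names here are assumptions), a recognizer can take its detector as an explicit dependency, so each session stays its own class while the broader task composes them:

```python
# Sketch of "tasks as dependencies" (hypothetical names, not immich's code):
# each model session is its own class; facial recognition composes a
# recognizer on top of the detector it depends on.
class FaceDetector:
    def predict(self, image) -> list[dict]:
        # stand-in for a detection session; returns bounding boxes
        return [{"box": (0, 0, 112, 112)}]


class FaceRecognizer:
    def __init__(self, detector: FaceDetector) -> None:
        # the dependency is explicit instead of being bundled in one class
        self.detector = detector

    def predict(self, image) -> list[dict]:
        faces = self.detector.predict(image)
        # one embedding per detected face (dummy values here)
        return [{"box": f["box"], "embedding": [0.0] * 512} for f in faces]


recognizer = FaceRecognizer(FaceDetector())
results = recognizer.predict(object())
```

Because the detector is injected, swapping in a different detection model (as is common for OCR) requires no change to the recognizer.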
As part of this change, the API has also been updated. A given request can look like this:
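The original example payload was not captured here; the following is an illustrative sketch only, with assumed field and model names: each top-level key selects a task, and each task names the model sessions it should use.

```python
# Hypothetical request body (field and model names are assumptions,
# not the exact API): tasks select their own detection/recognition models.
request_entries = {
    "facial-recognition": {
        "detection": {"modelName": "buffalo_l"},
        "recognition": {"modelName": "buffalo_l"},
    },
}
```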
And the response can look like this:
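The original response example was also not captured; a plausible shape, keyed by the same task names as the request (all field names here are assumptions), might be:

```python
# Hypothetical response shape (illustrative only): one entry per requested
# task, with per-face detection and recognition outputs combined.
response = {
    "facial-recognition": [
        {
            "boundingBox": {"x1": 12, "y1": 8, "x2": 104, "y2": 110},
            "score": 0.97,
            "embedding": [0.01] * 512,  # recognition output for this face
        }
    ],
}
```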
Some implementation notes:
For facial recognition, a side effect of this change is drastically better performance when there are multiple faces. This is because they are handled in one model pass instead of fed sequentially. With 10 faces in an image, it was 70%+ faster on CPU and 6x faster on GPU. A smaller gain comes from always using Pillow to decode images, as it is several times faster than OpenCV.
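The batching change described above can be sketched as follows (a toy stand-in for the recognition model, not the real inference code): instead of calling the model once per face crop, the crops are stacked into a single batch and handled in one pass.

```python
import numpy as np

# Illustrative sketch: N face crops recognized in one batched model pass
# instead of N sequential passes. embed() is a dummy stand-in for an
# ONNX recognition session mapping (N, 112, 112, 3) -> (N, 512).
def embed(batch: np.ndarray) -> np.ndarray:
    return batch.reshape(batch.shape[0], -1)[:, :512].astype(np.float32)

faces = [np.random.rand(112, 112, 3).astype(np.float32) for _ in range(10)]

# sequential: one model call per face (the old behavior)
seq = np.concatenate([embed(face[None]) for face in faces])

# batched: a single call over the stacked crops (the new behavior)
batched = embed(np.stack(faces))

assert np.allclose(seq, batched)  # same results, far fewer model calls
```

The speedup comes from amortizing per-call overhead and letting the runtime parallelize across the batch, which is why the gain grows with the number of faces and is largest on GPU.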
While unlikely to be relevant in practice, a nice perk of this change is that it allows one to query any number of tasks at once, like so:
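The concrete example was not captured here; as an illustrative sketch (task and model names are assumptions), a combined request would simply list several tasks side by side:

```python
# Hypothetical multi-task request (illustrative names): unrelated tasks
# queried together in a single request.
multi_task_request = {
    "clip": {"visual": {"modelName": "ViT-B-32__openai"}},
    "facial-recognition": {
        "detection": {"modelName": "buffalo_l"},
        "recognition": {"modelName": "buffalo_l"},
    },
}
```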