feat(ml): composable ml #9973

Merged
merged 27 commits into from
Jun 7, 2024

Conversation

mertalev
Contributor

@mertalev mertalev commented Jun 4, 2024

Description

This PR addresses some limitations of the ML service design. Currently, the detection and recognition models for facial recognition are bundled in the same class, as are the textual and visual CLIP models. As a result, each bundled class re-implements behavior that could be shared across models. Moreover, there is no good way to choose the detection and recognition models independently. This is a big limitation for OCR, where it is common to mix and match unrelated detection and recognition models. Lastly, there is no general way to query an individual detection or recognition model. CLIP models have a custom mode option for this, but it is specific to that model task.

This PR redesigns the ML service such that each model session is its own class, and broader tasks like facial recognition are modeled as dependencies (recognition being dependent on detection, etc.). This lays a solid foundation for a composable set of models with separate settings for each.
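
A minimal sketch of what "one class per model session, with tasks modeled as dependencies" could look like. The class names, the `depends` attribute, and the constructor signature are hypothetical and chosen for illustration; they are not taken from the PR's code.

```python
from typing import Any, ClassVar


class Model:
    """Hypothetical base class: each model session is its own class."""

    task: ClassVar[str]
    model_type: ClassVar[str]
    # (task, model_type) pairs whose output this model consumes (assumed shape).
    depends: ClassVar[tuple[tuple[str, str], ...]] = ()

    def __init__(self, model_name: str, **options: Any) -> None:
        self.model_name = model_name
        self.options = options


class FaceDetector(Model):
    task = "facial-recognition"
    model_type = "detection"


class FaceRecognizer(Model):
    task = "facial-recognition"
    model_type = "recognition"
    # Recognition depends on the output of detection.
    depends = (("facial-recognition", "detection"),)


class ClipVisualEncoder(Model):
    task = "clip"
    model_type = "visual"


# Each model carries its own name and settings.
detector = FaceDetector("buffalo_l")
recognizer = FaceRecognizer("buffalo_l", minScore=0.5)
```

Declaring dependencies as data on the class keeps shared behavior in the base class and lets each model be configured or queried separately.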

As part of this change, the API has also been updated. A given request can look like this:

{
  "facial-recognition": {
    "detection": {
      "modelName": "buffalo_l"
    },
    "recognition": {
      "modelName": "buffalo_l",
      "options": {
        "minScore": 0.5
      }
    }
  }
}
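
As an illustration of the request shape above, the nested payload can be flattened into one entry per model. This is only a sketch: `ModelEntry` and `parse_entries` are hypothetical names, and the assumption that `options` may be omitted is based solely on the `detection` entry in the example.

```python
import json
from typing import Any, NamedTuple


class ModelEntry(NamedTuple):
    task: str          # e.g. "facial-recognition"
    model_type: str    # e.g. "detection" or "recognition"
    name: str          # value of "modelName"
    options: dict[str, Any]


def parse_entries(raw: str) -> list[ModelEntry]:
    """Flatten {task: {type: {modelName, options}}} into a list of entries."""
    payload = json.loads(raw)
    entries = []
    for task, types in payload.items():
        for model_type, spec in types.items():
            # "options" is treated as optional, as in the example request.
            options = spec.get("options", {})
            entries.append(ModelEntry(task, model_type, spec["modelName"], options))
    return entries


request = """
{
  "facial-recognition": {
    "detection": {"modelName": "buffalo_l"},
    "recognition": {"modelName": "buffalo_l", "options": {"minScore": 0.5}}
  }
}
"""
entries = parse_entries(request)
```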

And the response can look like this:

{
  "facial-recognition": [
    {
      "boundingBox": {
        "x1": 463.0,
        "y1": 133.0,
        "x2": 763.0,
        "y2": 526.0
      },
      "embedding": "vector(512)",
      "score": 0.89526224
    }
  ],
  "imageHeight": 1440,
  "imageWidth": 1152
}
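
A small sketch of how a caller might consume a response of this shape. The helper names are hypothetical, and `"vector(512)"` is kept as the placeholder string used in the example rather than a real embedding.

```python
from typing import Any, TypedDict


class BoundingBox(TypedDict):
    x1: float
    y1: float
    x2: float
    y2: float


def box_size(box: BoundingBox) -> tuple[float, float]:
    """Return (width, height) of a box given by its two corners."""
    return box["x2"] - box["x1"], box["y2"] - box["y1"]


def faces_at_or_above(response: dict[str, Any], min_score: float) -> list[dict[str, Any]]:
    """Keep only faces whose score is at least min_score."""
    faces = response.get("facial-recognition", [])
    return [face for face in faces if face["score"] >= min_score]


response = {
    "facial-recognition": [
        {
            "boundingBox": {"x1": 463.0, "y1": 133.0, "x2": 763.0, "y2": 526.0},
            "embedding": "vector(512)",  # placeholder, as in the example above
            "score": 0.89526224,
        }
    ],
    "imageHeight": 1440,
    "imageWidth": 1152,
}
width, height = box_size(response["facial-recognition"][0]["boundingBox"])
```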

Some implementation notes:

  • A given request can only be visual or textual; it cannot have multiple modalities at the same time
  • Images are only decoded once and the decoded data is shared across all models
  • Dependencies can only go one level deep; a model can depend on the output of multiple models, but those models must not have any dependencies
  • If a model has a dependency, it is the caller's responsibility to ensure this dependency is included in the request
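
The dependency and decoding rules above can be sketched roughly as follows. The `DEPENDENCIES` registry, the function names, and the runner signature are all hypothetical; this illustrates the constraints, not the PR's implementation.

```python
from typing import Any, Callable

Key = tuple[str, str]  # (task, model_type)

# Hypothetical registry of which models each model depends on.
DEPENDENCIES: dict[Key, list[Key]] = {
    ("facial-recognition", "detection"): [],
    ("facial-recognition", "recognition"): [("facial-recognition", "detection")],
    ("clip", "visual"): [],
}


def split_by_dependency(requested: list[Key]) -> tuple[list[Key], list[Key]]:
    """Return (independent, dependent) models, enforcing the rules above."""
    requested_set = set(requested)
    without_deps, with_deps = [], []
    for key in requested:
        deps = DEPENDENCIES[key]
        for dep in deps:
            # The caller must include every dependency in the request.
            if dep not in requested_set:
                raise ValueError(f"{key} requires {dep} in the same request")
            # Dependencies may only go one level deep.
            if DEPENDENCIES[dep]:
                raise ValueError(f"{dep} must not have dependencies of its own")
        (with_deps if deps else without_deps).append(key)
    return without_deps, with_deps


def run(
    requested: list[Key],
    decode: Callable[[bytes], Any],
    runners: dict[Key, Callable[[Any, dict[Key, Any]], Any]],
    payload: bytes,
) -> dict[Key, Any]:
    without_deps, with_deps = split_by_dependency(requested)
    decoded = decode(payload)  # decoded once, then shared by every model
    outputs: dict[Key, Any] = {}
    for key in without_deps:
        outputs[key] = runners[key](decoded, {})
    for key in with_deps:
        dep_outputs = {dep: outputs[dep] for dep in DEPENDENCIES[key]}
        outputs[key] = runners[key](decoded, dep_outputs)
    return outputs
```

Because dependencies are limited to one level, two passes (independent models first, dependent models second) are enough and no general topological sort is needed.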

For facial recognition, a side effect of this change is drastically better performance when an image contains multiple faces. This is because all faces are now handled in one model pass instead of being fed to the model sequentially. With 10 faces in an image, it was 70%+ faster on CPU and 6x faster on GPU. A smaller gain comes from always using Pillow to decode images, as it is several times faster than OpenCV.
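
To illustrate the difference between feeding faces sequentially and handling them in one pass, here is a toy sketch. `CountingSession` is a hypothetical stand-in for an inference session that only counts passes; it does not measure or reproduce the speedups quoted above.

```python
class CountingSession:
    """Stand-in for a model session; counts how many passes were run."""

    def __init__(self) -> None:
        self.passes = 0

    def run(self, batch: list[list[float]]) -> list[float]:
        self.passes += 1
        # Toy "embedding": one number per face crop.
        return [sum(crop) for crop in batch]


def embed_sequential(session: CountingSession, crops: list[list[float]]) -> list[float]:
    # One model pass per face.
    return [session.run([crop])[0] for crop in crops]


def embed_batched(session: CountingSession, crops: list[list[float]]) -> list[float]:
    # All faces in a single model pass.
    return session.run(crops) if crops else []


crops = [[float(i), float(i + 1)] for i in range(10)]
sequential_session, batched_session = CountingSession(), CountingSession()
sequential_out = embed_sequential(sequential_session, crops)
batched_out = embed_batched(batched_session, crops)
```

Both paths produce the same outputs; the batched path simply makes one call where the sequential path makes one per face.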

While unlikely to be relevant in practice, a nice perk of this change is that it allows one to query any number of tasks at once, like so:

{
  "clip": {
    "visual": {
      "modelName": "ViT-B-32__openai"
    }
  },
  "facial-recognition": {
    "detection": {
      "modelName": "buffalo_l"
    },
    "recognition": {
      "modelName": "buffalo_l",
      "options": {
        "minScore": 0.5
      }
    }
  },
  "ocr": {
    "detection": {
      "modelName": "ch_ppocr_v4",
      "options": {
        "minScore": 0.5
      }
    },
    "recognition": {
      "modelName": "ch_ppocr_v4",
      "options": {
        "minScore": 0.3
      }
    }
  }
}

@mertalev mertalev force-pushed the refactor/composable-ml branch from 3961541 to d3a43ca Compare June 4, 2024 23:07
@mertalev mertalev changed the title refactor(server, ml): composable ml refactor(ml): composable ml Jun 4, 2024
mertalev added 14 commits June 4, 2024 20:16
fixes

remove unnecessary interface

support text input, cleanup
server fixes

fix typing
update locustfile

fixes
formatting and typing

rename
fix type

actually fix typing
fix detection-only response

no need for defaultdict
update api

linting
@mertalev mertalev force-pushed the refactor/composable-ml branch from 28375f6 to 308966d Compare June 5, 2024 00:17
@mertalev mertalev changed the title refactor(ml): composable ml feat(ml): composable ml Jun 5, 2024
@immich-app immich-app deleted a comment Jun 5, 2024
Contributor

@zackpollard zackpollard left a comment

I'm going to approve this. I don't think any of us are really going to be able to review this properly. IMO, if you have tested this and are confident in the change, go ahead and merge it.

Contributor

@jrasm91 jrasm91 left a comment

LGTM

@mertalev mertalev enabled auto-merge (squash) June 7, 2024 03:09
@mertalev mertalev merged commit 2b1b43a into main Jun 7, 2024
22 checks passed
@mertalev mertalev deleted the refactor/composable-ml branch June 7, 2024 03:09
3 participants