Add support for grounding dino (#1137)
* Add sum, norm, normalize unit tests

* Add min/max unit tests

* make tests synchronous

* Cleanup

* Update mean op unit tests

* Add more tensor unit tests

* Update view unit test

* Add tensor construction unit tests

* Add more tensor op unit tests

* Add another squeeze unit test

* Multiple dims for squeeze unit test

* Refactor tensor reduce ops

* Add support for `gt` and `lt` tensor ops

* Add grounding dino implementation

* Allow grounding dino to be usable via the pipeline API

* Add listed support for grounding dino

* Add grounding dino unit tests

* Add zero-shot object detection pipeline unit test for grounding dino
xenova authored Jan 15, 2025
1 parent f126091 commit a938a56
Showing 15 changed files with 915 additions and 274 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -335,6 +335,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
1. **[Granite](https://huggingface.co/docs/transformers/main/model_doc/granite)** (from IBM) released with the paper [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda.
+1. **[Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino)** (from IDEA-Research) released with the paper [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499) by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -50,6 +50,7 @@
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
1. **[Granite](https://huggingface.co/docs/transformers/main/model_doc/granite)** (from IBM) released with the paper [Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler](https://arxiv.org/abs/2408.13359) by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda.
+1. **[Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino)** (from IDEA-Research) released with the paper [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499) by Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang.
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.
1. **[Hiera](https://huggingface.co/docs/transformers/model_doc/hiera)** (from Meta) released with the paper [Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles](https://arxiv.org/pdf/2306.00989) by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
2 changes: 1 addition & 1 deletion src/base/image_processors_utils.js
@@ -68,7 +68,7 @@ function enforce_size_divisibility([width, height], divisor) {
 * @param {number[]} arr The coordinate for the center of the box and its width, height dimensions (center_x, center_y, width, height)
 * @returns {number[]} The coordinates for the top-left and bottom-right corners of the box (top_left_x, top_left_y, bottom_right_x, bottom_right_y)
 */
-function center_to_corners_format([centerX, centerY, width, height]) {
+export function center_to_corners_format([centerX, centerY, width, height]) {
    return [
        centerX - width / 2,
        centerY - height / 2,
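A worked example of the conversion (the function body is truncated above; the remaining two entries are the bottom-right corner, `centerX + width / 2` and `centerY + height / 2`):

```js
// [center_x, center_y, width, height] -> [top_left_x, top_left_y, bottom_right_x, bottom_right_y]
center_to_corners_format([0.5, 0.5, 0.2, 0.4]);
// => [0.4, 0.3, 0.6, 0.7]
```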
11 changes: 11 additions & 0 deletions src/base/processing_utils.js
@@ -101,6 +101,17 @@ export class Processor extends Callable {
return this.tokenizer.batch_decode(...args);
}

+    /**
+     * @param {Parameters<PreTrainedTokenizer['decode']>} args
+     * @returns {ReturnType<PreTrainedTokenizer['decode']>}
+     */
+    decode(...args) {
+        if (!this.tokenizer) {
+            throw new Error('Unable to decode without a tokenizer.');
+        }
+        return this.tokenizer.decode(...args);
+    }


/**
* Calls the feature_extractor function with the given input.
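This mirrors the `batch_decode` delegation just above it. A minimal usage sketch, assuming a checkpoint whose processor bundles a tokenizer (the model id and token ids are illustrative):

```js
import { AutoProcessor } from '@huggingface/transformers';

const processor = await AutoProcessor.from_pretrained('onnx-community/grounding-dino-tiny-ONNX');

// Forwards to processor.tokenizer.decode(...)
const text = processor.decode([101, 1037, 4937, 102], { skip_special_tokens: true });

// A processor constructed without a tokenizer throws instead:
// Error: Unable to decode without a tokenizer.
```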
22 changes: 17 additions & 5 deletions src/models.js
@@ -532,14 +532,23 @@ async function encoderForward(self, model_inputs) {
        encoderFeeds.inputs_embeds = await self.encode_text({ input_ids: model_inputs.input_ids });
    }
    if (session.inputNames.includes('token_type_ids') && !encoderFeeds.token_type_ids) {
+        if (!encoderFeeds.input_ids) {
+            throw new Error('Both `input_ids` and `token_type_ids` are missing in the model inputs.');
+        }
        // Assign default `token_type_ids` (all zeroes) to the `encoderFeeds` if the model expects it,
        // but they weren't created by the tokenizer.
-        encoderFeeds.token_type_ids = new Tensor(
-            'int64',
-            new BigInt64Array(encoderFeeds.input_ids.data.length),
-            encoderFeeds.input_ids.dims
-        )
+        encoderFeeds.token_type_ids = zeros_like(encoderFeeds.input_ids);
    }
+    if (session.inputNames.includes('pixel_mask') && !encoderFeeds.pixel_mask) {
+        if (!encoderFeeds.pixel_values) {
+            throw new Error('Both `pixel_values` and `pixel_mask` are missing in the model inputs.');
+        }
+        // Assign default `pixel_mask` (all ones) to the `encoderFeeds` if the model expects it,
+        // but they weren't created by the processor.
+        const dims = encoderFeeds.pixel_values.dims;
+        encoderFeeds.pixel_mask = ones([dims[0], dims[2], dims[3]]);
+    }

    return await sessionRun(session, encoderFeeds);
}
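A sketch of what these defaults produce, assuming the tensor helpers from `src/utils/tensor.js` (`zeros_like`, `ones`) are re-exported from the package root, as `Tensor` is:

```js
import { Tensor, zeros_like, ones } from '@huggingface/transformers';

// Default token_type_ids: int64 zeros with the same dims as input_ids.
const input_ids = new Tensor('int64', new BigInt64Array([101n, 2023n, 102n]), [1, 3]);
const token_type_ids = zeros_like(input_ids); // dims [1, 3], all zeroes

// Default pixel_mask: all ones, dropping the channel dim of pixel_values.
const pixel_values_dims = [1, 3, 224, 224]; // [batch, channels, height, width]
const pixel_mask = ones([pixel_values_dims[0], pixel_values_dims[2], pixel_values_dims[3]]); // dims [1, 224, 224]
```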

@@ -5428,6 +5437,8 @@ export class Dinov2WithRegistersForImageClassification extends Dinov2WithRegiste
}
}
//////////////////////////////////////////////////
+export class GroundingDinoPreTrainedModel extends PreTrainedModel { }
+export class GroundingDinoForObjectDetection extends GroundingDinoPreTrainedModel { }

//////////////////////////////////////////////////
export class YolosPreTrainedModel extends PreTrainedModel { }
@@ -7338,6 +7349,7 @@ const MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = new Map([
const MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = new Map([
['owlvit', ['OwlViTForObjectDetection', OwlViTForObjectDetection]],
['owlv2', ['Owlv2ForObjectDetection', Owlv2ForObjectDetection]],
+['grounding-dino', ['GroundingDinoForObjectDetection', GroundingDinoForObjectDetection]],
]);

const MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES = new Map([
29 changes: 29 additions & 0 deletions src/models/grounding_dino/image_processing_grounding_dino.js
@@ -0,0 +1,29 @@

import {
ImageProcessor,
} from "../../base/image_processors_utils.js";
import { ones } from '../../utils/tensor.js';


/**
* @typedef {object} GroundingDinoFeatureExtractorResultProps
* @property {import('../../utils/tensor.js').Tensor} pixel_mask
* @typedef {import('../../base/image_processors_utils.js').ImageProcessorResult & GroundingDinoFeatureExtractorResultProps} GroundingDinoFeatureExtractorResult
*/

export class GroundingDinoImageProcessor extends ImageProcessor {
/**
* Calls the feature extraction process on an array of images, preprocesses
* each image, and concatenates the resulting features into a single Tensor.
* @param {import('../../utils/image.js').RawImage[]} images The image(s) to extract features from.
* @returns {Promise<GroundingDinoFeatureExtractorResult>} An object containing the concatenated pixel values of the preprocessed images.
*/
async _call(images) {
const result = await super._call(images);

const dims = result.pixel_values.dims;
const pixel_mask = ones([dims[0], dims[2], dims[3]]);

return { ...result, pixel_mask };
}
}
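A usage sketch of the image processor on its own (the checkpoint id is illustrative). The mask has one entry per pixel: the channel dimension of `pixel_values` is dropped, and all-ones means "no padding":

```js
import { AutoImageProcessor, RawImage } from '@huggingface/transformers';

const image_processor = await AutoImageProcessor.from_pretrained('onnx-community/grounding-dino-tiny-ONNX');

const image = await RawImage.read('http://images.cocodataset.org/val2017/000000039769.jpg');
const { pixel_values, pixel_mask } = await image_processor(image);

console.log(pixel_values.dims); // [1, 3, height, width]
console.log(pixel_mask.dims);   // [1, height, width]
```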
101 changes: 101 additions & 0 deletions src/models/grounding_dino/processing_grounding_dino.js
@@ -0,0 +1,101 @@
import { Processor } from "../../base/processing_utils.js";
import { AutoImageProcessor } from "../auto/image_processing_auto.js";
import { AutoTokenizer } from "../../tokenizers.js";
import { center_to_corners_format } from "../../base/image_processors_utils.js";

/**
* Get token ids of phrases from posmaps and input_ids.
* @param {import('../../utils/tensor.js').Tensor} posmaps A boolean tensor of unbatched text-thresholded logits related to the detected bounding boxes of shape `(hidden_size, )`.
* @param {import('../../utils/tensor.js').Tensor} input_ids A tensor of token ids of shape `(sequence_length, )`.
*/
function get_phrases_from_posmap(posmaps, input_ids) {

const left_idx = 0;
const right_idx = posmaps.dims.at(-1) - 1;

const posmaps_list = posmaps.tolist();
posmaps_list.fill(false, 0, left_idx + 1);
posmaps_list.fill(false, right_idx);

const input_ids_list = input_ids.tolist();
return posmaps_list
.map((val, idx) => val ? idx : null)
.filter(idx => idx !== null)
.map(i => input_ids_list[i]);
}

export class GroundingDinoProcessor extends Processor {
static tokenizer_class = AutoTokenizer
static image_processor_class = AutoImageProcessor

/**
* @typedef {import('../../utils/image.js').RawImage} RawImage
*/
/**
*
* @param {RawImage|RawImage[]|RawImage[][]} images
* @param {string|string[]} text
* @returns {Promise<any>}
*/
async _call(images, text, options = {}) {

const image_inputs = images ? await this.image_processor(images, options) : {};
const text_inputs = text ? this.tokenizer(text, options) : {};

return {
...text_inputs,
...image_inputs,
}
}
post_process_grounded_object_detection(outputs, input_ids, {
box_threshold = 0.25,
text_threshold = 0.25,
target_sizes = null
} = {}) {
const { logits, pred_boxes } = outputs;
const batch_size = logits.dims[0];

if (target_sizes !== null && target_sizes.length !== batch_size) {
throw Error("Make sure that you pass in as many target sizes as the batch dimension of the logits")
}
const num_queries = logits.dims.at(1);

const probs = logits.sigmoid(); // (batch_size, num_queries, 256)
const scores = probs.max(-1).tolist(); // (batch_size, num_queries)

// Convert to [x0, y0, x1, y1] format
const boxes = pred_boxes.tolist() // (batch_size, num_queries, 4)
.map(batch => batch.map(box => center_to_corners_format(box)));

const results = [];
for (let i = 0; i < batch_size; ++i) {
const target_size = target_sizes !== null ? target_sizes[i] : null;

// Convert from relative [0, 1] to absolute [0, height] coordinates
if (target_size !== null) {
boxes[i] = boxes[i].map(box => box.map((x, j) => x * target_size[(j + 1) % 2]));
}

const batch_scores = scores[i];
const final_scores = [];
const final_phrases = [];
const final_boxes = [];
for (let j = 0; j < num_queries; ++j) {
const score = batch_scores[j];
if (score <= box_threshold) {
continue;
}
const box = boxes[i][j];
const prob = probs[i][j];

final_scores.push(score);
final_boxes.push(box);

const phrases = get_phrases_from_posmap(prob.gt(text_threshold), input_ids[i]);
final_phrases.push(phrases);
}
results.push({ scores: final_scores, boxes: final_boxes, labels: this.batch_decode(final_phrases) });
}
return results;
}
}
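End-to-end, the processor feeds the model and then decodes its outputs: `probs.max(-1)` scores each query box, `prob.gt(text_threshold)` builds the boolean posmap consumed by `get_phrases_from_posmap`, and `batch_decode` turns the surviving token ids into label strings. A usage sketch (the checkpoint id is illustrative):

```js
import { GroundingDinoForObjectDetection, AutoProcessor, RawImage } from '@huggingface/transformers';

const model_id = 'onnx-community/grounding-dino-tiny-ONNX';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await GroundingDinoForObjectDetection.from_pretrained(model_id);

const image = await RawImage.read('http://images.cocodataset.org/val2017/000000039769.jpg');
const text = 'a cat.'; // queries are lowercase phrases ending with a dot

const inputs = await processor(image, text);
const outputs = await model(inputs);

const [result] = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    {
        box_threshold: 0.3,
        text_threshold: 0.3,
        target_sizes: [[image.height, image.width]], // rescales boxes to absolute pixel coordinates
    },
);
console.log(result); // { scores: [...], boxes: [[x0, y0, x1, y1], ...], labels: ['a cat', ...] }
```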
1 change: 1 addition & 0 deletions src/models/image_processors.js
@@ -10,6 +10,7 @@ export * from './donut/image_processing_donut.js'
export * from './dpt/image_processing_dpt.js'
export * from './efficientnet/image_processing_efficientnet.js'
export * from './glpn/image_processing_glpn.js'
+export * from './grounding_dino/image_processing_grounding_dino.js'
export * from './idefics3/image_processing_idefics3.js'
export * from './janus/image_processing_janus.js'
export * from './jina_clip/image_processing_jina_clip.js'
5 changes: 3 additions & 2 deletions src/models/processors.js
@@ -1,9 +1,10 @@
export * from './florence2/processing_florence2.js';
-export * from './mgp_str/processing_mgp_str.js';
-export * from './moonshine/processing_moonshine.js';
+export * from './grounding_dino/processing_grounding_dino.js';
export * from './idefics3/processing_idefics3.js';
export * from './janus/processing_janus.js';
export * from './jina_clip/processing_jina_clip.js';
+export * from './mgp_str/processing_mgp_str.js';
+export * from './moonshine/processing_moonshine.js';
export * from './owlvit/processing_owlvit.js';
export * from './phi3_v/processing_phi3_v.js';
export * from './paligemma/processing_paligemma.js';
36 changes: 29 additions & 7 deletions src/pipelines.js
@@ -2553,13 +2553,35 @@ export class ZeroShotObjectDetectionPipeline extends (/** @type {new (options: T
        // Run model with both text and pixel inputs
        const output = await this.model({ ...text_inputs, pixel_values });

-        // @ts-ignore
-        const processed = this.processor.image_processor.post_process_object_detection(output, threshold, imageSize, true)[0];
-        let result = processed.boxes.map((box, i) => ({
-            score: processed.scores[i],
-            label: candidate_labels[processed.classes[i]],
-            box: get_bounding_box(box, !percentage),
-        })).sort((a, b) => b.score - a.score);
+        let result;
+        if('post_process_grounded_object_detection' in this.processor) {
+            // @ts-ignore
+            const processed = this.processor.post_process_grounded_object_detection(
+                output,
+                text_inputs.input_ids,
+                {
+                    // TODO: support separate threshold values
+                    box_threshold: threshold,
+                    text_threshold: threshold,
+                    target_sizes: imageSize,
+                },
+            )[0];
+            result = processed.boxes.map((box, i) => ({
+                score: processed.scores[i],
+                label: processed.labels[i],
+                box: get_bounding_box(box, !percentage),
+            }))
+        } else {
+            // @ts-ignore
+            const processed = this.processor.image_processor.post_process_object_detection(output, threshold, imageSize, true)[0];
+            result = processed.boxes.map((box, i) => ({
+                score: processed.scores[i],
+                label: candidate_labels[processed.classes[i]],
+                box: get_bounding_box(box, !percentage),
+            }))
+        }
+        result.sort((a, b) => b.score - a.score);

        if (top_k !== null) {
            result = result.slice(0, top_k);
        }
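With this fallback in place, grounding dino checkpoints work through the existing task API; note that the single `threshold` option currently serves as both `box_threshold` and `text_threshold` (see the TODO above). A usage sketch (the checkpoint id is illustrative):

```js
import { pipeline } from '@huggingface/transformers';

const detector = await pipeline('zero-shot-object-detection', 'onnx-community/grounding-dino-tiny-ONNX');

const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const candidate_labels = ['a cat.'];
const output = await detector(url, candidate_labels, { threshold: 0.3 });
// [{ score: ..., label: 'a cat', box: { xmin, ymin, xmax, ymax } }, ...]
```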