Format of JSON ActivityNet-Entities annotation files
### for anet_entities_cleaned_class_thresh50_*.json
-> vocab: the 431 object classes (not including the background class)
-> annotations
    -> [video name]: identifier of the video
        -> duration: duration of the video
        -> segments
            -> [segment id]: segment of the video with bounding box annotations
                -> timestamps: start and end timestamps of the segment
                -> process_clss: object class of each bounding box
                -> tokens: tokenized sentence
                -> frame_ind: frame index of each bounding box
                -> process_idx: index of each object class word in the sentence
                -> process_bnd_box: coordinates of each bounding box (xtl, ytl, xbr, ybr)
                -> crowds: whether each box represents a group of objects
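
A minimal Python sketch of reading this file, assuming the per-box fields (process_clss, frame_ind, process_bnd_box, crowds) are parallel lists as the skeleton above implies; the split suffix in the file name is an assumption:
```
import json

# Load the cleaned annotation file (the exact suffix of the file name
# depends on the split you downloaded; "trainval" here is an assumption).
with open("anet_entities_cleaned_class_thresh50_trainval.json") as f:
    data = json.load(f)

vocab = data["vocab"]  # 431 object classes, background class not included

for video_name, video in data["annotations"].items():
    for segment_id, seg in video["segments"].items():
        # One entry per annotated box: class, frame index, box coordinates.
        for cls, frame, box in zip(seg["process_clss"],
                                   seg["frame_ind"],
                                   seg["process_bnd_box"]):
            xtl, ytl, xbr, ybr = box
            print(video_name, segment_id, cls, frame, (xtl, ytl, xbr, ybr))
```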
### for anet_entities_trainval.json
-> database
    -> [video name]: identifier of the video
        - rwidth: resized width of the video, always 720px
        - rheight: resized height of the video, chosen to maintain the aspect ratio
        -> segments
            -> [segment id]: segment of the video with bounding box annotations
                -> objects
                    -> [object number]: annotated object from the segment
                        -> noun_phrases: list of noun phrase (NP) annotations of the object, each with the phrase text and the index of the word in the sentence
                        - frame_ind: frame index (0-9; each video is divided evenly into 10 clips and the middle frame of each clip is sampled)
                        - xtl: x coordinate of the top-left corner of the bounding box (between 0 and 720)
                        - ytl: y coordinate of the top-left corner of the bounding box
                        - xbr: x coordinate of the bottom-right corner of the bounding box (between 0 and 720)
                        - ybr: y coordinate of the bottom-right corner of the bounding box
                        - crowds: whether the box represents a group of objects
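
A minimal Python sketch of walking this file and normalizing each box to [0, 1] by the resized frame size; the key names follow the skeleton above:
```
import json

# Load the trainval annotations and normalize box coordinates by the
# resized frame dimensions (rwidth is always 720, rheight varies).
with open("anet_entities_trainval.json") as f:
    db = json.load(f)["database"]

for video_name, video in db.items():
    rw, rh = video["rwidth"], video["rheight"]
    for segment_id, seg in video["segments"].items():
        for obj_id, obj in seg["objects"].items():
            # Normalized (xtl, ytl, xbr, ybr); frame_ind selects one of
            # the 10 uniformly sampled frames.
            box = (obj["xtl"] / rw, obj["ytl"] / rh,
                   obj["xbr"] / rw, obj["ybr"] / rh)
            print(video_name, segment_id, obj_id, obj["frame_ind"], box,
                  obj["noun_phrases"])
```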
### an example of grounding evaluation submission files
Evaluation servers for Challenge 2020:
For Sub-task I - GT Captions: https://competitions.codalab.org/competitions/24336
For Sub-task II - Generated Captions: https://competitions.codalab.org/competitions/24334
Evaluation server (general): https://competitions.codalab.org/competitions/22835
Submissions must be named submission_gt.json (for GT mode) or submission_gen.json (for gen mode) and zipped into a single submission.zip file.
See the instructions on the evaluation server for more details.
```
{
    "results": {
        "v_QOlSCBRmfWY": {
            "0": {  # segment id
                "clss": ["room", "woman", "she"],  # object classes
                "idx_in_sent": [8, 2, 12],  # index of each object word in the sentence
                "bbox_for_all_frames": [[[1,2,3,4], …, [1,2,3,4]], [[1,2,3,4], …, [1,2,3,4]], [[1,2,3,4], …, [1,2,3,4]]]  # predicted bbox on all 10 uniformly sampled frames
            }
        }
    },
    "eval_mode": "gen" | "GT",  # depending on whether the results are on generated or GT sentences
    "external_data": {
        "used": true,  # Boolean flag
        "details": "Object detector pre-trained on Visual Genome on object detection task."
    }
}
```
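
A minimal Python sketch of writing and zipping a submission in the format above; the single hard-coded prediction is a placeholder, not real model output:
```
import json
import zipfile

# Assemble one placeholder prediction in the submission format above.
submission = {
    "results": {
        "v_QOlSCBRmfWY": {
            "0": {
                "clss": ["room", "woman", "she"],
                "idx_in_sent": [8, 2, 12],
                # One box per object for each of the 10 sampled frames.
                "bbox_for_all_frames": [[[1, 2, 3, 4]] * 10 for _ in range(3)],
            }
        }
    },
    "eval_mode": "GT",  # or "gen" for generated sentences
    "external_data": {
        "used": True,
        "details": "Object detector pre-trained on Visual Genome on object detection task.",
    },
}

# GT mode expects submission_gt.json; use submission_gen.json for gen mode.
with open("submission_gt.json", "w") as f:
    json.dump(submission, f)

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("submission_gt.json")
```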