Multiple images for an identical matched_text_index #11

fauconnier · 2023-05-11T12:24:15Z

Dear authors,

Thanks for releasing MMC4.

In the paper the following is stated:

"we use [14] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image." ,
"For documents with more images than sentences, after assigning an image to each sentence, we assign according to max similarity.".

However, we found examples where multiple images are aligned to a text span.
For instance, consider the following example in ./docs_shard_10063_v3.jsonl.

{
    "url": "http://easydesigns.biz/easydesigns-wins-3rd-consecutive-best-of-houzz-award/",
    "text_list": [
        "cherry hill, nj, january 19, 2015 \u2013 easydesigns of cherry hill, nj has been awarded \u201cbest of houzz\u201d for customer satisfaction by houzz, the leading platform for home remodeling and design.",
        "the interior design and real estate staging firm, in business since 2005, was chosen by the more than 25 million monthly unique users that comprise the houzz community from among more than 500,000 active home building, remodeling and design industry professionals.",
        "\u201ci am so happy to be selected for the 3rd consecutive year.",
        "customer satisfaction is a primary goal of my firm so i am thrilled to be recognized by such a large and prominent community\u201d, said beth secosky, owner of easydesigns."
    ],
    "image_info": [
        {
            "image_name": "202706b24ac4.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.33591771125793457,
            "face_detections": null
        },
        {
            "image_name": "a97c58871c38.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.31495240330696106,
            "face_detections": null
        },
        {
            "image_name": "ce3c9aa070ce.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 1,
            "matched_sim": 0.2770630717277527,
            "face_detections": null
        },
        {
            "image_name": "c22425c7d977.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.3448386490345001,
            "face_detections": null
        }
    ],
    "similarity_matrix": [
        [
            0.33591771125793457,
            0.2377069592475891,
            0.17204634845256805,
            0.22403109073638916
        ],
        [
            0.31495240330696106,
            0.27460938692092896,
            0.12367681413888931,
            0.17759563028812408
        ],
        [
            0.3045308589935303,
            0.2770630717277527,
            0.15680742263793945,
            0.21054978668689728
        ],
        [
            0.3448386490345001,
            0.26175469160079956,
            0.16365793347358704,
            0.237198144197464
        ]
    ]
}

Is that intended?

Thanks for any pointers.

The text was updated successfully, but these errors were encountered:

jmhessel · 2023-05-11T15:08:30Z

Hi @fauconnier ! Thanks for pointing this out. Let me check it what's going on: this example is not the intended behavior.

jmhessel · 2023-05-11T15:31:10Z

Taking some public notes... This is how we do the image text assignments:

def get_image_assignments(im2txt):
    '''                                                                                                                                                                                                    
    returns a list assignments of length N_images such that assignments[i] is the sentence index that image i was assigned to.                                                                             
    '''
    # if there are more images than texts, not quite sure what to do...                                                                                                                                    
    im_idxs_s, txt_idxs_s, sol = linear_assignment.base_solve(-im2txt)
    im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))}
    if im2txt.shape[0] > im2txt.shape[1]:
        # there are more images than sentences. we dont want to discard images. so, for unassigned images, we will put them with their corresponding max.                                                  
        for imidx in range(len(im2txt)):
            if imidx not in im2txt_idxs:
                im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx]))

    return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))]

where the base solve function is:

def base_solve(W, max_dummy_cost_value=1000):
    '''                                                                                                                                                                                                    
    Gives hungarian solve for a non-square matrix. it's roughly from:                                                                                                                                      
                                                                                                                                                                                                           
    NOTE: this ** MINIMIZES COST **. So, if you're handing sims, make sure to negate them!                                                                                                                 
                                                                                                                                                                                                           
    https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py                                                                                                                             
                                                                                                                                                                                                           
    returns i_s, j_s, cost such that:                                                                                                                                                                      
    for i, j in zip(i_s, j_s)                                                                                                                                                                              
                                                                                                                                                                                                           
    are the (i, j) row column entries selected.                                                                                                                                                            
                                                                                                                                                                                                           
    cost is sum( cost[i, j] for i, j in zip(i_s, j_s) )                                                                                                                                                    
                                                                                                                                                                                                           
    '''
    if np.sum(np.abs(W)) > max_dummy_cost_value:
        print('Warning, you values in your matrix may be too big, please raise max_dummy_cost_value')


    orig_shape = W.shape
    if orig_shape[0] != orig_shape[1]:
	if orig_shape[0] > orig_shape[1]:
            pad_idxs = [[0, 0], [0, W.shape[0]-W.shape[1]]]
            col_pad = True
        else:
            pad_idxs = [[0, W.shape[1]-W.shape[0]], [0, 0]]
            col_pad = False
        W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value)

    sol, _, cost = lapjv(W)

    i_s = np.arange(len(sol))
    j_s = sol[i_s]

    sort_idxs = np.argsort(-W[i_s, j_s])
    i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s])

    if orig_shape[0] != orig_shape[1]:
	if col_pad:
            valid_idxs = np.where(j_s < orig_shape[1])[0]
        else:
            valid_idxs = np.where(i_s < orig_shape[0])[0]
        i_s, j_s = i_s[valid_idxs], j_s[valid_idxs]

    # indices = np.hstack([np.expand_dims(i_s, -1), np.expand_dims(j_s, -1)]).astype(np.int32)                                                                                                             
    m_cost = 0.0
    for i, j in zip(i_s, j_s):
        m_cost += W[i, j]

    return i_s, j_s, m_cost

When I run that similarity matrix through this code, I get what I believe to be the correct assignment:

im txt sim
0 0 0.33591771125793457
1 1 0.27460938692092896
2 2 0.15680742263793945
3 3 0.237198144197464

But, this is not reflected in the matched_text_index field or the matched_sim field. I'll need to take a closer look at this shortly.

jmhessel · 2023-05-11T18:22:54Z

Hi @fauconnier --- I tracked down the bug! Thanks for reporting it. We didn't notice it because it only affects a subset of documents, and the issue was hard to track down. Here's what's going on:

~10% of documents have an incorrect assignment. Specifically, for shard number 10063, in the full case at most 398/4529 (8.79%) documents have an incorrect assignment, and for the mmc4 core documents in the same shard, at most 48/350 (13.71%) documents have an incorrect assignment.
The bug is caused by a one-liner annoyance; when we were writing to the matched_text_index field, we used == (the equality test operator) instead of =, the assignment operator. Looking through the code, the only fields that should be impacted are matched_text_index and matched_sim.
The rest of the data is passing all of my checks. In particular, I re-computed the similarity matrices for the shard you sent, and the versions in the documents are within numerical precision identical to the recomputed versions. This means that the correct assignments can be recomputed easily.

Thanks for helping us track this down! I can handle the updates and next steps from here --- I'll keep you posted once I've fixed everything.

jmhessel · 2023-05-11T19:32:09Z

More updates:

I added the script we used to compute assignments, which will now save the correct assignment.

https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py

as soon as I can, I will run this over the whole database and update the mmc4 documents in-place.

fauconnier · 2023-05-11T20:16:18Z

Fantastic. Thank you @jmhessel for the quick turnaround!

jmhessel · 2023-05-12T19:10:54Z

Hi @fauconnier , I wrote up the fix and am running it over everything. v1.1 of the corpus will be out ASAP. here's what this doc looks like in the new version :-)

{'image_info': [{'face_detections': None,
                 'image_name': '202706b24ac4.png',
                 'matched_sim': 0.33591771125793457,
                 'matched_text_index': 0,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'},
                {'face_detections': None,
                 'image_name': 'a97c58871c38.png',
                 'matched_sim': 0.27460938692092896,
                 'matched_text_index': 1,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'},
                {'face_detections': None,
                 'image_name': 'ce3c9aa070ce.png',
                 'matched_sim': 0.15680742263793945,
                 'matched_text_index': 2,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'},
                {'face_detections': None,
                 'image_name': 'c22425c7d977.png',
                 'matched_sim': 0.237198144197464,
                 'matched_text_index': 3,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'}],

aleSuglia · 2023-05-15T19:53:44Z

Hey @jmhessel, what is the ETA for this? I hope everything is running smoothly :)

Thanks so much for the prompt response on this!

jmhessel · 2023-05-15T19:55:33Z

Hi Alessandro --- hope you're well! actually, the new versions are ready and passing all of my known checks. I was about to push the release, but then I saw #12 , so I am checking on that now

jmhessel · 2023-05-16T19:03:59Z

Marking this as resolved by #13

fauconnier changed the title ~~Multiple image for the same sentence~~ Multiple images for an identical matched_text_index May 11, 2023

jmhessel added the bug Something isn't working label May 11, 2023

jmhessel mentioned this issue May 11, 2023

Duplicates and multiple versions of samples #10

Closed

jmhessel closed this as completed May 16, 2023

VegB mentioned this issue Jul 16, 2023

The data performance is inconsistent with the paper #15

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple images for an identical matched_text_index #11

Multiple images for an identical matched_text_index #11

fauconnier commented May 11, 2023 •

edited

Loading

jmhessel commented May 11, 2023

jmhessel commented May 11, 2023

jmhessel commented May 11, 2023 •

edited

Loading

jmhessel commented May 11, 2023 •

edited

Loading

fauconnier commented May 11, 2023

jmhessel commented May 12, 2023

aleSuglia commented May 15, 2023 •

edited

Loading

jmhessel commented May 15, 2023 •

edited

Loading

jmhessel commented May 16, 2023

Multiple images for an identical matched_text_index #11

Multiple images for an identical matched_text_index #11

Comments

fauconnier commented May 11, 2023 • edited Loading

jmhessel commented May 11, 2023

jmhessel commented May 11, 2023

jmhessel commented May 11, 2023 • edited Loading

jmhessel commented May 11, 2023 • edited Loading

fauconnier commented May 11, 2023

jmhessel commented May 12, 2023

aleSuglia commented May 15, 2023 • edited Loading

jmhessel commented May 15, 2023 • edited Loading

jmhessel commented May 16, 2023

fauconnier commented May 11, 2023 •

edited

Loading

jmhessel commented May 11, 2023 •

edited

Loading

jmhessel commented May 11, 2023 •

edited

Loading

aleSuglia commented May 15, 2023 •

edited

Loading

jmhessel commented May 15, 2023 •

edited

Loading