Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple images for an identical matched_text_index #11

Closed
fauconnier opened this issue May 11, 2023 · 9 comments
Closed

Multiple images for an identical matched_text_index #11

fauconnier opened this issue May 11, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@fauconnier
Copy link

fauconnier commented May 11, 2023

Dear authors,

Thanks for releasing MMC4.

In the paper the following is stated:

"we use [14] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image." ,
"For documents with more images than sentences, after assigning an image to each sentence, we assign according to max similarity.".

However, we found examples where multiple images are aligned to a text span.
For instance, consider the following example in ./docs_shard_10063_v3.jsonl.

{
    "url": "http://easydesigns.biz/easydesigns-wins-3rd-consecutive-best-of-houzz-award/",
    "text_list": [
        "cherry hill, nj, january 19, 2015 \u2013 easydesigns of cherry hill, nj has been awarded \u201cbest of houzz\u201d for customer satisfaction by houzz, the leading platform for home remodeling and design.",
        "the interior design and real estate staging firm, in business since 2005, was chosen by the more than 25 million monthly unique users that comprise the houzz community from among more than 500,000 active home building, remodeling and design industry professionals.",
        "\u201ci am so happy to be selected for the 3rd consecutive year.",
        "customer satisfaction is a primary goal of my firm so i am thrilled to be recognized by such a large and prominent community\u201d, said beth secosky, owner of easydesigns."
    ],
    "image_info": [
        {
            "image_name": "202706b24ac4.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.33591771125793457,
            "face_detections": null
        },
        {
            "image_name": "a97c58871c38.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.31495240330696106,
            "face_detections": null
        },
        {
            "image_name": "ce3c9aa070ce.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 1,
            "matched_sim": 0.2770630717277527,
            "face_detections": null
        },
        {
            "image_name": "c22425c7d977.png",
            "raw_url": "https://st.hzcdn.com/static/[email protected]",
            "matched_text_index": 0,
            "matched_sim": 0.3448386490345001,
            "face_detections": null
        }
    ],
    "similarity_matrix": [
        [
            0.33591771125793457,
            0.2377069592475891,
            0.17204634845256805,
            0.22403109073638916
        ],
        [
            0.31495240330696106,
            0.27460938692092896,
            0.12367681413888931,
            0.17759563028812408
        ],
        [
            0.3045308589935303,
            0.2770630717277527,
            0.15680742263793945,
            0.21054978668689728
        ],
        [
            0.3448386490345001,
            0.26175469160079956,
            0.16365793347358704,
            0.237198144197464
        ]
    ]
}

Is that intended?

Thanks for any pointers.

@fauconnier fauconnier changed the title Multiple image for the same sentence Multiple images for an identical matched_text_index May 11, 2023
@jmhessel
Copy link
Contributor

Hi @fauconnier ! Thanks for pointing this out. Let me check it what's going on: this example is not the intended behavior.

@jmhessel
Copy link
Contributor

Taking some public notes... This is how we do the image text assignments:

def get_image_assignments(im2txt):
    '''                                                                                                                                                                                                    
    returns a list assignments of length N_images such that assignments[i] is the sentence index that image i was assigned to.                                                                             
    '''
    # if there are more images than texts, not quite sure what to do...                                                                                                                                    
    im_idxs_s, txt_idxs_s, sol = linear_assignment.base_solve(-im2txt)
    im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))}
    if im2txt.shape[0] > im2txt.shape[1]:
        # there are more images than sentences. we dont want to discard images. so, for unassigned images, we will put them with their corresponding max.                                                  
        for imidx in range(len(im2txt)):
            if imidx not in im2txt_idxs:
                im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx]))

    return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))]

where the base solve function is:

def base_solve(W, max_dummy_cost_value=1000):
    '''                                                                                                                                                                                                    
    Gives hungarian solve for a non-square matrix. it's roughly from:                                                                                                                                      
                                                                                                                                                                                                           
    NOTE: this ** MINIMIZES COST **. So, if you're handing sims, make sure to negate them!                                                                                                                 
                                                                                                                                                                                                           
    https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py                                                                                                                             
                                                                                                                                                                                                           
    returns i_s, j_s, cost such that:                                                                                                                                                                      
    for i, j in zip(i_s, j_s)                                                                                                                                                                              
                                                                                                                                                                                                           
    are the (i, j) row column entries selected.                                                                                                                                                            
                                                                                                                                                                                                           
    cost is sum( cost[i, j] for i, j in zip(i_s, j_s) )                                                                                                                                                    
                                                                                                                                                                                                           
    '''
    if np.sum(np.abs(W)) > max_dummy_cost_value:
        print('Warning, you values in your matrix may be too big, please raise max_dummy_cost_value')


    orig_shape = W.shape
    if orig_shape[0] != orig_shape[1]:
	if orig_shape[0] > orig_shape[1]:
            pad_idxs = [[0, 0], [0, W.shape[0]-W.shape[1]]]
            col_pad = True
        else:
            pad_idxs = [[0, W.shape[1]-W.shape[0]], [0, 0]]
            col_pad = False
        W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value)

    sol, _, cost = lapjv(W)

    i_s = np.arange(len(sol))
    j_s = sol[i_s]

    sort_idxs = np.argsort(-W[i_s, j_s])
    i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s])

    if orig_shape[0] != orig_shape[1]:
	if col_pad:
            valid_idxs = np.where(j_s < orig_shape[1])[0]
        else:
            valid_idxs = np.where(i_s < orig_shape[0])[0]
        i_s, j_s = i_s[valid_idxs], j_s[valid_idxs]

    # indices = np.hstack([np.expand_dims(i_s, -1), np.expand_dims(j_s, -1)]).astype(np.int32)                                                                                                             
    m_cost = 0.0
    for i, j in zip(i_s, j_s):
        m_cost += W[i, j]

    return i_s, j_s, m_cost

When I run that similarity matrix through this code, I get what I believe to be the correct assignment:

im txt sim
0 0 0.33591771125793457
1 1 0.27460938692092896
2 2 0.15680742263793945
3 3 0.237198144197464

But, this is not reflected in the matched_text_index field or the matched_sim field. I'll need to take a closer look at this shortly.

@jmhessel jmhessel added the bug Something isn't working label May 11, 2023
@jmhessel
Copy link
Contributor

jmhessel commented May 11, 2023

Hi @fauconnier --- I tracked down the bug! Thanks for reporting it. We didn't notice it because it only affects a subset of documents, and the issue was hard to track down. Here's what's going on:

  • ~10% of documents have an incorrect assignment. Specifically, for shard number 10063, in the full case at most 398/4529 (8.79%) documents have an incorrect assignment, and for the mmc4 core documents in the same shard, at most 48/350 (13.71%) documents have an incorrect assignment.
  • The bug is caused by a one-liner annoyance; when we were writing to the matched_text_index field, we used == (the equality test operator) instead of =, the assignment operator. Looking through the code, the only fields that should be impacted are matched_text_index and matched_sim.
  • The rest of the data is passing all of my checks. In particular, I re-computed the similarity matrices for the shard you sent, and the versions in the documents are within numerical precision identical to the recomputed versions. This means that the correct assignments can be recomputed easily.

Thanks for helping us track this down! I can handle the updates and next steps from here --- I'll keep you posted once I've fixed everything.

@jmhessel
Copy link
Contributor

jmhessel commented May 11, 2023

More updates:

I added the script we used to compute assignments, which will now save the correct assignment.

https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py

as soon as I can, I will run this over the whole database and update the mmc4 documents in-place.

@fauconnier
Copy link
Author

Fantastic. Thank you @jmhessel for the quick turnaround!

@jmhessel
Copy link
Contributor

Hi @fauconnier , I wrote up the fix and am running it over everything. v1.1 of the corpus will be out ASAP. here's what this doc looks like in the new version :-)

{'image_info': [{'face_detections': None,
                 'image_name': '202706b24ac4.png',
                 'matched_sim': 0.33591771125793457,
                 'matched_text_index': 0,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'},
                {'face_detections': None,
                 'image_name': 'a97c58871c38.png',
                 'matched_sim': 0.27460938692092896,
                 'matched_text_index': 1,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'},
                {'face_detections': None,
                 'image_name': 'ce3c9aa070ce.png',
                 'matched_sim': 0.15680742263793945,
                 'matched_text_index': 2,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'},
                {'face_detections': None,
                 'image_name': 'c22425c7d977.png',
                 'matched_sim': 0.237198144197464,
                 'matched_text_index': 3,
                 'raw_url': 'https://st.hzcdn.com/static/[email protected]'}],

@aleSuglia
Copy link

aleSuglia commented May 15, 2023

Hey @jmhessel, what is the ETA for this? I hope everything is running smoothly :)

Thanks so much for the prompt response on this!

@jmhessel
Copy link
Contributor

jmhessel commented May 15, 2023

Hi Alessandro --- hope you're well! actually, the new versions are ready and passing all of my known checks. I was about to push the release, but then I saw #12 , so I am checking on that now

@jmhessel
Copy link
Contributor

Marking this as resolved by #13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants