Multiple images for an identical matched_text_index #11
Comments
Hi @fauconnier! Thanks for pointing this out. Let me check what's going on: this example is not the intended behavior. |
Taking some public notes... This is how we do the image text assignments:

    def get_image_assignments(im2txt):
        '''
        returns a list assignments of length N_images such that assignments[i] is the sentence index that image i was assigned to.
        '''
        # if there are more images than texts, not quite sure what to do...
        im_idxs_s, txt_idxs_s, sol = linear_assignment.base_solve(-im2txt)
        im2txt_idxs = {im_idxs_s[k]: txt_idxs_s[k] for k in range(len(im_idxs_s))}
        if im2txt.shape[0] > im2txt.shape[1]:
            # there are more images than sentences. we dont want to discard images. so, for unassigned images, we will put them with their corresponding max.
            for imidx in range(len(im2txt)):
                if imidx not in im2txt_idxs:
                    im2txt_idxs[imidx] = int(np.argmax(im2txt[imidx]))
        return [im2txt_idxs[idx] for idx in range(len(im2txt_idxs))]

where the base solve function is:

    def base_solve(W, max_dummy_cost_value=1000):
        '''
        Gives hungarian solve for a non-square matrix. it's roughly from:
        NOTE: this ** MINIMIZES COST **. So, if you're handing sims, make sure to negate them!
        https://github.com/jmhessel/multi-retrieval/blob/master/bipartite_utils.py
        returns i_s, j_s, cost such that:
        for i, j in zip(i_s, j_s)
        are the (i, j) row column entries selected.
        cost is sum( cost[i, j] for i, j in zip(i_s, j_s) )
        '''
        if np.sum(np.abs(W)) > max_dummy_cost_value:
            print('Warning, you values in your matrix may be too big, please raise max_dummy_cost_value')
        orig_shape = W.shape
        if orig_shape[0] != orig_shape[1]:
            if orig_shape[0] > orig_shape[1]:
                pad_idxs = [[0, 0], [0, W.shape[0]-W.shape[1]]]
                col_pad = True
            else:
                pad_idxs = [[0, W.shape[1]-W.shape[0]], [0, 0]]
                col_pad = False
            W = np.pad(W, pad_idxs, 'constant', constant_values=max_dummy_cost_value)
        sol, _, cost = lapjv(W)
        i_s = np.arange(len(sol))
        j_s = sol[i_s]
        sort_idxs = np.argsort(-W[i_s, j_s])
        i_s, j_s = map(lambda x: x[sort_idxs], [i_s, j_s])
        if orig_shape[0] != orig_shape[1]:
            if col_pad:
                valid_idxs = np.where(j_s < orig_shape[1])[0]
            else:
                valid_idxs = np.where(i_s < orig_shape[0])[0]
            i_s, j_s = i_s[valid_idxs], j_s[valid_idxs]
        # indices = np.hstack([np.expand_dims(i_s, -1), np.expand_dims(j_s, -1)]).astype(np.int32)
        m_cost = 0.0
        for i, j in zip(i_s, j_s):
            m_cost += W[i, j]
        return i_s, j_s, m_cost

When I run that similarity matrix through this code, I get what I believe to be the correct assignment:
But, this is not reflected in the |
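For readers following along, here is a minimal sketch of the fallback branch in the code quoted above: when there are more images than sentences, images left out of the one-to-one matching are assigned to their highest-similarity sentence, so two images can legitimately end up with the same sentence index. This is not a reproduction of the bug discussed in this thread. The brute-force matcher is a stand-in for the `lapjv` solver (only practical for tiny matrices), and the names `best_one_to_one`, `assign_images`, and the similarity values are made up for illustration.

```python
from itertools import permutations


def best_one_to_one(sim):
    """Brute-force stand-in for the LAP solver: maximize total similarity.

    sim is a list of rows (images) by columns (sentences). Returns a dict
    {image_idx: sentence_idx} covering min(n_images, n_sentences) images.
    """
    n_img, n_txt = len(sim), len(sim[0])
    best, best_score = {}, float("-inf")
    if n_img <= n_txt:
        # Every image gets its own distinct sentence.
        for cols in permutations(range(n_txt), n_img):
            score = sum(sim[i][j] for i, j in enumerate(cols))
            if score > best_score:
                best, best_score = dict(enumerate(cols)), score
    else:
        # Every sentence gets exactly one image; the rest stay unassigned.
        for rows in permutations(range(n_img), n_txt):
            score = sum(sim[i][j] for j, i in enumerate(rows))
            if score > best_score:
                best, best_score = {i: j for j, i in enumerate(rows)}, score
    return best


def assign_images(sim):
    """Mirror the quoted fallback: leftover images go to their argmax sentence."""
    assignment = best_one_to_one(sim)
    for i, row in enumerate(sim):
        if i not in assignment:
            assignment[i] = max(range(len(row)), key=row.__getitem__)
    return [assignment[i] for i in range(len(sim))]


# Hypothetical 3 images x 2 sentences similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
]
assignments = assign_images(sim)
print(assignments)
```

With this toy matrix, images 0 and 2 take part in the one-to-one matching and image 1 falls back to its best sentence, so images 0 and 1 share sentence 0. When there are at least as many sentences as images, the fallback never fires and every image gets a distinct sentence.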
Hi @fauconnier --- I tracked down the bug! Thanks for reporting it. We didn't notice it because it only affects a subset of documents, and the issue was hard to track down. Here's what's going on:
Thanks for helping us track this down! I can handle the updates and next steps from here --- I'll keep you posted once I've fixed everything. |
More updates: I added the script we used to compute assignments, which will now save the correct assignment: https://github.com/allenai/mmc4/blob/main/scripts/compute_assignments.py. As soon as I can, I will run this over the whole database and update the mmc4 documents in place. |
Fantastic. Thank you @jmhessel for the quick turnaround! |
Hi @fauconnier , I wrote up the fix and am running it over everything. v1.1 of the corpus will be out ASAP. here's what this doc looks like in the new version :-)
|
Hey @jmhessel, what is the ETA for this? I hope everything is running smoothly :) Thanks so much for the prompt response on this! |
Hi Alessandro --- hope you're well! Actually, the new versions are ready and passing all of my known checks. I was about to push the release, but then I saw #12, so I am checking on that now. |
Marking this as resolved by #13 |
Dear authors,
Thanks for releasing MMC4.
In the paper the following is stated:
However, we found examples where multiple images are aligned to a text span.
For instance, consider the following example in ./docs_shard_10063_v3.jsonl. Is that intended?
Thanks for any pointers.
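
A quick way to look for such cases in a shard is sketched below. It assumes each line of the `.jsonl` file is one JSON document whose images are listed under an `image_info` key, each entry carrying a `matched_text_index` field. The `image_info` key, the helper name `images_sharing_text_index`, and the sample document are assumptions for illustration, so adjust them to the actual schema.

```python
import json
from collections import defaultdict


def images_sharing_text_index(doc):
    """Return {text_index: [image positions]} for indices matched by 2+ images.

    Assumes doc["image_info"] is a list of dicts that each carry a
    "matched_text_index" field (an assumption about the schema).
    """
    groups = defaultdict(list)
    for img_pos, info in enumerate(doc.get("image_info", [])):
        groups[info["matched_text_index"]].append(img_pos)
    return {idx: imgs for idx, imgs in groups.items() if len(imgs) > 1}


# Made-up document standing in for one line of a shard file.
line = json.dumps({
    "text_list": ["first sentence", "second sentence", "third sentence"],
    "image_info": [
        {"image_name": "a.png", "matched_text_index": 1},
        {"image_name": "b.png", "matched_text_index": 0},
        {"image_name": "c.png", "matched_text_index": 1},
    ],
})
doc = json.loads(line)
duplicates = images_sharing_text_index(doc)
print(duplicates)
```

To scan a real shard, the same function can be applied to `json.loads(line)` for each line of the file, collecting the documents where the result is non-empty.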