Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged images are being incorrectly included #2460

Closed
paul-butcher opened this issue Oct 6, 2023 · 17 comments
Closed

Merged images are being incorrectly included #2460

paul-butcher opened this issue Oct 6, 2023 · 17 comments
Assignees

Comments

@paul-butcher
Copy link
Contributor

paul-butcher commented Oct 6, 2023

https://wellcome.slack.com/archives/C8X9YKM5X/p1695906279857879?thread_ts=1695906112.578439&cid=C8X9YKM5X

https://wellcomecollection.org/works/kqy8b2yh is correctly showing "Contains 1 image", but incorrectly showing "View 1 image" in the Selected Images section.

Similarly, https://wellcomecollection.org/works/bvxsrk7s is correctly showing "Contains 3 images" but incorrectly showing "View 4 images".

There should be no images in the Selected Images - View X Images for either of these.

@paul-butcher paul-butcher converted this from a draft issue Oct 6, 2023
@paul-butcher paul-butcher self-assigned this Oct 6, 2023
@paul-butcher
Copy link
Contributor Author

The Sierra-based record https://wellcomecollection.org/works/kqy8b2yh merges as expected with the corresponding METS and Miro records. both h6nmh8c2 and tmc968hm redirect to kqy8b2yh.

Screenshot 2023-10-06 at 16 32 31

@paul-butcher
Copy link
Contributor Author

The same pattern is true of bvxsrk7s, (Miro: efdghac2, Mets: zy2xuu9x)

@paul-butcher
Copy link
Contributor Author

The Selected Images - "View n Images" section is driven by the content of the images property, which in bvxsrk7s is

 "images": [
        {
          "id": "cujqs9hz",
          "type": "Image"
        },
        {
          "id": "wv82wnvx",
          "type": "Image"
        },
        {
          "id": "m4k4sky2",
          "type": "Image"
        },
        {
          "id": "vzc9ptad",
          "type": "Image"
        }
      ],

Similarly, this list is one entry long in kqy8b2yh.

[
        {
          "id": "kje5zhrd",
          "type": "Image"
        }
      ]

This list is derived from the content of imageData after the merger, in both cases, the content of images in the indexed document is what I would expect based on the content of imageData.
e.g. in kqy8b2yh, this has one entry:

{
          "id": {
            "canonicalId": "kje5zhrd",
            "sourceIdentifier": {
              "identifierType": {
                "id": "mets-image"
              },
              "ontologyType": "Image",
              "value": "b12024673/FILE_0001_OBJECTS"
            },
            "otherIdentifiers": []
          },
   ... <some location and licence info for kje5zhrd>

So I suspect the problem is (in part) that the merger is incorrectly adding Images to the Works. For example when the merger encounters some images that are already in essence dealt with by being in the IIIF Manifest, or is otherwise set to be ignored.

In the case of Miro L0016704 (efdghac2)(vzc9ptad), this image is not expected to be seen at all, but is still being added as an Image in the Work.

@paul-butcher
Copy link
Contributor Author

A related issue is that images in the pipeline do not have State in the same way Works do. However, I do not think they are the same.

If we resolve 5470 before this issue, I think we'll still end up with the wrong number in the "View n Images" link, but it will link to an empty search (or at least one that returns a different number of results to n).

If we resolve this issue first, then users might still be able to find "suppressed" images with a search (e.g. q=bvxsrk7s might still return 4 records), but the link from the Work will be correct (or correctly absent)

@paul-butcher
Copy link
Contributor Author

Initially, I wondered if this might have been related to my change that fixed licence mismatches, but it doesn't look like it. The licences involved in kqy8b2yh are all the same, so that wouldn't have made a difference.

@paul-butcher
Copy link
Contributor Author

Oho! ImageDataRule is not being tested in the same manner in which it is used.

In merger, it fetches the redirectSources from the output, but in the tests, it uniformly fetches the data from the output. data is never used in real life.

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Oct 10, 2023

The only digmiro rule we test is: "it does not use Miro images when a METS image is present for a digmiro Sierra work"

Which only covers the scenario where the Sierra item has digmiro, but does not have anything to say about what the resulting Sierra target ends up looking like.

@paul-butcher
Copy link
Contributor Author

However, what is happening in kqy8b2yh is that imageData contains kje5zhrd (the METS image), which is not incorrect according to that rule. So I suspect there may be a missing rule.

@paul-butcher
Copy link
Contributor Author

When creating a merged Work from a target and sources, the same sourceImageData is used to populate the Images for redirection and the imageData on the target Work.

I think this is correct, an entry in imageDataWithSources does require that the corresponding Work has that image, in order for the redirection to actually work.

So again, this points to the rule generating it being wrong.

I fear this might connect this issue even more tightly with wellcomecollection/platform#5470, If we fix this one without being able to change the state of an image, then what happens to the existing image that currently redirects?

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Oct 12, 2023

Oh.

I was working on creating an appropriate set of tests that demonstrate this issue in isolation. It reveals an inconsistency depending on the number of items in the Sierra Work.

Given a Sierra work with 0, 1 or more Physical items, merging with a Mets and a Miro item with one item each, this table shows the counts of the items in the target work on output.

items in Sierra input items in output images in output
0 1 1
1 1 1
2 3 1

@paul-butcher
Copy link
Contributor Author

Actually, it looks like that's not a problem, when there is one item in the output, it gets two locations (the physical from Sierra and the digital from METS)

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Oct 12, 2023

I'm coming to the conclusion that this might not necessarily be a digmiro issue, but actually that the merger should not insert both imageData and items for the same source into the target.

To find other examples, I ran this query:

GET works-indexed-2023-09-29/_search
{
  "query": {
    "bool": {
      "must_not": [
        {"term": {
      "query.identifiers.value": {
        "value": "digmiro"
      }
      
    }}
      ], 
      "must":[{
    "term": {
      "query.items.locations.locationType.id": {
        "value": "iiif-presentation"
      }
    }
    },
    {"term": {
      "query.items.locations.locationType.id": {
        "value": "closed-stores"
      }
    }
    }
    ]
  }}
}

which yielded plenty of results, including a98wfchg, which "contains 1 image", and allows you to view 1 image, which eventually reaches xpzujv3x. No digmiro involved, but still two routes to reach the same picture.

There are a number of tests that explicitly check that a merged source does appear in both items and imageData, so I think a bit of a conversation about expectations is required here.

@paul-butcher
Copy link
Contributor Author

This also suggests that the two problematic records that started this whole investigation might not be demonstrating the same issue. The bvxsrk7s should not be showing the monochrome image at all.

@paul-butcher paul-butcher changed the title Merged images are not being suppressed Merged images are being incorrectly included Oct 13, 2023
@paul-butcher
Copy link
Contributor Author

paul-butcher commented Oct 13, 2023

Yes! It is two separate issues. (This also explains why I had such trouble coming up with a minimal example to test with. Every time I tried just including the bits I thought relevant, it all worked correctly!)

In bvxsrk7s, there are two digcodes. The code that checks whether it is digmiro only expects one, so it's checking whether the first digcode it finds is digmiro. In this document, it is digpaintings, so it declares that it is not a digmiro Sierra file.

Excitingly, the fix for this (not including new tests) has a Levenshtein distance of 4.

@paul-butcher
Copy link
Contributor Author

The problem evident in kqy8b2yh (and a98wfchg) is also evident in bvxsrk7s. METS images that are present in items are also present in images, and this is unwanted.

I think a bit of a discussion is required to ensure that fixing that doesn't break anything, but I don't think it will.

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Oct 17, 2023

I note that the same behaviour is present when merging into CALM.

Here is a query that returns CALM records with items from METS and images

works-indexed-2023-09-29/_search
{
  "_source": ["query.identifiers.value", "query.images.identifiers.value"], 
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "query.items.locations.locationType.id": {
              "value": "iiif-presentation"
            }
          }
        },
        {"exists": {"field": "query.collectionPath.label"}},
        {"regexp": {"query.identifiers.value": {"value": ".*/.*"} }}
      ]
    }
  }
}

https://wellcomecollection.org/works/kh8f3nzt contains two images and has 29 selected images. Oddly, this record in Elasticsearch has 42 locations (33 Digital)

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Oct 17, 2023

I suspect that CALM thing is actually a completely different, but related issue, but that is also supplemented by this one.

@paul-butcher paul-butcher moved this from In progress to Ready for review in Digital platform Oct 19, 2023
@paul-butcher paul-butcher moved this from Ready for review to Blocked in Digital platform Nov 3, 2023
@paul-butcher paul-butcher moved this from Blocked to Ready for review in Digital platform Nov 3, 2023
@paul-butcher paul-butcher moved this from Ready for review to Blocked in Digital platform Nov 3, 2023
@paul-butcher paul-butcher moved this from Blocked to Done in Digital platform Nov 3, 2023
@pollecuttn pollecuttn moved this from Done to Archive in Digital platform Nov 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant