If Solr is down when a file is deleted from a draft dataset, reindexing the draft doesn't delete the Solr document for the deleted file #2086

Closed
pdurbin opened this issue Apr 24, 2015 · 4 comments

Comments

@pdurbin
Member

pdurbin commented Apr 24, 2015

To fix this bug we might need to first work on #2038 to make the Solr field parentId searchable. We need something to search on in order to find and delete the Solr document for a file, which looks something like this:

  {
    "entityId":17,
    "identifier":"17",
    "persistentUrl":"http://dx.doi.org/10.5072/FK2/ZT7HDY",
    "dvObjectType":"files",
    "fileNameWithoutExtension":["Female-House-Finch"],
    "fileName":["Female-House-Finch",
      "Female-House-Finch.jpg"],
    "name":"Female-House-Finch.jpg",
    "nameSort":"Female-House-Finch.jpg",
    "datasetVersionId":2,
    "dateSort":"2015-04-24T13:40:16.818Z",
    "dateFriendly":"Apr 24, 2015",
    "publicationStatus":["Unpublished",
      "Draft"],
    "id":"datafile_17_draft",
    "fileTypeDisplay":"JPEG Image",
    "fileContentType":"image/jpeg",
    "fileType":["JPEG Image",
      "image"],
    "fileTypeGroupFacet":"image",
    "fileSizeInBytes":28244,
    "fileMd5":"d86a029ec165e2f1fd072cc7c2290800",
    "subtreePaths":["/2",
      "/2/3"],
    "parentId":"11",
    "parentIdentifier":"doi:10.5072/FK2/ZT7HDY",
    "parentCitation":"Finch, Fiona, 2015, \"Darwin's Finches\", http://dx.doi.org/10.5072/FK2/ZT7HDY,  Root Dataverse,  DRAFT VERSION ",
    "parentName":"Darwin's Finches",
    "_version_":1499340804812963840}

That is to say, if you try to find the Solr document based on parentId you won't find it (numFound:0):

$ curl 'http://localhost:8983/solr/collection1/select?wt=json&indent=true&q=parentId:11'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"parentId:11",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

The good news is that definitionPointDocId is already searchable so once we find the Solr document for the file, we should be able to also delete the corresponding "permission" document:

$ curl 'http://localhost:8983/solr/collection1/select?wt=json&indent=true&q=definitionPointDocId:datafile_17_draft'
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"true",
      "q":"definitionPointDocId:datafile_17_draft",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"datafile_17_draft_permission",
        "definitionPointDocId":"datafile_17_draft",
        "definitionPointDvObjectId":"17",
        "discoverableBy":["group_user2",
          "group_user2"],
        "_version_":1499340804995416064}]
  }}
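
As a rough sketch of what the two deletes could look like once the file's Solr document has been found: the snippet below only builds JSON bodies in Solr's JSON update format (delete by id for the file document, delete by query on definitionPointDocId for the permission document). The function names are made up for illustration, and this is not the code in the actual fix.

```python
import json


def file_delete_payload(file_doc_id):
    """Build a Solr JSON update body that deletes one document by id."""
    return json.dumps({"delete": {"id": file_doc_id}})


def permission_delete_payload(file_doc_id):
    """Build a Solr JSON update body that deletes the permission
    document(s) whose definitionPointDocId points at the file document.

    Assumption: definitionPointDocId is searchable, as the query above shows.
    """
    query = f"definitionPointDocId:{file_doc_id}"
    return json.dumps({"delete": {"query": query}})


# Example with the document id from this issue.
print(file_delete_payload("datafile_17_draft"))
print(permission_delete_payload("datafile_17_draft"))
```

Each body would presumably be POSTed to the core's update handler (for example http://localhost:8983/solr/collection1/update?commit=true) with a Content-Type of application/json.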

Related: #702

@pdurbin pdurbin added the Type: Bug a defect label Apr 24, 2015
pdurbin added a commit that referenced this issue Apr 24, 2015
Requires schema change and re-indexing #2038

Also show output for orphaned files in index/status API call.
@pdurbin
Member Author

pdurbin commented Apr 24, 2015

@scolapasta please take a look at a06ef62 for a proposed fix (in a branch). This fix requires a Solr schema change and re-indexing in order to do searches on parentId. The fix does not address deletion of Solr permission documents, but I noted this in a TODO.

@scolapasta scolapasta added this to the 4.0.1 milestone Apr 27, 2015
@scolapasta scolapasta modified the milestones: 4.0.1, Candidates for 4.0.1 May 8, 2015
@scolapasta scolapasta assigned pdurbin and unassigned scolapasta May 8, 2015
@pdurbin
Member Author

pdurbin commented May 11, 2015

@scolapasta I'm going to assign this to you as a reminder to review the commit I mentioned in #2086 (comment)

Since that fix requires a change to the Solr schema.xml as well as reindexing, we'll be partially addressing the "making Solr debug fields searchable" issue (#2038) so we may want to make a decision on the list of candidates in that issue as part of this milestone.

@pdurbin pdurbin assigned scolapasta and unassigned pdurbin May 11, 2015
@scolapasta scolapasta assigned pdurbin and unassigned scolapasta May 19, 2015
pdurbin added a commit that referenced this issue May 21, 2015
Requires schema change and re-indexing #2038

Also show output for orphaned files in index/status API call.
pdurbin added a commit that referenced this issue May 21, 2015
- Also fix a bug where Integer.MAX_VALUE was intended
@pdurbin
Member Author

pdurbin commented May 21, 2015

I pushed a fix that requires updating the Solr schema.xml. Updating schema.xml alone is not enough, however: the fix only works for files that were indexed after the schema change. This is because the fix identifies files that were once part of a dataset (and are now orphaned) by searching across file documents for a matching parentId. Previously parentId was not searchable; as of 96e411c, with the change to schema.xml, it is.
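
To illustrate the idea (not the actual implementation), here is a minimal sketch of the orphan check: take the file documents Solr returns for a parentId search, and keep the ones whose id does not correspond to a file still in the dataset. The function name, the datafile_<id>_draft naming (inferred from the example document above), and the second file id 18 are all assumptions for the example.

```python
def find_orphaned_doc_ids(solr_docs, current_file_ids, suffix="_draft"):
    """Return the Solr ids of file documents that no longer have a
    matching file in the dataset.

    solr_docs: documents returned by a search on parentId (list of dicts).
    current_file_ids: database ids of the files still in the dataset.
    suffix: assumed id suffix for draft documents, e.g. datafile_17_draft.
    """
    expected = {f"datafile_{file_id}{suffix}" for file_id in current_file_ids}
    return sorted(
        doc["id"]
        for doc in solr_docs
        if doc.get("dvObjectType") == "files" and doc["id"] not in expected
    )


# File 17 was deleted while Solr was down; file 18 (hypothetical) remains.
docs = [
    {"id": "datafile_17_draft", "dvObjectType": "files", "parentId": "11"},
    {"id": "datafile_18_draft", "dvObjectType": "files", "parentId": "11"},
]
print(find_orphaned_doc_ids(docs, [18]))  # ['datafile_17_draft']
```

This also shows why older documents are not covered: if parentId was not indexed when a file document was written, that document never shows up in solr_docs in the first place.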

The way I've been testing the bug:

  • create a dataset
  • upload a file
  • stop Solr
  • delete the file from the dataset
  • start Solr
  • observe that the "file" card is still there (it couldn't be deleted from Solr because Solr was down)
  • reindex the dataset with curl http://localhost:8080/api/admin/index/datasets/11 (or whatever the dataset's id is)
  • ensure that the "file" card is now gone
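
For the "observe" and "ensure" steps, checking Solr directly can be less ambiguous than looking at the cards in the UI. The sketch below assumes the same local Solr URL used in the curl examples in this issue; the helper names are hypothetical. It builds a select URL for a file document id and interprets the JSON response.

```python
import json
from urllib.parse import urlencode

# Assumption: same local Solr core as in the curl examples in this issue.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def file_card_query_url(doc_id):
    """Build a select URL that looks up a single document by its Solr id."""
    return SOLR_SELECT + "?" + urlencode({"wt": "json", "q": f"id:{doc_id}"})


def file_card_present(response_text):
    """Return True if the Solr JSON response contains at least one hit."""
    return json.loads(response_text)["response"]["numFound"] > 0


print(file_card_query_url("datafile_17_draft"))
# A response shaped like the numFound:0 output earlier in this issue:
print(file_card_present('{"response":{"numFound":0,"start":0,"docs":[]}}'))  # False
```

Before the reindex the lookup for the deleted file should report a hit; after the reindex it should not.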

Passing to QA.

@kcondon
Contributor

kcondon commented Jun 4, 2015

Works!! Closing.

@kcondon kcondon closed this as completed Jun 4, 2015