6142 - Flexible Solr schema deployment #6146
Conversation
Moving all <field> and <copyField> for custom metadata block indexing into separate files for easier deployment and maintainability. See IQSS#6142 for more.
…copy included XML files. See IQSS#6142 for more.
See IQSS#6142 for further reference.
Retrieve schema fields for custom metadata blocks from Dataverse API, mangle to create schema XML files, deploy and reload Solr. See IQSS#6142 for more.
…he update script. See IQSS#6142 for more.
…olr content. Done as requested by @pdurbin on IRC.
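Taken together, the commits describe a pipeline along these lines (a minimal sketch; the grep-based split, the default values, and the temp-file layout are illustrative assumptions, not necessarily the script as merged):

```sh
#!/bin/sh
# Sketch of the workflow described by the commits above.
DATAVERSE_URL="${DATAVERSE_URL:-http://localhost:8080}"
SOLR_URL="${SOLR_URL:-http://localhost:8983}"
TARGET="${TARGET:-/usr/local/solr/server/solr/collection1/conf}"

# 1. Retrieve the generated schema snippets from the Dataverse admin API.
curl -f -sS "${DATAVERSE_URL}/api/admin/index/solr/schema" > /tmp/schema.dump

# 2. Separate <field> and <copyField> definitions into the two include files.
grep '<field'     /tmp/schema.dump > "${TARGET}/schema_dv_cmb_fields.xml"
grep '<copyField' /tmp/schema.dump > "${TARGET}/schema_dv_cmb_copies.xml"

# 3. Ask Solr to reload the core so the new fields take effect.
curl -f -sS "${SOLR_URL}/solr/admin/cores?action=RELOAD&core=collection1"
```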
Can you please try taking the Harvard-specific custom metadata block fields out of schema_dv_cmb_copies.xml and schema_dv_cmb_fields.xml? I mean the ones in the TSV files that start with "custom" at scripts/api/data/metadatablocks.
Just to be sure: those Harvard-specific fields go, but all fields from the standard metadata blocks remain?
Yes, these fields match the ones in https://github.com/IQSS/dataverse/blob/v4.16/scripts/api/setup-datasetfields.sh, which are loaded via

``curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/citation.tsv -H "Content-type: text/tab-separated-values"``

and so on. These almost match the list at http://guides.dataverse.org/en/4.16/user/appendix.html#metadata-references (screenshot omitted). Now might be a good time to change the list in the appendix from 5 to 6, which is what this issue is about: Add note about journal metadata block in User Guides appendix #3976.

I hope this is making sense. The idea is that there are exactly 6 official metadata blocks available in a vanilla installation of Dataverse (screenshot of how they look in the GUI omitted). So to me it makes sense to have Solr agree with this list of 6 for a vanilla installation. It seems cleaner and more maintainable.
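For reference, the six official blocks could be loaded in one loop (every filename except citation.tsv is an assumption based on the repository layout; adjust as needed):

```sh
# Run from scripts/api; loads each official metadata block TSV into Dataverse.
for block in citation geospatial social_science astrophysics biomedical journals; do
  curl http://localhost:8080/api/admin/datasetfield/load -X POST \
    --data-binary "@data/metadatablocks/${block}.tsv" \
    -H "Content-type: text/tab-separated-values"
done
```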
I just pushed two commits with the requested changes. Moving this to Code Review again.
Force-pushed from 957fc2f to 07b579a.
This is looking better! Please see my comments below.
``curl http://localhost:8080/api/admin/index/solr/schema``
For convenience and automation you can use the *updateSchemaCMB.sh* script. It downloads, parses and writes the schema …
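To see what that endpoint returns before wiring it into automation, a quick manual check (the expected output noted in the comment is an assumption):

```sh
# Peek at the schema snippets Dataverse generates for the installed blocks.
curl -sS "http://localhost:8080/api/admin/index/solr/schema" | head -n 5
# expected: <field .../> and <copyField .../> lines, one per metadata field
```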
In the future (not in scope for this pull request), it might be nice to walk through a fake example of adding a metadata block with a single field. We could call this new script as part of that walkthrough.
Force-pushed from 07b579a to 811c6cb.
By default, it will download from Dataverse at `http://localhost:8080` and reload Solr at `http://localhost:8983`.
You may use the following environment variables with this script or mix'n'match with options:
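Two equivalent invocations, for illustration (SOLR_URL appears in the script itself, the other variable names are assumptions, and the -d/-s/-t options are the ones used later in this thread):

```sh
# Via environment variables:
DATAVERSE_URL=http://localhost:8080 SOLR_URL=http://localhost:8983 \
  TARGET=/usr/local/solr/server/solr/collection1/conf ./updateSchemaMDB.sh

# Via command-line options:
./updateSchemaMDB.sh -d http://localhost:8080 \
                     -s http://localhost:8983 \
                     -t /usr/local/solr/server/solr/collection1/conf
```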
I would maybe add that, if Solr and Dataverse are running on two different servers, the script must be run on the former, i.e. the Solr system, because it needs access to the local disk in order to save the schema. (Thinking about it, what's the point of the -s parameter? Shouldn't the Solr URL be hard-coded to http://localhost:8983? Because if it is a remote server, then it's likely not going to work... unless the TARGET directory is on an NFS mount or something similar, accessible to both systems. It seems easier to just say that the script must be run on the Solr server, and assume that it should be reachable at the localhost address, no?)
@landreev good point and we can let @poikilotherm tell us but I'm going to guess he's thinking ahead to someday using the Solr API to load the field rather than futzing with files on disk. That's the vision of #5989, to use the Solr APIs. @poikilotherm does use the Solr API for one operation... reloading of Solr:
echo "Triggering Solr RELOAD at ${SOLR_URL}/solr/admin/cores?action=RELOAD&core=collection1"
curl -f -sS "${SOLR_URL}/solr/admin/cores?action=RELOAD&core=collection1"
So maybe he's just trying to help guide us forward.
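As an aside, the same CoreAdmin API can confirm the core came back after a RELOAD (STATUS is a standard Solr CoreAdmin action; whether the script does this is not shown here):

```sh
# Check that collection1 is loaded and serving after the reload.
curl -f -sS "${SOLR_URL}/solr/admin/cores?action=STATUS&core=collection1"
```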
@pdurbin I saw that the script was using the API for restarting Solr. Is it actually possible to use the Solr API to modify its schema remotely? I thought it wasn't.
Regardless, there's no harm in keeping SOLR_URL configurable. But the script in its current form relies on writing the schema file, so I feel the guide should mention that it needs to be run locally.
@landreev yes, it is possible to modify Solr's schema remotely, and @pkiraly is investigating this in #5989
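For context, a remote schema change through Solr's Schema API looks roughly like this (standard Solr endpoint; the field definition itself is a made-up example):

```sh
# Add a single field to the collection1 schema over HTTP.
curl -X POST -H 'Content-type: application/json' \
  "http://localhost:8983/solr/collection1/schema" \
  --data-binary '{
    "add-field": {
      "name":        "myCustomField",
      "type":        "text_en",
      "stored":      true,
      "indexed":     true,
      "multiValued": true
    }
  }'
```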
While chatting with @kcondon just now it occurred to me that @poikilotherm may be more in need of configuring the port that Solr is running on for his script rather than the hostname. He's running Dataverse in Kubernetes so who knows what crazy stuff he's doing in there. 😄 Anyway, that might be the reason why he doesn't want to hard code a hostname and port. Just a theory.
I'll go add a line in the guide to indicate that the script must be run on the Solr server itself.
Ok, I emphasized that the script must be run on the Solr server itself in 7b2becd
Hey guys,
indeed I need hostname and port fully configurable due to usage on Kubernetes. Most likely one will provide some K8s service IP and port. (It's also good practice not to hardcode such things anyway...)
Please be aware that writing the files on K8s is most likely done by some sidecar container/job, so this most certainly will not run in the same container on K8s.
As the guides are mostly for people using the classic installation, I am totally happy with any additions you provide. Just go ahead and push it. 😄
Cheers!
@@ -55,4 +55,6 @@ curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @da
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/customCHIA.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/customDigaai.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/custom_hbgdki.tsv -H "Content-type: text/tab-separated-values"
# setup/update local solr instance with custom metadata block fields
sudo -u solr ../../conf/solr/7.3.1/updateSchemaMDB.sh -t /usr/local/solr/server/solr/collection1/conf
I'm not sure about adding this line to the script - it assumes that Solr is running locally; AND has a hard-coded path to the installation directory, AND the username under which solr is running... too many assumptions?
I would maybe replace it with just an "echo: ATTENTION! you have installed optional metadata blocks, you now have to add the extra fields to the solr schema, by running the updateSchemaMDB.sh script, like this: ... etc. etc."
This may not be super important - I don't think we distribute this script publicly; it's not part of the installer bundle, and we (here at Harvard) may be the only installation actually running it... But that said, Solr is running on a different server, both in our production environment, and on some of our test instances... so the line above isn't going to work.
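Something along those lines, perhaps (a sketch of the suggestion above, not committed code; wording and path are illustrative):

```sh
# Print instructions instead of assuming a local Solr and hard-coded paths.
echo "ATTENTION: you have installed optional metadata blocks."
echo "You now have to add the extra fields to the Solr schema by running"
echo "updateSchemaMDB.sh on the Solr server, for example:"
echo "  sudo -u solr ./updateSchemaMDB.sh -t /usr/local/solr/server/solr/collection1/conf"
```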
@poikilotherm @pdurbin [2019-09-11T17:34:20.195-0400] [glassfish 4.1] [WARNING] [] [javax.enterprise.resource.webcontainer.jsf.lifecycle] [tid: _ThreadID=30 _ThreadName=http-listener-1(5)] [timeMillis: 1568237660195] [levelValue: 900] [[ …
@kcondon is there any more to the stack trace? I don't see an exception like this (which is what I'd expect): ``Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_72_draft] unknown field 'tag'``
@pdurbin Did you see my updated comment that it currently exists in develop? Advanced search appears to work, so I will verify his PR that way. It is, however, a serious bug introduced in this release; I'm not sure where it was introduced.
@pdurbin @poikilotherm OK, so I was able to confirm, using advanced search, that all the default metadata blocks were indexed. I will open a separate ticket for the basic search issue; it will not affect this branch. I then added setup-optional-harvard.sh and it instructed me to run updateSchemaMDB.sh because I had added a custom metadata block. I ran it without any arguments first and got an error, then realized I needed to add the Solr target dir, added that, saw an error, then added all the args that applied and saw the same error: ``./updateSchemaMDB.sh -d http://localhost:8080 -s http://localhost:8983 -t /usr/local/solr/solr-7.3.0/server/solr/collection1/conf``
@pdurbin looks like we finally have someone running this on a Mac 😄 If not @kcondon, what platform did you use to run this? About the error you see: in case you are indeed on a Mac, does it need to be runnable on Mac?
@poikilotherm @pdurbin
Oh, it does! Coreutils changed between CentOS 6 and 7. IMHO CentOS 6 should not be the baseline here, as it reaches EOL in 11/2020... (I edited/extended my comment above, just in case.)
OK, I think I have an idea how to avoid csplit. Give me a few minutes; hope you guys are still in town...
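One portable approach would be to drop csplit entirely (a sketch, not necessarily what was committed): the dump only needs to be separated into ``<field>`` and ``<copyField>`` lines, and grep behaves the same on GNU (CentOS) and BSD (macOS) userlands. File names follow the ones used earlier in this thread:

```sh
# Split the schema dump without csplit; works identically on Linux and macOS.
grep '<copyField' /tmp/schema.dump > schema_dv_cmb_copies.xml
grep '<field'     /tmp/schema.dump > schema_dv_cmb_fields.xml
```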
Yes, I just saw #6161 and reproduced this bug on the phoenix server. Sad!
A big thank you to all of you involved in this PR. Happy this is merged now, and hopefully it will be useful for others, too.
Thanks, @poikilotherm!