
6142 - Flexible Solr schema deployment #6146

Merged: 25 commits merged into IQSS:develop from poikilotherm:6142-flex-solr-schema on Sep 12, 2019

Conversation


@poikilotherm poikilotherm commented Sep 6, 2019

Moves all <field> and <copyField> definitions for custom metadata block indexing into separate files, for easier deployment and maintainability. Retrieves the schema fields for custom metadata blocks from the Dataverse API, transforms them into schema XML files, deploys them, and reloads Solr. See IQSS#6142 for more.
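
For orientation, the manual version of the workflow this PR automates looks roughly like the following. This is a sketch only: the endpoint, file names, and reload call are the ones discussed later in this thread, and the target paths will vary per installation.

# 1. Ask Dataverse for the generated <field>/<copyField> definitions of all metadata blocks
curl http://localhost:8080/api/admin/index/solr/schema
# 2. Save the <field> definitions into schema_dv_cmb_fields.xml and the <copyField>
#    definitions into schema_dv_cmb_copies.xml in Solr's collection1/conf directory
# 3. Tell Solr to pick up the changed schema
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"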

coveralls commented Sep 6, 2019


Coverage decreased (-0.001%) to 19.487% when pulling a5cb7e8 on poikilotherm:6142-flex-solr-schema into 09fe94b on IQSS:develop.

@pdurbin pdurbin self-assigned this Sep 9, 2019
@pdurbin pdurbin left a comment

Can you please try taking the Harvard-specific custom metadata block fields out of schema_dv_cmb_copies.xml and schema_dv_cmb_fields.xml? I mean the ones in the TSV files that start with "custom" at scripts/api/data/metadatablocks.

@pdurbin pdurbin removed their assignment Sep 9, 2019
@poikilotherm
Contributor Author

Just to be sure:
I shall remove all fields from

scripts/api/data/metadatablocks/customARCS.tsv
scripts/api/data/metadatablocks/customCHIA.tsv
scripts/api/data/metadatablocks/customDigaai.tsv
scripts/api/data/metadatablocks/customGSD.tsv
scripts/api/data/metadatablocks/custom_hbgdki.tsv
scripts/api/data/metadatablocks/customMRA.tsv
scripts/api/data/metadatablocks/customPSI.tsv
scripts/api/data/metadatablocks/customPSRI.tsv

but keep all fields from

scripts/api/data/metadatablocks/astrophysics.tsv
scripts/api/data/metadatablocks/biomedical.tsv
scripts/api/data/metadatablocks/citation.tsv
scripts/api/data/metadatablocks/geospatial.tsv
scripts/api/data/metadatablocks/journals.tsv
scripts/api/data/metadatablocks/social_science.tsv

?


pdurbin commented Sep 9, 2019

scripts/api/data/metadatablocks/astrophysics.tsv
scripts/api/data/metadatablocks/biomedical.tsv
scripts/api/data/metadatablocks/citation.tsv
scripts/api/data/metadatablocks/geospatial.tsv
scripts/api/data/metadatablocks/journals.tsv
scripts/api/data/metadatablocks/social_science.tsv

Yes, these fields match the ones in https://github.com/IQSS/dataverse/blob/v4.16/scripts/api/setup-datasetfields.sh which are

curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/citation.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/geospatial.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/social_science.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/astrophysics.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/biomedical.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/journals.tsv -H "Content-type: text/tab-separated-values"

And these almost match the list at http://guides.dataverse.org/en/4.16/user/appendix.html#metadata-references

Here's a screenshot:

[Screenshot: Screen Shot 2019-09-09 at 7 46 29 AM]

Now might be a good time to change the list above in the appendix from 5 to 6, which is what this issue is about: Add note about journal metadata block in User Guides appendix #3976

I hope this is making sense. The idea is that there are exactly 6 official metadata blocks that are available in a vanilla installation of Dataverse. Here's how they look in the GUI:

[Screenshot: Screen Shot 2019-09-09 at 7 49 24 AM]

So to me it makes sense to have Solr agree with this list of 6 for a vanilla installation. It seems cleaner and more maintainable.

poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Sep 9, 2019
poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Sep 9, 2019
@poikilotherm
Contributor Author

I just pushed two commits with the requested changes. Moving this to Code Review again.

poikilotherm added a commit to poikilotherm/dataverse that referenced this pull request Sep 9, 2019
@pdurbin pdurbin assigned pdurbin and unassigned poikilotherm Sep 9, 2019
@pdurbin pdurbin left a comment

This is looking better! Please see my comments below.


``curl http://localhost:8080/api/admin/index/solr/schema``

For convenience and automation you can use the *updateSchemaCMB.sh* script. It downloads, parses and writes the schema
Member

In the future (not in scope for this pull request), it might be nice to walk through a fake example of adding a metadata block with a single field. We could call this new script as part of that walkthrough.

@pdurbin pdurbin removed their assignment Sep 9, 2019
@kcondon kcondon self-assigned this Sep 11, 2019

By default, it will download from Dataverse at `http://localhost:8080` and reload Solr at `http://localhost:8983`.
You may use the following environment variables with this script or mix'n'match with options:

Contributor

I would maybe add that, if Solr and Dataverse are running on two different servers, the script must be run on the former, i.e. the Solr system, because it needs access to the local disk in order to save the schema. (Thinking about it, what's the point of the -s parameter? Shouldn't the Solr URL be hard-coded to http://localhost:8983? Because if it is a remote server, then it's likely not going to work... unless the TARGET directory is on an NFS mount, or something similar, accessible to both systems. It seems easier to just say that the script must be run on the Solr server, and assume that Solr is reachable at the localhost address, no?)

Member

@landreev good point and we can let @poikilotherm tell us but I'm going to guess he's thinking ahead to someday using the Solr API to load the field rather than futzing with files on disk. That's the vision of #5989, to use the Solr APIs. @poikilotherm does use the Solr API for one operation... reloading of Solr:

echo "Triggering Solr RELOAD at ${SOLR_URL}/solr/admin/cores?action=RELOAD&core=collection1"
curl -f -sS "${SOLR_URL}/solr/admin/cores?action=RELOAD&core=collection1"

So maybe he's just trying to help guide us forward.

Contributor

@pdurbin I saw that the script was using the API for restarting Solr. Is it actually possible to use Solr API to modify its schema remotely? - I thought it wasn't.
Regardless, there's no harm in keeping SOLR_URL configurable. But the script in its current form relies on writing the schema file, so I feel the guide should mention that it needs to be run locally.

Member

@landreev yes, it is possible to modify Solr's schema remotely, and @pkiraly is investigating this in #5989.

While chatting with @kcondon just now it occurred to me that @poikilotherm may be more in need of configuring the port that Solr is running on for his script rather than the hostname. He's running Dataverse in Kubernetes so who knows what crazy stuff he's doing in there. 😄 Anyway, that might be the reason why he doesn't want to hard code a hostname and port. Just a theory.

I'll go add a line in the guide to indicate that the script must be run on the Solr server itself.
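
(For reference, the Solr Schema API mentioned here takes JSON commands over HTTP. A minimal, hypothetical example of adding a single field remotely would look something like the following; the field name and type are placeholders, not anything defined in this PR:)

curl -X POST -H 'Content-type: application/json' \
  --data-binary '{"add-field": {"name": "myCustomField", "type": "text_general", "stored": true, "indexed": true}}' \
  "http://localhost:8983/solr/collection1/schema"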

Member

Ok, I emphasized that the script must be run on the Solr server itself in 7b2becd

Contributor Author

Hey guys,

Indeed, I need the hostname and port to be fully configurable because of the Kubernetes usage. Most likely one will provide some K8s service IP and port. (It's also good practice not to hardcode such things anyway...)

Please be aware that writing the files on K8s is most likely done by some sidecar container/job, so this most certainly will not run in the same container on K8s.

As the guides are mostly for people using the classic installation I am totally happy with any additions you provide. Just go ahead and push it. 😄

Cheers!
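
(To illustrate the point: with configurable endpoints, a Kubernetes job or sidecar could call the script against service addresses rather than localhost. The option flags below are the ones used elsewhere in this thread; the service hostnames and target path are hypothetical:)

./updateSchemaMDB.sh -d http://dataverse-service:8080 -s http://solr-service:8983 -t /target/conf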

@@ -55,4 +55,6 @@ curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @da
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/customCHIA.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/customDigaai.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/custom_hbgdki.tsv -H "Content-type: text/tab-separated-values"
# setup/update local solr instance with custom metadata block fields
sudo -u solr ../../conf/solr/7.3.1/updateSchemaMDB.sh -t /usr/local/solr/server/solr/collection1/conf
Contributor

I'm not sure about adding this line to the script - it assumes that Solr is running locally; AND has a hard-coded path to the installation directory, AND the username under which solr is running... too many assumptions?
I would maybe replace it with just an "echo: ATTENTION! you have installed optional metadata blocks, you now have to add the extra fields to the solr schema, by running the updateSchemaMDB.sh script, like this: ... etc. etc."

This may not be super important - I don't think we distribute this script publicly; it's not part of the installer bundle, and we (here at Harvard) may be the only installation actually running it... But that said, Solr is running on a different server, both in our production environment, and on some of our test instances... so the line above isn't going to work.
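
(A minimal sketch of the suggested replacement in setup-optional-harvard.sh; the wording here is illustrative of the suggestion above, not the actual commit:)

echo "ATTENTION: you have installed optional custom metadata blocks."
echo "You now have to add the extra fields to the Solr schema by running updateSchemaMDB.sh on the Solr server, e.g.:"
echo "  sudo -u solr ./updateSchemaMDB.sh -t /usr/local/solr/server/solr/collection1/conf"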

Member

@landreev good point. In aebc850 I switched it to an echo instead of actually running the script.


kcondon commented Sep 11, 2019

@poikilotherm @pdurbin
Update: This does not appear to be a problem with your branch. It somehow made it into develop.
I've installed a new Dataverse instance, copied over the new schema.xml and the new schema mdb files, and restarted Solr. The app loads. I created a dataset using at least one field from all default blocks. I was able to publish and therefore index the fields. When I performed a basic search from the root dataverse, I got this stack trace:

[2019-09-11T17:34:20.195-0400] [glassfish 4.1] [WARNING] [] [javax.enterprise.resource.webcontainer.jsf.lifecycle] [tid: _ThreadID=30 _ThreadName=http-listener-1(5)] [timeMillis: 1568237660195] [levelValue: 900] [[
#{SearchIncludeFragment.searchRedirect(dataverseRedirectPage)}: java.lang.NullPointerException
javax.faces.FacesException: #{SearchIncludeFragment.searchRedirect(dataverseRedirectPage)}: java.lang.NullPointerException
at com.sun.faces.application.ActionListenerImpl.processAction(ActionListenerImpl.java:118)
at javax.faces.component.UICommand.broadcast(UICommand.java:315)
at javax.faces.component.UIViewRoot.broadcastEvents(UIViewRoot.java:790)
at javax.faces.component.UIViewRoot.processApplication(UIViewRoot.java:1282)
at com.sun.faces.lifecycle.InvokeApplicationPhase.execute(InvokeApplicationPhase.java:81)
at com.sun.faces.lifecycle.Phase.doPhase(Phase.java:101)
at com.sun.faces.lifecycle.LifecycleImpl.execute(LifecycleImpl.java:198)
at javax.faces.webapp.FacesServlet.service(FacesServlet.java:646)
at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1682)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:344)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:226)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at org.apache.catalina.core.ApplicationDispatcher.doInvoke(ApplicationDispatcher.java:873)
at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:739)
at org.apache.catalina.core.ApplicationDispatcher.processRequest(ApplicationDispatcher.java:575)
at org.apache.catalina.core.ApplicationDispatcher.doDispatch(ApplicationDispatcher.java:546)
at org.apache.catalina.core.ApplicationDispatcher.dispatch(ApplicationDispatcher.java:428)
at org.apache.catalina.core.ApplicationDispatcher.forward(ApplicationDispatcher.java:378)
at org.ocpsoft.rewrite.servlet.impl.HttpRewriteResultHandler.handleResult(HttpRewriteResultHandler.java:42)
at org.ocpsoft.rewrite.servlet.RewriteFilter.rewrite(RewriteFilter.java:297)
at org.ocpsoft.rewrite.servlet.RewriteFilter.doFilter(RewriteFilter.java:198)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:316)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:160)
at org.apache.catalina.core.StandardPipeline.doInvoke(StandardPipeline.java:734)
at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:673)
at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:99)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:174)
at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:415)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:282)
at com.sun.enterprise.v3.services.impl.ContainerMapper$HttpHandlerCallable.call(ContainerMapper.java:459)
at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:167)
at org.glassfish.grizzly.http.server.HttpHandler.runService(HttpHandler.java:201)
at org.glassfish.grizzly.http.server.HttpHandler.doHandle(HttpHandler.java:175)
at org.glassfish.grizzly.http.server.HttpServerFilter.handleRead(HttpServerFilter.java:235)
at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:284)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:201)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:133)
at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:112)
at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:561)
at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:117)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:56)
at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:137)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:565)
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.faces.el.EvaluationException: java.lang.NullPointerException
at javax.faces.component.MethodBindingMethodExpressionAdapter.invoke(MethodBindingMethodExpressionAdapter.java:101)
at com.sun.faces.application.ActionListenerImpl.processAction(ActionListenerImpl.java:102)
... 51 more
Caused by: java.lang.NullPointerException
at edu.harvard.iq.dataverse.search.SearchIncludeFragment.searchRedirect(SearchIncludeFragment.java:211)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.el.ELUtil.invokeMethod(ELUtil.java:332)
at javax.el.BeanELResolver.invoke(BeanELResolver.java:537)
at javax.el.CompositeELResolver.invoke(CompositeELResolver.java:256)
at com.sun.el.parser.AstValue.invoke(AstValue.java:283)
at com.sun.el.MethodExpressionImpl.invoke(MethodExpressionImpl.java:304)
at org.jboss.weld.util.el.ForwardingMethodExpression.invoke(ForwardingMethodExpression.java:40)
at org.jboss.weld.el.WeldMethodExpression.invoke(WeldMethodExpression.java:50)
at com.sun.faces.facelets.el.TagMethodExpression.invoke(TagMethodExpression.java:105)
at javax.faces.component.MethodBindingMethodExpressionAdapter.invoke(MethodBindingMethodExpressionAdapter.java:87)
... 52 more
]]


pdurbin commented Sep 11, 2019

@kcondon is there any more to the stack trace? I don't see an exception like this (which is what I'd expect): Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_72_draft] unknown field 'tag'


kcondon commented Sep 11, 2019

@pdurbin Did you see my updated comment that it currently exists in develop? Advanced search appears to work, so I will verify his PR that way. It is, however, a serious bug introduced in this release. I'm not sure where it was introduced.


kcondon commented Sep 11, 2019

@pdurbin @poikilotherm OK, so I was able to confirm, using advanced search, all the default metadata blocks were indexed. I will open a separate ticket for the basic search issue and it will not affect this branch.

I then ran setup-optional-harvard.sh, and it instructed me to run updateSchemaMDB.sh because I had added a custom metadata block. I ran it without any arguments first and got an error; then I realized I needed to add the Solr target dir, added that, and saw an error; then I added all the args that applied and saw the same error:

./updateSchemaMDB.sh -d http://localhost:8080 -s http://localhost:8983 -t /usr/local/solr/solr-7.3.0/server/solr/collection1/conf
Retrieve schema data from http://localhost:8080/api/admin/index/solr/schema
Splitting up based on "---" marker
csplit: unrecognized option '--suppress-matched'
Try `csplit --help' for more information.


poikilotherm commented Sep 11, 2019

@pdurbin looks like we finally have someone running this on Mac 😄 If not, @kcondon, what platform did you use to run this?

About the error you see: csplit is part of coreutils and might differ on your platform when not running on Linux. Actually it is quite hard to split files without using Perl or similar, and I did not want to rely on that.

In case you are indeed on Mac: does it need to be runnable on Mac? docker run might be an easy option then...


kcondon commented Sep 11, 2019

@poikilotherm @pdurbin
CentOS 6 but it should not matter, right?


poikilotherm commented Sep 11, 2019

Oh, it does! Coreutils changed between CentOS 6 and 7. IMHO CentOS 6 should not be the baseline here, as it reaches EOL in 11/2020...

(I edited/extended my comment above, just in case...)

@poikilotherm
Contributor Author

OK I think I have an idea how to avoid csplit. Give me a few minutes, hope you guys are still in town...
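
(One portable way to split a file on the "---" marker without relying on csplit's --suppress-matched option is a small awk loop, sketched here. The input and output file names are placeholders, and this is not necessarily what ended up in the branch:)

awk 'BEGIN { out = "schema_part0.xml" } /^---$/ { n++; out = "schema_part" n ".xml"; next } { print > out }' schema-from-dataverse.txt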

@poikilotherm
Contributor Author

@kcondon, @pdurbin: could you take another look / give it another try? Should be usable on all platforms now.


pdurbin commented Sep 12, 2019

> Did you see my updated comment that it currently exists in develop?

Yes, I just saw #6161 and reproduced this bug on the phoenix server. Sad!

@poikilotherm
Contributor Author

@kcondon and @pdurbin: any chance this might get merged today? 😄

@kcondon kcondon merged commit 9399881 into IQSS:develop Sep 12, 2019
@poikilotherm
Contributor Author

A big thank you to all of you guys involved in this PR. Happy this is merged now and hopefully it will be useful for others, too.

@poikilotherm poikilotherm deleted the 6142-flex-solr-schema branch September 12, 2019 21:46

pdurbin commented Sep 12, 2019

Thanks, @poikilotherm !
